This is the mail archive of the binutils@sourceware.org mailing list for the binutils project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section


(Re-sending as a plain text without a text/html.)

Hi Sri,

Windows DLLs have the same problem and already solves it in their own
way, so let me share my knowledge about how DLLs work. It might be
useful for you.

On Windows, each DLL is linked against a fixed base address and
usually loaded to that address. However, if there's already another
DLL that overlaps in the address space, the Windows loader has to
relocate it. To do that, Windows DLLs contain ".reloc" sections which
contain offsets that need to be fixed up at runtime. If the Windows
loader find that a DLL cannot be loaded to its desired base address,
it loads it to somewhere else, and add <actual base address> -
<desired base address> to each offset that is specified by the .reloc
section.

In ELF terms, .reloc sections contains arrays of relocation offsets.
All these relocations in the section are implicitly R_X86_64_RELATIVE,
and addends are read from section contents (so it is REL as opposed to
RELA).

This already reduce the size of relocations to 1/3, but Windows does
more to reduce .reloc size (probably because it was invented for late
'80s or '90s PCs.) Offsets in .reloc are grouped by page where page
size is wordsize minus 16 bits, and offsets sharing the same page
address are stored consecutively to represent them with less space.
This is a very similar technique as the page table which is grouped by
(multiple stages of) pages.

For example, let's say we have 0x00030, 0x00500, 0x01000, 0x01100,
0x20004, and 0x20008 in a .reloc section. In the section, they are
represented like this:

  0x00000  -- takes 6 bytes
    0x0030 -- each takes 2 bytes
    0x0500
    0x1000
    0x1100
 0x20000 -- takes 6 bytes
    0x0004 -- each takes 2 bytes
    0x0008

In total, the .reloc section is 24 bytes. If you store each address as
8 byte integer, it takes 48 bytes, so this technique indeed reduces
the size of the .reloc section.

On Mon, Apr 24, 2017 at 6:31 PM, Peter Collingbourne <pcc@google.com> wrote:
> Hi Sriram,
>
> Thanks for sending this out. This is something we've been thinking
> about implementing in LLD and it would be good to get agreement on the
> appropriate representation.
>
> It may be appropriate to move this discussion to the generic-abi or
> gnu-gabi mailing lists, so that other parties are aware of your
> proposal.
>
> On Mon, Apr 24, 2017 at 5:22 PM, Sriraman Tallam via binutils
> <binutils@sourceware.org> wrote:
>>
>> We identified a problem with PIE executables, more than 5% code size
>> bloat compared to non-PIE and we have a few proposals to reduce the
>> bloat.  Please take a look and let us know what you think.
>>
>> * What is the problem?
>>
>> PIE is a security hardening feature that enables ASLR (Address Space
>> Layout Randomization) and enables the executable to be loaded at a
>> random virtual address upon every execution instance.
>>
>> On an average, a binary when built as PIE is larger by 5% to 9%, as
>> measured on a suite of benchmarks used at Google where the average
>> text size is ~100MB, when compared to the one built without PIE.  This
>> is also independent of the target architecture and we found this to be
>> true for x86_64, arm64 and power.  We noticed that the primary reason
>> for this code size bloat is due to the extra dynamic relocations that
>> are generated in order to make the binary position independent.  This
>> proposal introduces new ways to represent these dynamic relocations
>> that can reduce the code size bloat to just a few percent.
>>
>> As an example,  to show the bloat in code size, here is the data from
>> one of our larger  binaries,
>>
>> Without PIE, the binary’s code size in bytes is this as displayed by
>> the ‘size’ command:
>>
>>  text             data            bss           dec
>> 504663285 16242884 9130248 530036417
>>
>> With PIE, the binary’s code size in bytes is this as displayed by the
>> ‘size’ command:
>>
>>  text            data           bss           dec
>> 539781977 16242900 9130248 565155125
>>
>> The text size of the binary grew by 7% and the total size by 6.6%.
>> Our experiments have shown that the binary sizes grow anywhere from 5%
>> to 9%  with PIE on almost all benchmarks we looked at.
>>
>> Notice that almost all the code bloat comes from the “text” segment of
>> the binary, which contains the executable code of the application and
>> any read-only data.  We looked into this segment to see why this is
>> happening and found that the size of the section that contains the
>> dynamic relocations for a binary explodes with PIE.  For instance,
>> without PIE, for the above binary the dynamic relocation section
>> contains 46 entries whereas with PIE, the same section contains
>> 1463325 entries.  It takes 24 bytes to store one entry, that is 3
>> integer values each of size 8 bytes.  So, the dynamic relocations
>> alone need an extra space of (1463325 - 46) * 8 bytes which is 35
>> million bytes which is almost all the bloat incurred!.
>>
>> * What are these extra dynamic relocations that are created for PIE executables?
>>
>> We noticed that these extra relocations for PIE binaries have a common
>> pattern and are needed for the reason that it is not known until
>> run-time where the binary will be loaded.  All of these extra dynamic
>> relocations are of the ELF type R_X86_64_RELATIVE.   Let us show using
>> an example what these relocations do.
>>
>> Let us take an example of a program that stores the address of a global:
>>
>> #include <stdio.h>
>>
>> const int a = 10;
>>
>> const int *b = &a;
>>
>> int main() {
>>
>>  printf (“b = %p\n”, b);
>>
>> }
>>
>> First, let us look at the binary built without PIE.  Let’s look at the
>> data section where ‘b’ and ‘a’ are allocated.
>>
>> 00000000004007d0 <a>:
>>  4007d0:       0a 00
>>
>>
>> 0000000000401b10 <b>:
>>  401b10:       d0 07
>>  401b12:       40 00 00
>>
>> Variable ‘a’ is allocated at address 0x4007d0 which matches the output
>> when running the binary.  ‘b’ is allocated at address 0x401b10 and its
>> contents in little-endian byte order is the address of ‘a’.
>>
>> Now, lets us examine the contents of the PIE binary:
>>
>> 00000000000008d8 <a>:
>> 8d8:   0a 00
>>
>> 0000000000001c50 <b>:
>>    1c50:       d8 08
>>                     1c50: R_X86_64_RELATIVE *ABS*+0x8d8
>>    1c52:       00 00
>>    1c54:       00 00
>>
>>
>> Notice there is a dynamic relocation here which tells the dynamic
>> linker that this value needs to be fixed at run-time.  This is needed
>> because ASLR can load this binary anywhere in the address space and
>> this relocation fixes the address after it is loaded.
>>
>>
>> * More details about R_X86_64_RELATIVE relocations
>>
>> This relocation is worth 24 bytes  and has three fields
>>
>> Offset
>>
>> Type - here it is R_X86_64_RELATIVE
>>
>> Addend (what extra value needs to be added)
>>
>> The offset field of this relocation is the address offset from the
>> start where this relocation applies.  The type field indicates the
>> type of the dynamic relocation but we are interested in particularly
>> one type of dynamic relocation, R_X86_64_RELATIVE.   This is important
>> because in the motivating example that we presented above, all the
>> extra dynamic relocations were of this type!
>>
>>
>> * We have these proposals to reduce the size of the dynamic relocations section:
>>
>>
>> Idea A: Shrinking the size of the dynamic relocations
>>
>> 1. Use the offset of the relocation to store the addend
>>
>> 2. Create a separate section for storing R_X86_64_RELATIVE type
>> relocations, “.pie.dyn”.
>
> This doesn't seem like a very good name for the new section; the same
> scheme should also be useful for DSOs.
>
>> 3. Create a new relocation type, R_X86_64_RELATIVE_OFFSET that has
>> exactly one field, the offset.
>>
>> 4. The type field is not necessary as these is only one type of
>> relocation in this new section.
>>
>> 5. Store the addend at the offset, which has enough space to store 8
>> byte addends.
>>
>> 6. The dynamic linker can be thought to read the value from this
>> offset for the addend rather than reading this from the entry.
>>
>> 7. This new section is just a sequence of offsets which are
>> R_X86_64_RELATIVE relocations.  The dynamic linker needs to be taught
>> to apply these relocations correctly.
>
> You can make the observation that relative relocations are normally
> aligned to the machine's pointer size, and they normally pertain to
> consecutive entries in a table of addresses, such as a vtable. That
> hints at a more efficient scheme where instead of storing offsets you
> store the difference between the offset and the previous one, divided
> by the machine's pointer size. That, combined with a simple run-length
> encoding scheme (see below) should be enough. Any unaligned
> relocations can be represented using regular R_*_RELATIVE relocations.
>
>> 8. Figure out a way to either prevent this from running or warning at
>> startup when trying to run this program with old dynamic linkers.
>
> What I had in mind for backwards compatibility was that you could have
> the linker arrange for the binary to contain code that applies the
> relocations when the image is loaded, but only if the dynamic loader
> hasn't applied them itself.
>
> i.e. you could have something like
>
> static void *foo = &foo;
>
> void apply_relocs() {
>   if (foo != &foo) {
>     //apply relocations
>   }
> }
>
> such that apply_relocs would be the first thing called by the dynamic
> loader before the user's initialization code.
>
> Of course, if you didn't want to support old dynamic loaders at all,
> you could exit() inside the body of the if statement instead of
> applying relocations.
>
>> 9. Eliminating two-thirds of the relocation reduces the size of these
>> sections by 66%.  Additionally, this section can also be compressed
>> and a compression factor of 10 is possible which makes the total size
>> of these sections 3% to 4% of the original.
>
> I don't think you need any sort of standard compression for this. The
> run-length encoding scheme I had in mind would represent the first
> group of consecutive relocations in the file as a pair of values
> (offset of the relocation / machine pointer size, number of
> consecutive relocations), and subsequent groups as the pair (relative
> offset from previous group of consecutive relocations / machine
> pointer size, number of consecutive relocations) all using the uleb128
> encoding. So for example, the relocation list:
>
> 0xc8
> 0xd0
> 0xd8
> 0xe0
> 0x100
> 0x108
> 0x110
>
> would be encoded as: uleb128(0xc8 / 8) uleb128(4) uleb128((0x100-0xe0)
> / 8) uleb128(3)
>
> i.e. 4 bytes.
>
> In one experiment, I extracted the list of relative relocations from a
> Chromium binary and compressed them using this scheme. There were
> 90302 relocations in total, and this scheme compressed them down to
> 65488 bytes, which would be 3% of the original for a rela section
> (this experiment was actually conducted using an ARM binary, but I
> would expect similar results for x86_64).
>
> Thanks,
> Peter
>
>> 10. The decompression by the dynamic linker can be done in chunks to
>> keep the memory overhead fixed and small.  Notice that the
>> decompressed chunk’s contents can be discarded after the relocation is
>> applied.
>>
>> Idea B: Compressing the dynamic relocation section into .zrela.dyn
>>
>> 1. Simpler to implement.
>>
>> 2. The dynamic linker needs to decompress and apply these relocations
>> without  memory overhead, which otherwise defeats the purpose of doing
>> this, by decompressing in chunks.
>>
>> 3. The decompressed chunks can be discarded as and when the
>> relocations are applied so a fixed size buffer for decompression
>> should work well.
>>
>> 4. Figure out a way to either prevent this from running or warning at
>> startup when trying to run this program with unsupported dynamic
>> linkers.
>>
>> 5. Compression factors that can be achieved with gzip are a factor of
>> 10 which makes the total size 10% of the original.
>>
>>
>> Thanks to Sterling Augustine, Ian Lance Taylor, Paul Pluzhnikov, and
>> David Li for the numerous  inputs!  Please let us know what you think
>> and what would be suitable for implementation.
>>
>> Thanks
>> Sri


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]