Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section

Sun Jan 1 00:00:00 GMT 2017

On Tue, Apr 25, 2017 at 11:02 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Tue, Apr 25, 2017 at 10:12 AM, Sriraman Tallam <tmsriram@google.com> wrote:
>> We identified a problem with PIE executables: more than 5% code size
>> bloat compared to non-PIE.  We have a few proposals to reduce the
>> bloat.  Please take a look and let us know what you think.
>>
>> * What is the problem?
>>
>> PIE is a security hardening feature that enables ASLR (Address Space
>> Layout Randomization), allowing the executable to be loaded at a
>> random virtual address on every execution.  On average, a binary
>> built as PIE is 5% to 9% larger than the same binary built without
>> PIE, as measured on a suite of benchmarks used at Google where the
>> average text size is ~100MB.  This is independent of the target
>> architecture: we found it to be true for x86_64, arm64 and power.
>> The primary reason for this code size bloat is the extra dynamic
>> relocations that are generated to make the binary position
>> independent.  This proposal introduces new ways to represent these
>> dynamic relocations that can reduce the code size bloat to just a
>> few percent.
>>
>> As an example of the code size bloat, here is the data from one of
>> our larger binaries.
>>
>> Without PIE, the binary’s sizes in bytes, as displayed by the ‘size’
>> command:
>>
>>      text      data      bss        dec
>> 504663285  16242884  9130248  530036417
>>
>> With PIE, the binary’s sizes in bytes, as displayed by the ‘size’
>> command:
>>
>>      text      data      bss        dec
>> 539781977  16242900  9130248  565155125
>>
>> The text size of the binary grew by 7% and the total size by 6.6%.
>> Our experiments have shown that binary sizes grow anywhere from 5%
>> to 9% with PIE on almost all benchmarks we looked at.  Notice that
>> almost all the code bloat comes from the “text” segment of the binary,
>> which contains the executable code of the application and any
>> read-only data.  We looked into this segment to see why this is
>> happening and found that the size of the section that contains the
>> dynamic relocations for a binary explodes with PIE.  For instance,
>> without PIE, the dynamic relocation section of the above binary
>> contains 46 entries whereas with PIE, the same section contains
>> 1463325 entries.  It takes 24 bytes to store one entry, that is, 3
>> integer values of 8 bytes each.  So, the dynamic relocations alone
>> need an extra (1463325 - 46) * 24 bytes, which is about 35 million
>> bytes and accounts for almost all the bloat incurred!
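
The arithmetic above can be checked with a short sketch.  It assumes
the 24-byte entry size stated in the text (three 8-byte fields); the
helper name extra_reloc_bytes is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Size of one dynamic relocation entry as described in the text:
   three 8-byte fields (offset, type, addend). */
enum { RELA_ENTRY_BYTES = 3 * 8 };

/* Hypothetical helper: extra bytes taken by the dynamic relocation
   section when its entry count grows from non_pie_entries to
   pie_entries. */
static uint64_t extra_reloc_bytes(uint64_t non_pie_entries,
                                  uint64_t pie_entries)
{
    assert(pie_entries >= non_pie_entries);
    return (pie_entries - non_pie_entries) * RELA_ENTRY_BYTES;
}
```

For the binary above, extra_reloc_bytes(46, 1463325) evaluates to
35118696 bytes, while the text size reported by 'size' grew by
539781977 - 504663285 = 35118692 bytes.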
>>
>> * What are these extra dynamic relocations that are created for PIE executables?
>>
>> We noticed that these extra relocations for PIE binaries have a common
>> pattern and are needed because it is not known until run-time where
>> the binary will be loaded.  All of these extra dynamic relocations
>> are of the ELF type R_X86_64_RELATIVE.  To show what these
>> relocations do, let us take an example of a program that stores the
>> address of a global:
>>
>> #include <stdio.h>
>>
>> const int a = 10;
>>
>> const int *b = &a;
>>
>> int main() {
>>   /* %p expects a void pointer, hence the cast */
>>   printf("b = %p\n", (const void *)b);
>> }
>>
>> First, let us look at the binary built without PIE.  Let’s look at the
>> data section where ‘b’ and ‘a’ are allocated.
>>
>> 00000000004007d0 <a>:
>>  4007d0:       0a 00
>>
>>
>> 0000000000401b10 <b>:
>>  401b10:       d0 07
>>  401b12:       40 00 00
>>
>> Variable ‘a’ is allocated at address 0x4007d0, which matches the output
>> when running the binary.  ‘b’ is allocated at address 0x401b10 and its
>> contents, in little-endian byte order, are the address of ‘a’.
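
To make the little-endian reading concrete, here is a minimal sketch
that assembles the bytes shown in the dump of 'b' back into an
address.  The helper name read_le is an assumption for illustration
and is not part of any tool mentioned in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: assemble n bytes, least significant byte
   first, into a single integer. */
static uint64_t read_le(const unsigned char *bytes, size_t n)
{
    assert(n <= 8);
    uint64_t value = 0;
    for (size_t i = 0; i < n; i++)
        value |= (uint64_t)bytes[i] << (8 * i);
    return value;
}
```

Feeding it the five bytes d0 07 40 00 00 from the dump yields
0x4007d0, the address of 'a'.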
>>
>> Now, let us examine the contents of the PIE binary:
>>
>> 00000000000008d8 <a>:
>>     8d8:       0a 00
>>
>> 0000000000001c50 <b>:
>>    1c50:       d8 08
>>                     1c50: R_X86_64_RELATIVE *ABS*+0x8d8
>>    1c52:       00 00
>>    1c54:       00 00
>>
>>
>> Notice there is a dynamic relocation here which tells the dynamic
>> linker that this value needs to be fixed at run-time.  This is needed
>> because ASLR can load this binary anywhere in the address space and
>> this relocation fixes the address after it is loaded.
>>
>>
>> * More details about R_X86_64_RELATIVE relocations
>>
>> This relocation entry occupies 24 bytes and has three fields:
>>
>> Offset
>>
>> Type - here it is R_X86_64_RELATIVE
>>
>> Addend (what extra value needs to be added)
>>
>> The offset field of this relocation is the address offset from the
>> start of the binary where this relocation applies.  The type field
>> indicates the type of the dynamic relocation, but we are interested
>> in one type in particular, R_X86_64_RELATIVE.  This is important
>> because in the motivating example that we presented above, all the
>> extra dynamic relocations were of this type!
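
As a rough sketch of what the dynamic linker does with such an entry:
for a relative relocation it stores (load base + addend) at the given
offset.  The struct below only mirrors the three 8-byte fields
described above (the same layout as Elf64_Rela in <elf.h>, redefined
here to keep the example self-contained).  The names rela_entry and
apply_relative are assumptions for illustration, not code from any
real loader.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the 24-byte entry described in the text. */
typedef struct {
    uint64_t offset;  /* where to patch, relative to the load base */
    uint64_t info;    /* relocation type in the low 32 bits */
    int64_t  addend;  /* value to be added to the load base */
} rela_entry;

enum { RELOC_TYPE_RELATIVE = 8 };  /* value of R_X86_64_RELATIVE */

/* Hypothetical loader step.  `image` is the in-memory copy of the
   binary and `load_base` the address chosen for it at run-time.
   Returns the number of relative entries that were applied. */
static size_t apply_relative(unsigned char *image, size_t image_size,
                             uint64_t load_base,
                             const rela_entry *relocs, size_t count)
{
    size_t applied = 0;
    for (size_t i = 0; i < count; i++) {
        if ((relocs[i].info & 0xffffffffu) != RELOC_TYPE_RELATIVE)
            continue;  /* other relocation types are out of scope */
        assert(relocs[i].offset + sizeof(uint64_t) <= image_size);
        uint64_t value = load_base + (uint64_t)relocs[i].addend;
        memcpy(image + relocs[i].offset, &value, sizeof value);
        applied++;
    }
    return applied;
}
```

With the entry from the example (offset 0x1c50, addend 0x8d8) and an
assumed load base of 0x555555554000, the slot for 'b' would end up
holding 0x5555555548d8.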
>>
>>
>> * We have these proposals to reduce the size of the dynamic relocations section:
>>
>
> There are 3 pieces of run-time relocation information:
>
> 1. Type and symbol. 4 or 8 bytes
> 2. Offset. 4 or 8 bytes
> 3. Addend.  4 or 8 bytes
>
> If we use REL instead of RELA, addend can be implicit and stored in-place.
> If we limit the type to relative relocation, we only need offset.
> This applies to PIC in general, not just to PIE.  And we can use a
> special encoding scheme for the offset table, which can be placed in
> DT_GNU_RELATIVE_REL with DT_GNU_RELATIVE_RELSZ.
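
A minimal sketch of the offset-only idea described above: if every
entry is a relative relocation and the addend is stored in place, the
table reduces to a list of offsets.  The thread does not spell out
the actual encoding, so the plain array of 8-byte offsets used here
is an assumption, as is the function name apply_relative_offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical offset-only relocation pass.  Each table entry is just
   an offset; the addend is whatever value is already stored at that
   location in the image (REL-style, implicit addend). */
static void apply_relative_offsets(unsigned char *image,
                                   size_t image_size,
                                   uint64_t load_base,
                                   const uint64_t *offsets,
                                   size_t count)
{
    for (size_t i = 0; i < count; i++) {
        assert(offsets[i] + sizeof(uint64_t) <= image_size);
        uint64_t value;
        memcpy(&value, image + offsets[i], sizeof value);  /* addend */
        value += load_base;
        memcpy(image + offsets[i], &value, sizeof value);
    }
}
```

Even with uncompressed 8-byte offsets, 1463325 entries would take
1463325 * 8 = 11706600 bytes instead of 1463325 * 24 = 35119800
bytes; a denser encoding of the offsets could presumably reduce this
further.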

I have not made an intrusive change like this before, so I am
wondering which tools/pieces need to be modified.  Pointers on how to
go about this would be really helpful.  I can think of these:

* Linker - gold, lld, GNU ld
* Dynamic Linker
* readelf
* objdump
* ABI changes - what is involved here?

Thanks
Sri

>
> --
> H.J.