
Re: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section



Sri and I have been working on this over the past few months, and we've made
some good progress that we'd like to share and get feedback on.

Our work is based on the 'experimental-relr' prototype from Cary, which is
available on the 'users/ccoutant/experimental-relr' branch of the binutils
repository [1] and was described earlier in this thread:
https://sourceware.org/ml/gnu-gabi/2017-q2/msg00003.html

We've taken the '.relr.dyn' section from Cary's prototype, and implemented a
custom encoding to compactly represent the list of offsets. We're calling the
new compressed section '.relrz.dyn' (for relocations-relative-compressed).

The encoding used is a simple combination of delta-encoding and a bitmap of
offsets. The section consists of 64-bit entries: the upper 8 bits contain the
delta since the last offset, and the lower 56 bits contain a bitmap of the
words to apply the relocation to. This is best described by showing the code
for decoding the section:

typedef struct
{
  Elf64_Xword  r_data;  /* jump and bitmap for relative relocations */
} Elf64_Relrz;

#define ELF64_R_JUMP(val)    ((val) >> 56)
#define ELF64_R_BITS(val)    ((val) & 0xffffffffffffff)

#ifdef DO_RELRZ
  {
    /* Base offset of the current bitmap; each jump is relative to it.  */
    ElfW(Addr) offset = 0;
    for (; relative < end; ++relative)
      {
        ElfW(Addr) jump = ELFW(R_JUMP) (relative->r_data);
        ElfW(Addr) bits = ELFW(R_BITS) (relative->r_data);
        /* jump counts words, not bytes.  */
        offset += jump * sizeof(ElfW(Addr));
        if (jump == 0)
          {
            /* Escape: the next entry holds the full offset.  */
            ++relative;
            offset = relative->r_data;
          }
        /* Bit i set means: relocate the word at offset + i words.  */
        ElfW(Addr) r_offset = offset;
        for (; bits != 0; bits >>= 1)
          {
            if ((bits&1) != 0)
              elf_machine_relrz_relative (l_addr, (void *) (l_addr + r_offset));
            r_offset += sizeof(ElfW(Addr));
          }
      }
  }
#endif

Note that the 8-bit 'jump' encodes the number of _words_ since the last offset.
The case where the jump would not fit in 8 bits is handled by setting jump to 0
and emitting the full offset for the next relocation in the subsequent entry.

The above code is the entirety of the implementation for decoding and
processing '.relrz.dyn' sections in the glibc dynamic loader.
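
The linker side is not shown in this message. As a rough illustration only, an
encoder that is the inverse of the decoding loop might look like the sketch
below. The names relrz_encode and relrz_decode are hypothetical, plain
uint64_t stands in for the ELF types, and the grouping strategy (greedily
packing as many offsets as fit into each bitmap) is an assumption, not
necessarily what the ld.gold implementation does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RELRZ_BITMAP_BITS 56
#define RELRZ_BITS_MASK   UINT64_C(0x00ffffffffffffff)
#define RELRZ_WORD        ((uint64_t) sizeof (uint64_t))

/* Encode a strictly increasing list of offsets (hypothetical helper).
   Returns the number of 64-bit entries written.  OUT needs room for
   2 * N entries, the worst case where every group uses the escape.  */
static size_t
relrz_encode (const uint64_t *offsets, size_t n, uint64_t *out)
{
  size_t count = 0, i = 0;
  uint64_t base = 0;            /* mirrors 'offset' in the decoder */

  while (i < n)
    {
      uint64_t first = offsets[i];
      assert (i == 0 || first > offsets[i - 1]);

      /* Use the 8-bit jump only if the distance is a whole number of
         words in 1..255; otherwise fall back to jump == 0.  */
      uint64_t delta = first - base;
      uint64_t jump = 0;
      if (delta % RELRZ_WORD == 0 && delta / RELRZ_WORD <= 0xff)
        jump = delta / RELRZ_WORD;

      /* Collect every following offset that lands on one of the 56
         words starting at FIRST.  Bit 0 is always FIRST itself.  */
      uint64_t bits = 0;
      for (; i < n; ++i)
        {
          uint64_t d = offsets[i] - first;
          if (d % RELRZ_WORD != 0 || d / RELRZ_WORD >= RELRZ_BITMAP_BITS)
            break;
          bits |= UINT64_C(1) << (d / RELRZ_WORD);
        }

      out[count++] = (jump << RELRZ_BITMAP_BITS) | bits;
      if (jump == 0)
        out[count++] = first;   /* full offset follows the escape */
      base = first;
    }
  return count;
}

/* Same walk as the loader loop, but recording offsets instead of
   applying relocations (hypothetical helper for checking).  */
static size_t
relrz_decode (const uint64_t *in, size_t count, uint64_t *offsets)
{
  size_t n = 0;
  uint64_t offset = 0;

  for (size_t k = 0; k < count; ++k)
    {
      uint64_t jump = in[k] >> RELRZ_BITMAP_BITS;
      uint64_t bits = in[k] & RELRZ_BITS_MASK;
      offset += jump * RELRZ_WORD;
      if (jump == 0)
        {
          assert (k + 1 < count);
          offset = in[++k];
        }
      uint64_t r_offset = offset;
      for (; bits != 0; bits >>= 1)
        {
          if ((bits & 1) != 0)
            offsets[n++] = r_offset;
          r_offset += RELRZ_WORD;
        }
    }
  return n;
}
```

With this sketch, the six offsets 0x1000, 0x1008, 0x1018, 0x1200, 0x1208 and
0x100000 become five entries: an escape pair for 0x1000 with bitmap 0xb, one
entry with jump 64 and bitmap 0x3, and a second escape pair for 0x100000.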

This encoding can represent up to 56 relocation offsets in a single 64-bit
word. For many of the binaries we tested, this encoding provides >40x
compression for storing offsets over the original '.relr.dyn' section.

For 32-bit targets, we use 32-bit entries: 8 bits for 'jump' and 24 bits for
the bitmap.
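
By analogy with the 64-bit definitions, the 32-bit layout could be written as
in the sketch below. Only the ELF64 macros appear in this message, so the
ELF32_R_JUMP and ELF32_R_BITS definitions, the Elf32_Relrz_sketch type, and
the elf32_relrz_pack helper are all assumptions made for illustration, with
uint32_t standing in for Elf32_Word.

```c
#include <assert.h>
#include <stdint.h>

typedef struct
{
  uint32_t r_data;  /* jump and bitmap for relative relocations */
} Elf32_Relrz_sketch;

/* Upper 8 bits: jump in words.  Lower 24 bits: bitmap of words.  */
#define ELF32_R_JUMP(val)    ((val) >> 24)
#define ELF32_R_BITS(val)    ((val) & 0xffffffu)

/* Build one entry from its two fields (hypothetical helper).  */
static uint32_t
elf32_relrz_pack (uint32_t jump, uint32_t bits)
{
  assert (jump <= 0xffu && bits <= 0xffffffu);
  return (jump << 24) | bits;
}
```

With 4-byte words, one such entry covers 24 consecutive words (96 bytes), and
the largest jump that fits without the escape is 255 words (1020 bytes).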


Here are three real-world examples that demonstrate the savings:

1) Chrome browser (x86_64, built as PIE):
   File size (stripped): 152265064 bytes (145.21MB)
   605159 relocation entries (24 bytes each) in '.rela.dyn'
   594542 are R_X86_64_RELATIVE relocations (98.25%)
       14269008 bytes (13.61MB) in use in '.rela.dyn' section
         109256 bytes  (0.10MB) if moved to '.relrz.dyn' section

   Savings: 14159752 bytes, or 9.29% of original file size.


2) Go net/http test binary (x86_64, 'go test -buildmode=pie -c net/http')
   File size (stripped): 10238168 bytes (9.76MB)
   83810 relocation entries (24 bytes each) in '.rela.dyn'
   83804 are R_X86_64_RELATIVE relocations (99.99%)
       2011296 bytes (1.92MB) in use in '.rela.dyn' section
         43744 bytes (0.04MB) if moved to '.relrz.dyn' section

   Savings: 1967552 bytes, or 19.21% of original file size.


3) Vim binary in /usr/bin on my workstation (Ubuntu, x86_64)
   File size (stripped): 3030032 bytes (2.89MB)
   6680 relocation entries (24 bytes each) in '.rela.dyn'
   6272 are R_X86_64_RELATIVE relocations (93.89%)
       150528 bytes (0.14MB) in use in '.rela.dyn' section
         1992 bytes (0.00MB) if moved to '.relrz.dyn' section

   Savings: 148536 bytes, or 4.90% of original file size.
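
The arithmetic behind the three summaries can be reproduced with a few lines
of C. This is only a sketch: the struct and function names are made up for
illustration, and the percentage is computed in truncated basis points
(hundredths of a percent), which is what matches the figures quoted above.

```c
#include <assert.h>
#include <stdint.h>

#define RELA_ENTSIZE 24u   /* bytes per '.rela.dyn' entry on x86_64 */

struct relrz_savings
{
  uint64_t rela_bytes;     /* bytes used by the relative relocations */
  uint64_t saved_bytes;    /* rela_bytes minus the '.relrz.dyn' size */
  uint64_t saved_bp;       /* savings vs. file size, in basis points */
};

/* Hypothetical helper: derive the savings figures from the counts.  */
static struct relrz_savings
compute_relrz_savings (uint64_t file_size, uint64_t n_relative,
                       uint64_t relrz_bytes)
{
  struct relrz_savings s;
  s.rela_bytes = n_relative * RELA_ENTSIZE;
  assert (file_size != 0 && relrz_bytes <= s.rela_bytes);
  s.saved_bytes = s.rela_bytes - relrz_bytes;
  s.saved_bp = s.saved_bytes * 10000 / file_size;  /* truncates */
  return s;
}
```

For example, compute_relrz_savings (152265064, 594542, 109256) gives
14269008 bytes in '.rela.dyn', 14159752 bytes saved, and 929 basis points,
i.e. the 9.29% quoted for Chrome.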

Recent releases of Debian, Ubuntu, and several other distributions build
executables as PIE by default. Suprateeka posted some statistics earlier in
this thread on the prevalence of relative relocations in executables residing
in /usr/bin: https://sourceware.org/ml/gnu-gabi/2017-q2/msg00013.html

The third example above shows that using '.relrz.dyn' sections to encode
relative relocations can bring decent savings to executable sizes in /usr/bin
across many distributions.

We have working ld.gold and ld.so implementations for arm, aarch64, and x86_64,
and would be happy to send patches to the binutils and glibc communities for
review.

However, before that can happen, we need agreement on the ABI side for the new
section type and the encoding. We haven't previously worked on a change of this
magnitude, one that touches so many different pieces: the linker, the ELF
tools, and the dynamic loader. Specifically, we need agreement and/or guidance
on where and how the new section type and its encoding should be documented.
We're proposing adding new defines for SHT_RELRZ, DT_RELRZ, DT_RELRZSZ,
DT_RELRZENT, and DT_RELRZCOUNT that all the different parts of the toolchains
can agree on.

Thanks,
Rahul

[1]: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=shortlog;h=refs/heads/users/ccoutant/experimental-relr



On Mon, May 8, 2017 at 1:55 PM, Sriraman Tallam <tmsriram@google.com> wrote:
> +llvm-dev
>
> Discussion here: https://sourceware.org/ml/gnu-gabi/2017-q2/msg00000.html
>
> On Tue, May 2, 2017 at 10:17 AM, Suprateeka R Hegde
> <hegdesmailbox@gmail.com> wrote:
>> On 02-May-2017 12:05 AM, Florian Weimer wrote:
>>> On 05/01/2017 08:28 PM, Suprateeka R Hegde wrote:
>>>> So the ratio shows ~96% is RELATIVE reloc. And only ~4% others. This is
>>>> not the case on HP-UX/Itanium. But as I said, this comparison does not
>>>> make sense as the runtime architecture and ISA are totally different.
>>>
>>> It could be that HP-UX was written in a way to reduce relative
>>> relocations,
>>
>> Rather, the Itanium runtime architecture itself provides a way to reduce
>> them.
>>
>>> or that the final executables aren't actually PIC anymore.
>>
>> I was referring to shlibs (PIC) on HP-UX and it was implicit in my mind.
>> Sorry for that.
>>
>> I just built a large C++ shlib both on HP-UX/Itanium with our aCC
>> compiler and Linux x86-64 using GCC-6.2. The sources are almost same
>> with only a couple of lines differing between platforms.
>>
>> (HP-UX/Linux)
>> Total:    12224/38311
>> RELATIVE: 18/6397
>>
>> I will try to check the reason for such a huge difference in RELATIVE
>> reloc count. It might be useful for this discussion (just a guess)
>>
>> --
>> Supra