x86-64: new CET-enabled PLT format proposal
Fangrui Song
i@maskray.me
Tue Mar 1 02:22:47 GMT 2022
On 2022-03-01, Rui Ueyama via Binutils wrote:
>I think size reduction matters to some users even if you do not care
>about it that much. But I'm not trying too hard to push GNU binutils
>to adopt it. I just wanted to let you guys know that we invented a
>compact (and we believe better) instruction sequence for the
>CET-enabled PLT and we are already using it.
>
>On Tue, Mar 1, 2022 at 9:05 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>>
>> On Sun, Feb 27, 2022 at 7:46 PM Rui Ueyama <rui314@gmail.com> wrote:
>> >
>> > On Mon, Feb 28, 2022 at 12:07 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>> > >
>> > > On Sat, Feb 26, 2022 at 7:19 PM Rui Ueyama via Binutils
>> > > <binutils@sourceware.org> wrote:
>> > > >
>> > > > Hello,
>> > > >
>> > > > I'd like to propose an alternative instruction sequence for the Intel
>> > > > CET-enabled PLT section. Compared to the existing one, the new scheme is
>> > > > simpler, more compact (16 bytes instead of 32 bytes for each PLT entry)
>> > > > and does not require a separate second PLT section (.plt.sec).
>> > > >
>> > > > Here is the proposed code sequence:
>> > > >
>> > > > PLT0:
>> > > >
>> > > > f3 0f 1e fa // endbr64
>> > > > 41 53 // push %r11
>> > > > ff 35 00 00 00 00 // push GOT[1]
>> > > > ff 25 00 00 00 00 // jmp *GOT[2]
>> > > > 0f 1f 40 00 // nop
>> > > > 0f 1f 40 00 // nop
>> > > > 0f 1f 40 00 // nop
>> > > > 66 90 // nop
>> > > >
>> > > > PLTn:
>> > > >
>> > > > f3 0f 1e fa // endbr64
>> > > > 41 bb 00 00 00 00 // mov $namen_reloc_index, %r11d
>> > > > ff 25 00 00 00 00 // jmp *GOT[namen_index]
>> > >
>> > > All PLT calls will have an extra MOV.
>> >
>> > One extra load-immediate mov instruction is executed per function
>> > call through a PLT entry. The cost is so small that I couldn't see any
>> > difference in real-world apps.
>> >
>> > > > GOT[namen_index] is initialized to PLT0 for all PLT entries, so that when a
>> > > > PLT entry is called for the first time, the control is passed to PLT0 to call
>> > > > the resolver function.
>> > > >
>> > > > It uses %r11 as a scratch register. x86-64 psABI explicitly allows PLT entries
>> > > > to clobber this register (*1), and the resolver function (_dl_runtime_resolve)
>> > > > already clobbers it.
>> > > >
>> > > > (*1) x86-64 psABI p.24 footnote 17: "Note that %r11 is neither required to be
>> > > > preserved, nor is it used to pass arguments. Making this register available as
>> > > > scratch register means that code in the PLT need not spill any registers when
>> > > > computing the address to which control needs to be transferred."
>> > > >
>> > > > FYI, this is the current CET-enabled PLT:
>> > > >
>> > > > PLT0:
>> > > >
>> > > > ff 35 00 00 00 00 // push GOT[1]
>> > > > f2 ff 25 e3 2f 00 00 // bnd jmp *GOT[2]
>> > > > 0f 1f 00 // nop
>> > > >
>> > > > PLTn in .plt:
>> > > >
>> > > > f3 0f 1e fa // endbr64
>> > > > 68 00 00 00 00 // push $namen_reloc_index
>> > > > f2 e9 e1 ff ff ff // bnd jmpq PLT0
>> > > > 90 // nop
>> > > >
>> > > > PLTn in .plt.sec:
>> > > >
>> > > > f3 0f 1e fa // endbr64
>> > > > f2 ff 25 ad 2f 00 00 // bnd jmpq *GOT[namen_index]
>> > > > 0f 1f 44 00 00 // nop
>> > > >
>> > > > In the proposed format, PLT0 is 32 bytes long and each entry is 16 bytes. In
>> > > > the existing format, PLT0 is 16 bytes and each entry is 32 bytes (16 in .plt
>> > > > plus 16 in .plt.sec). Usually, we have many PLT entries but only one header,
>> > > > so in practice, the proposed format is almost 50% smaller than the existing one.
>> > >
>> > > Does it have any impact on performance? .plt.sec can be placed
>> > > in a different page from .plt.
>> > >
>> > > > The proposed PLT does not use jump instructions with BND prefix, as Intel MPX
>> > > > has been deprecated.
>> > > >
>> > > > I have already implemented the proposed scheme in my linker
>> > > > (https://github.com/rui314/mold) and it looks like it's working fine.
>> > > >
>> > > > Any thoughts?
>> > >
>> > > I'd like to see visible performance improvements or new features in
>> > > a new PLT layout.
>> >
>> > I didn't see any visible performance improvement with real-world apps.
>> > I might be able to craft a microbenchmark to hammer PLT entries really
>> > hard in some pattern to see some difference, but I think that doesn't
>> > make much sense. The size reduction is for real though.
>>
>> I am aware that there are 2 other proposals to use R11 in PLT/function
>> call. But they are introducing new features. I don't think we should
>> use R11 in PLT without any real performance improvements.
I like the proposal. Its merits include a simpler implementation and
smaller code size, plus two less obvious ones: (a) linker script users
won't need to mention .plt.sec, and (b) tools can identify PLTs in a more
unified way, as they do on other architectures.
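
To put a rough number on the code size reduction, here is a minimal
sketch that only restates the arithmetic from the byte listings quoted
above (32-byte PLT0 plus 16 bytes per entry for the proposed format;
16-byte PLT0 plus 16 bytes in .plt and 16 bytes in .plt.sec per entry
for the existing one). The function names plt_size and size_ratio are
made up for illustration, and the sketch ignores section alignment
padding.

```python
# Sizes in bytes, taken from the byte listings quoted in this thread.
PROPOSED_PLT0 = 32
PROPOSED_ENTRY = 16
EXISTING_PLT0 = 16
EXISTING_ENTRY = 16 + 16  # one slot in .plt plus one slot in .plt.sec


def plt_size(header: int, per_entry: int, n: int) -> int:
    """Total PLT bytes for n PLT entries (alignment padding ignored)."""
    return header + per_entry * n


def size_ratio(n: int) -> float:
    """Proposed size divided by existing size for n PLT entries."""
    proposed = plt_size(PROPOSED_PLT0, PROPOSED_ENTRY, n)
    existing = plt_size(EXISTING_PLT0, EXISTING_ENTRY, n)
    return proposed / existing


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        proposed = plt_size(PROPOSED_PLT0, PROPOSED_ENTRY, n)
        existing = plt_size(EXISTING_PLT0, EXISTING_ENTRY, n)
        print(f"n={n}: proposed={proposed} existing={existing} "
              f"ratio={size_ratio(n):.3f}")
```

With a single entry the two formats are the same size, since the larger
PLT0 cancels the saving; the ratio approaches 1/2 as the number of
entries grows.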
>> > > I cced x86-64 psABI mailing list.
>> > >
>> > >
>> > > --
>> > > H.J.
>>
>>
>>
>> --
>> H.J.
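
For readers who want to check the proposed PLTn encoding quoted above,
the following is an illustrative sketch of how a linker might fill in
the two zeroed fields of the 16-byte entry. It is not taken from mold or
binutils. The function name encode_pltn and its address parameters are
hypothetical, and it assumes the jmp uses a RIP-relative displacement
measured from the end of the entry, as ff 25 does on x86-64.

```python
import struct

ENDBR64 = bytes.fromhex("f30f1efa")     # endbr64
MOV_R11D_IMM32 = bytes.fromhex("41bb")  # mov $imm32, %r11d
JMP_RIP_REL = bytes.fromhex("ff25")     # jmp *disp32(%rip)
ENTRY_SIZE = 16


def encode_pltn(reloc_index: int, plt_entry_addr: int, got_slot_addr: int) -> bytes:
    """Build one proposed-format PLTn entry (hypothetical helper).

    reloc_index    -- index of the symbol's relocation, loaded into %r11d
    plt_entry_addr -- address of this PLT entry (assumed known)
    got_slot_addr  -- address of the symbol's GOT slot (assumed known)
    """
    # The jmp is the last instruction, so %rip after it equals the end
    # of the entry; the displacement is relative to that address.
    disp = got_slot_addr - (plt_entry_addr + ENTRY_SIZE)
    entry = (
        ENDBR64
        + MOV_R11D_IMM32 + struct.pack("<I", reloc_index)
        + JMP_RIP_REL + struct.pack("<i", disp)
    )
    assert len(entry) == ENTRY_SIZE
    return entry


if __name__ == "__main__":
    # Example addresses are arbitrary and chosen only for illustration.
    print(encode_pltn(3, 0x1030, 0x4018).hex(" "))
```

With a zero index and a zero displacement the result matches the
template bytes in the proposal: f3 0f 1e fa, 41 bb 00 00 00 00,
ff 25 00 00 00 00.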