x86-64: new CET-enabled PLT format proposal

Tue Mar 1 09:25:47 GMT 2022

On Tue, Mar 1, 2022 at 6:17 PM Joao Moreira <joao@overdrivepizza.com> wrote:
>
> On 2022-02-28 16:04, H.J. Lu wrote:
> > On Sun, Feb 27, 2022 at 7:46 PM Rui Ueyama <rui314@gmail.com> wrote:
> >>
> >> On Mon, Feb 28, 2022 at 12:07 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >> >
> >> > On Sat, Feb 26, 2022 at 7:19 PM Rui Ueyama via Binutils
> >> > <binutils@sourceware.org> wrote:
> >> > >
> >> > > Hello,
> >> > >
> >> > > I'd like to propose an alternative instruction sequence for the Intel
> >> > > CET-enabled PLT section. Compared to the existing one, the new scheme is
> >> > > simple, compact (16 bytes per PLT entry instead of 32 bytes) and does not
> >> > > require a separate second PLT section (.plt.sec).
> >> > >
> >> > > Here is the proposed code sequence:
> >> > >
> >> > >   PLT0:
> >> > >
> >> > >   f3 0f 1e fa        // endbr64
> >> > >   41 53              // push %r11
> >> > >   ff 35 00 00 00 00  // push GOT[1]
> >> > >   ff 25 00 00 00 00  // jmp *GOT[2]
> >> > >   0f 1f 40 00        // nop
> >> > >   0f 1f 40 00        // nop
> >> > >   0f 1f 40 00        // nop
> >> > >   66 90              // nop
> >> > >
> >> > >   PLTn:
> >> > >
> >> > >   f3 0f 1e fa        // endbr64
> >> > >   41 bb 00 00 00 00  // mov $namen_reloc_index, %r11d
> >> > >   ff 25 00 00 00 00  // jmp *GOT[namen_index]
> >> >
> >> > All PLT calls will have an extra MOV.
> >>
> >> One extra load-immediate mov instruction is executed per function
> >> call through a PLT entry. The cost is so small that I couldn't see any
> >> difference in real-world apps.
>
> (also replying to Fangrui, whose e-mail, for whatever reason, did not
> come to this mailbox).
>
> I can see the benefits of having 16 byte/single plt entries. Yet, the
> R11 clobbering on every PLT transition is not amusing... If we want PLT
> entries to have only 16 bytes and not have a .plt.sec section, maybe we
> could try:
>
> <plt_header>
> pop %r11
> sub $plt_header, %r11
> shr $0x4, %r11
> push %r11
> jmp _dl_runtime_resolve_shstk_thunk
>
> <foo>:
> endbr // 4 bytes
> jmp GOT[foo] // 6 bytes
> call plt_header // 5 bytes

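This is what I tried first, but I then realized that I needed to insert
another `endbr` between `jmp` and `call`. If CET is enabled, the indirect
`jmp GOT[foo]` can land only on an `endbr`, and until the symbol is
resolved GOT[foo] points at the following `call`, so it can't jump there
directly. With the extra 4-byte `endbr`, the entry no longer fits in 16
bytes.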
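
To spell out the size arithmetic, here is a minimal sketch in Python. The
instruction sizes are the ones quoted in this thread; the function and
constant names are made up for illustration and do not come from any
linker source.

```python
# Instruction sizes in bytes, as given in this thread.
ENDBR64 = 4      # f3 0f 1e fa
JMP_GOT = 6      # ff 25 <rel32>, i.e. jmp *GOT[foo]
CALL_REL32 = 5   # e8 <rel32>, i.e. call plt_header

PLT_ENTRY_SIZE = 16  # target size of one PLT entry


def entry_size(needs_second_endbr: bool) -> int:
    """Size of an 'endbr; jmp; [endbr;] call' PLT entry in bytes."""
    size = ENDBR64 + JMP_GOT + CALL_REL32
    if needs_second_endbr:
        # Landing pad for the lazy-binding jump that targets the call.
        size += ENDBR64
    return size


without_endbr = entry_size(False)
with_endbr = entry_size(True)
print(without_endbr, without_endbr <= PLT_ENTRY_SIZE)
print(with_endbr, with_endbr <= PLT_ENTRY_SIZE)
```

So the three-instruction entry fits, but only as long as the lazy-binding
jump is allowed to land on the `call` itself.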

> Here, the PLT entry has 16 bytes and it pushes the PLT entry address to
> the stack by calling plt_header. The address is then popped in the
> plt_header and used to recover the index by subtracting the PLT offset
> from the address and then dividing it by 16. Then, the final step to
> make it shstk compatible is jumping to a special implementation of
> _dl_runtime_resolve (shstk_thunk) which will have the following snippet
> (similarly to glibc's __longjmp):
>
> testl $X86_FEATURE_1_SHSTK, %fs:FEATURE_1_OFFSET
> jz 1f
> mov $1, %r11
> incsspq %r11
> 1:
> jmp _dl_runtime_resolve
>
> I don't think the above test fits alongside the other instructions in
> the plt_header if we want it to be 32 bytes at most, hence the suggestion
> of having it as a _dl_runtime_resolve thunk. Another possibility is to
> resolve the relocation to the special thunk only if shstk is in place
> and otherwise resolve it directly to _dl_runtime_resolve, which avoids
> the extra overhead when shstk is absent.
>
> I think this solves both the size and the dummy mov overheads. The logic
> is a bit more convoluted, but perhaps we can work on making it simpler.
> FWIW, I have not tested or implemented any of this.
>
> Ah, also, pardon any asm mistakes/obvious details that I may have missed
> :)
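
As a sanity check of the index arithmetic in the quoted proposal, here is
a minimal Python sketch. It assumes that entries are 16 bytes each and
laid out back to back, and that the address pushed by `call` is the one
right after the 15 bytes of `endbr; jmp; call`. The `first_entry` address
and the helper names are hypothetical, chosen only for this example.

```python
PLT_ENTRY_SIZE = 16
CALL_END_OFFSET = 4 + 6 + 5  # endbr + jmp + call: where the return address points


def return_address(first_entry: int, index: int) -> int:
    """Address that 'call plt_header' in entry `index` pushes on the stack."""
    return first_entry + index * PLT_ENTRY_SIZE + CALL_END_OFFSET


def recover_index(first_entry: int, ret_addr: int) -> int:
    """Undo the above: subtract the base, then divide by 16 (shift by 4)."""
    return (ret_addr - first_entry) >> 4


first_entry = 0x1020  # hypothetical address of the first PLT entry
for i in (0, 1, 2, 3, 1000):
    assert recover_index(first_entry, return_address(first_entry, i)) == i
```

The shift by 4 discards the 15-byte offset inside the entry, so no extra
adjustment of the return address is needed. A shift by 5 would correspond
to 32-byte entries and would map two adjacent 16-byte entries to the same
index.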