x86-64: new CET-enabled PLT format proposal

Sun Feb 27 03:18:47 GMT 2022

Hello,

I'd like to propose an alternative instruction sequence for the Intel
CET-enabled PLT section. Compared to the existing one, the new scheme is
simpler, more compact (16 bytes vs. 32 bytes for each PLT entry), and does
not require a separate second PLT section (.plt.sec).

Here is the proposed code sequence:

  PLT0:

  f3 0f 1e fa        // endbr64
  41 53              // push %r11
  ff 35 00 00 00 00  // push GOT[1]
  ff 25 00 00 00 00  // jmp *GOT[2]
  0f 1f 40 00        // nop
  0f 1f 40 00        // nop
  0f 1f 40 00        // nop
  66 90              // nop

  PLTn:

  f3 0f 1e fa        // endbr64
  41 bb 00 00 00 00  // mov $namen_reloc_index, %r11d
  ff 25 00 00 00 00  // jmp *GOT[namen_index]

GOT[namen_index] is initialized to PLT0 for all PLT entries, so that when a
PLT entry is called for the first time, control is passed to PLT0, which in
turn calls the resolver function.
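
As a rough illustration of how a linker might fill in the zeroed fields of
the proposed sequences, here is a minimal Python sketch. It is not taken from
mold. The function names, the example addresses, and the assumption that GOT
slots are 8 bytes wide (so GOT[1] is at GOT+8 and GOT[2] at GOT+16) are mine.
Each RIP-relative displacement is the target address minus the address of the
following instruction.

```python
import struct


def rel32(target, next_insn):
    """RIP-relative displacement: target minus the next instruction's address."""
    return struct.pack("<i", target - next_insn)


def encode_plt0(plt_addr, got_addr):
    """Build the 32-byte PLT0 header (assumes 8-byte GOT slots)."""
    code = bytes.fromhex("f30f1efa")                            # endbr64
    code += bytes.fromhex("4153")                               # push %r11
    code += b"\xff\x35" + rel32(got_addr + 8, plt_addr + 12)    # push GOT[1]
    code += b"\xff\x25" + rel32(got_addr + 16, plt_addr + 18)   # jmp *GOT[2]
    code += bytes.fromhex("0f1f4000") * 3                       # nop padding
    code += bytes.fromhex("6690")                               # nop padding
    return code


def encode_pltn(entry_addr, got_slot_addr, reloc_index):
    """Build one 16-byte PLTn entry."""
    code = bytes.fromhex("f30f1efa")                            # endbr64
    code += b"\x41\xbb" + struct.pack("<I", reloc_index)        # mov $idx, %r11d
    code += b"\xff\x25" + rel32(got_slot_addr, entry_addr + 16) # jmp *GOT[slot]
    return code


if __name__ == "__main__":
    # Hypothetical layout: .plt at 0x1000, .got.plt at 0x4000.
    print(encode_plt0(0x1000, 0x4000).hex(" "))
    print(encode_pltn(0x1020, 0x4018, 0).hex(" "))
```

The first PLTn entry starts right after the 32-byte header, which is why the
example places it at 0x1020.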

The new sequence uses %r11 as a scratch register. The x86-64 psABI explicitly
allows PLT entries to clobber this register (*1), and the resolver function
(_dl_runtime_resolve) already clobbers it.

(*1) x86-64 psABI p.24 footnote 17: "Note that %r11 is neither required to be
preserved, nor is it used to pass arguments. Making this register available as
scratch register means that code in the PLT need not spill any registers when
computing the address to which control needs to be transferred."
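
To make the role of %r11 concrete, here is a toy Python model of the control
flow: PLTn loads the relocation index into %r11 and jumps through its GOT
slot, PLT0 pushes %r11 for the resolver, and the resolver patches the slot so
later calls bypass PLT0. This is only a sketch of the idea. The class, its
attributes, and the "PLT0" sentinel are hypothetical and do not correspond to
any real loader API.

```python
class LazyPlt:
    """Toy model of lazy binding with the proposed PLT; not real machine code."""

    def __init__(self, symbols):
        self.symbols = symbols                # reloc index -> real function
        self.got = ["PLT0"] * len(symbols)    # every slot initially points at PLT0
        self.r11 = None                       # scratch register
        self.resolver_calls = 0

    def call(self, index, *args):
        self.r11 = index                      # PLTn: mov $reloc_index, %r11d
        target = self.got[index]              # PLTn: jmp *GOT[slot]
        if target == "PLT0":
            target = self._plt0()
        return target(*args)

    def _plt0(self):
        stack = [self.r11]                    # PLT0: push %r11
        self.resolver_calls += 1              # PLT0: jmp *GOT[2] (the resolver)
        index = stack.pop()                   # resolver reads the index
        self.got[index] = self.symbols[index] # resolver patches the GOT slot
        return self.got[index]


if __name__ == "__main__":
    plt = LazyPlt([len, abs])
    print(plt.call(1, -5), plt.resolver_calls)   # first call goes through PLT0
    print(plt.call(1, -7), plt.resolver_calls)   # second call jumps directly
```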

FYI, this is the current CET-enabled PLT:

  PLT0:

  ff 35 00 00 00 00    // push GOT[0]
  f2 ff 25 e3 2f 00 00 // bnd jmp *GOT[1]
  0f 1f 00             // nop

  PLTn in .plt:

  f3 0f 1e fa          // endbr64
  68 00 00 00 00       // push $namen_reloc_index
  f2 e9 e1 ff ff ff    // bnd jmpq PLT0
  90                   // nop

  PLTn in .plt.sec:

  f3 0f 1e fa          // endbr64
  f2 ff 25 ad 2f 00 00 // bnd jmpq *GOT[namen_index]
  0f 1f 44 00 00       // nop

In the proposed format, PLT0 is 32 bytes long and each entry is 16 bytes. In
the existing format, PLT0 is 16 bytes and each entry is 32 bytes (16 bytes in
.plt plus 16 bytes in .plt.sec). Usually, we have many PLT entries but only
one header, so in practice the proposed format is almost 50% smaller than the
existing one.
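
The size comparison can be checked with a few lines of Python. The function
names are mine and the entry count is just an example. The byte counts are
the ones given above.

```python
def proposed_plt_size(n_entries):
    """32-byte PLT0 plus one 16-byte entry per symbol."""
    return 32 + 16 * n_entries


def existing_plt_size(n_entries):
    """16-byte PLT0 plus 16 bytes in .plt and 16 bytes in .plt.sec per symbol."""
    return 16 + 32 * n_entries


if __name__ == "__main__":
    n = 1000  # example entry count
    new, old = proposed_plt_size(n), existing_plt_size(n)
    print(f"{n} entries: proposed {new} bytes, existing {old} bytes, "
          f"saving {100 * (1 - new / old):.1f}%")
```

With 1000 entries this gives 16032 bytes against 32016 bytes. The saving
approaches 50% as the number of entries grows, and the proposed format is only
larger when there are no entries at all.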

The proposed PLT does not use jump instructions with the BND prefix, as Intel
MPX has been deprecated.

I have already implemented the proposed scheme in my linker
(https://github.com/rui314/mold), and it appears to be working fine.

Any thoughts?