This is the mail archive of the
mailing list for the binutils project.
Re: On the implementation of IBT-enabled PLT with lazy binding
- From: "H.J. Lu" <hjl dot tools at gmail dot com>
- To: Fāng-ruì Sòng <maskray at google dot com>
- Cc: x86-64-abi <x86-64-abi at googlegroups dot com>, Binutils <binutils at sourceware dot org>
- Date: Wed, 3 Apr 2019 06:12:17 -0700
- Subject: Re: On the implementation of IBT-enabled PLT with lazy binding
- References: <CAFP8O3KKuz6aN5G5f+W4k5-BjyeDvxOFhKaMXcovPpdXMDZ4xw@mail.gmail.com>
On Tue, Apr 2, 2019 at 11:11 PM 'Fāng-ruì Sòng' via X86-64 System V
Application Binary Interface <email@example.com> wrote:
> Chapter 13 "Intel CET Extension" of x86-64 psABI describes an
> alternative PLT scheme for IBT (Indirect Branch Tracking). With GCC>=8
> and latest ld.bfd (in binutils-gdb), we can see the synthesized PLT:
> gcc -g -fuse-ld=bfd -fcf-protection=branch a.c -Wl,-z,ibtplt,-z,now -o a # -mibt for some older GCC 8 releases
> objdump -d a
> A PLT function, say `putchar`, has instruction sequences in both .plt
> and .plt.sec:
> .plt (16 bytes)
> 1030: f3 0f 1e fa endbr64
> 1034: 68 00 00 00 00 pushq $0x0
> 1039: f2 e9 e1 ff ff ff bnd jmpq 1020 <.plt>
> 103f: 90 nop
> .plt.sec (16 bytes)
> 0000000000001060 <putchar@plt>:
> 1060: f3 0f 1e fa endbr64
> 1064: f2 ff 25 5d 2f 00 00 bnd jmpq *0x2f5d(%rip) # 3fc8 <putchar@GLIBC_2.2.5>
> 106b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
> .text uses `callq 1060` to call putchar@plt. 0x1064 jumps to 0x1030
> for the initial call (lazy binding). After the stub at 0x1030 resolves
> the GOT slot to the real entry, future 0x1060 calls will jump directly
> to the real entry.
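The two jump targets in the listing above can be checked by hand: a RIP-relative operand is resolved against the address of the next instruction. The following is a minimal sketch in Python, using only the bytes shown in the objdump output; the helper names are made up for illustration and are not part of any tool.

```python
def disp32(raw: bytes) -> int:
    """Decode a little-endian signed 32-bit displacement."""
    return int.from_bytes(raw, "little", signed=True)


def rip_relative_target(insn_addr: int, insn_len: int, disp: int) -> int:
    """RIP points at the next instruction, so target = addr + len + disp."""
    return insn_addr + insn_len + disp


# 1064: f2 ff 25 5d 2f 00 00   bnd jmpq *0x2f5d(%rip)   (7 bytes)
got_slot = rip_relative_target(0x1064, 7, disp32(bytes.fromhex("5d2f0000")))

# 1039: f2 e9 e1 ff ff ff      bnd jmpq 1020 <.plt>     (6 bytes)
plt0 = rip_relative_target(0x1039, 6, disp32(bytes.fromhex("e1ffffff")))

print(hex(got_slot))  # 0x3fc8, the GOT slot for putchar
print(hex(plt0))      # 0x1020, the start of .plt
```

The first result is the GOT slot that the .plt.sec entry jumps through; the second is the .plt header that the lazy stub falls back to.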
> I have several questions regarding the second PLT scheme.
> 1. Should psABI change .splt to .plt.sec?
> The implementation uses .plt.sec for this feature.
> PLT sections do not have a dedicated section type and in practice they
> are usually recognized by the name .plt. The tools that do so include, but
> are not limited to, disassemblers (objdump, llvm-objdump), assemblers
> (e.g. llvm-mc, which emits warnings for unusual flags), binary
> instrumentation tools, profilers, and debuggers.
> If the implementations pick names different from the ABI, tools have
> to understand both .plt.sec and .splt to be ABI conforming. The
> complexity could have been avoided if implementations and the ABI
> agreed on the same name: .plt.sec
> I prefer .plt.sec to .splt because the convention is already used in
> several other places to assign fine-grained semantics to sections,
> e.g. .text.hot .text.startup .text.unlikely
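To illustrate the point about name-based recognition, here is a minimal sketch of the check a PLT-aware tool might carry if both names were in circulation. The function and the set are hypothetical and only list the names mentioned in this thread.

```python
# Section names a tool would have to accept if implementations (.plt.sec)
# and the psABI text (.splt) disagree. Hypothetical list for illustration.
PLT_SECTION_NAMES = frozenset({".plt", ".plt.sec", ".splt"})


def is_plt_section(name: str) -> bool:
    """Return True if a section name should be treated as a PLT section."""
    return name in PLT_SECTION_NAMES


for section in (".plt", ".plt.sec", ".splt", ".text.hot"):
    print(section, is_plt_section(section))
```

If the ABI and the implementations agreed on one name, the set would shrink to two entries.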
> 2. Merge .plt and .plt.sec
> As I proposed at https://reviews.llvm.org/D59780#1451608 , since we
> don't emit the bnd prefix (0xf2) for MPX
> (dropped by GCC 9), we can merge .plt and .plt.sec entries as follows:
> 4 endbr64
> 5 jmpq *xxx(%rip) ; jump to the next endbr64 for lazy binding
> 4 endbr64
> 5 pushq ; relocation index
> 5 jmpq *xxx(%rip) ; jump to .plt
> This PLT entry takes 4+5+4+5+5=23 bytes, and fits in a 24-byte entry
> size if we aim for 8-byte alignment.
> Not having to deal with .plt.sec simplifies implementation of PLT-aware tools.
> (If MPX is resurrected (I am not sure about the likelihood), the bnd
> prefixes before jmpq will take another 2 bytes and the PLT entry will
> no longer fit in a 24-byte entry. We can expand it to 32 bytes then.)
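The size argument can be rechecked with a small sketch. It uses the instruction lengths visible in the objdump listing with the one-byte bnd prefix removed, and it assumes the final jump to .plt is a direct `e9 rel32` jump as in that listing. Counted this way, the indirect `jmpq *xxx(%rip)` is 6 bytes (`ff 25` plus a 4-byte displacement), so the sketch arrives at 24 rather than 23; the conclusion about the 24-byte and 32-byte entry sizes is the same either way.

```python
ENDBR64 = 4       # f3 0f 1e fa
JMP_INDIRECT = 6  # ff 25 <disp32>   jmpq *xxx(%rip)
PUSH_IMM32 = 5    # 68 <imm32>       pushq $index
JMP_DIRECT = 5    # e9 <rel32>       jmpq .plt (assumed direct, as in the dump)
BND_PREFIX = 1    # f2, one per jmpq when MPX is in use


def merged_entry_size(with_bnd: bool) -> int:
    """Bytes of code in one merged PLT entry."""
    size = ENDBR64 + JMP_INDIRECT + ENDBR64 + PUSH_IMM32 + JMP_DIRECT
    if with_bnd:
        size += 2 * BND_PREFIX  # both jumps carry the prefix
    return size


def padded(size: int, align: int = 8) -> int:
    """Round size up to the next multiple of align."""
    return (size + align - 1) // align * align


print(merged_entry_size(False), padded(merged_entry_size(False)))  # 24 24
print(merged_entry_size(True), padded(merged_entry_size(True)))    # 26 32
```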
> 3. Necessity of the second PLT
> It was raised in https://reviews.llvm.org/D58102 that splitting the
> instruction sequences into .plt and .plt.sec may improve
> code cache locality. According to my understanding, in theory .plt.sec
> is hot while .plt is cold (only used for the first time). That being
> said, we see no evidence or benchmark results supporting the claim.
It may not show up in your benchmarks. But improving cache locality
is a good thing for overall system performance. We need every bit
of performance for CET.
> The other argument is that it provides compatibility with other tools
> that have a hardcoded limit of 16 bytes per PLT entry.
> I found top-of-tree gdb/objdump cannot symbolize 16-byte .plt and
> .plt.sec entries without the bnd prefixes
> (https://reviews.llvm.org/D59780#1451608) => even if the entry size
> sticks with 16, existing tools have to adopt new rules to symbolize
> PLT entries.
> Thus, we are causing trouble for existing tools, whether or not we
> introduce the second PLT.
> Given the complexity of the second PLT, not having the second PLT
> might be better.
A single PLT is simpler to implement. We designed 2 PLTs with performance
in mind. We implemented it many years ago, starting with MPX. It shouldn't
be changed just because it is "hard" to implement.
> BTW, I want to remind readers that the subject of this email contains "lazy
> binding". A 32-byte entry imposes more overhead on libcs without lazy
> binding functionality. For musl, a 5-byte `jmpq` instruction suffices,
> but of course `-fno-plt` may be a better solution, since it avoids dealing
> with PLT stuff at all.
> P.S. Don't get me wrong. The new security enhancement technology
> attracts me. I've done a few ROP-style CTF pwn challenges in the past
> and can imagine how useful IBT is, but I hope it introduces less
> complexity to toolchains :)
2 PLTs are a small piece of the CET run-time compared with the kernel, GCC
and glibc, partly because ld has a very flexible PLT framework to
accommodate different PLT schemes with lazy PLT (the first PLT)
and non-lazy PLT (the second PLT).