[PATCH 12/12] gdb: use two displaced step buffers on amd64/Linux

Pedro Alves <pedro@palves.net>
Wed Nov 25 01:42:20 GMT 2020


On 11/10/20 9:46 PM, Simon Marchi via Gdb-patches wrote:
> As observed on a binary compiled on AMD64 Ubuntu 20.04, against glibc
> 2.31 (I think it's the libc that provides this startup code, right?),

Right.  This is crt0.o (what glibc ships as crt1.o/Scrt1.o).  It's built
from sysdeps/x86_64/start.S in this case:

  https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/start.S;h=7477b632f7ad6ffe6cd1975cdc4ec2317f11a6c3;hb=HEAD

> there are enough bytes at the executable's entry point to hold more than
> one displaced step buffer.  gdbarch_max_insn_length is 16, and the
> code at _start looks like:
> 
> 0000000000001040 <_start>:
>     1040:       f3 0f 1e fa             endbr64
>     1044:       31 ed                   xor    %ebp,%ebp
>     1046:       49 89 d1                mov    %rdx,%r9
>     1049:       5e                      pop    %rsi
>     104a:       48 89 e2                mov    %rsp,%rdx
>     104d:       48 83 e4 f0             and    $0xfffffffffffffff0,%rsp
>     1051:       50                      push   %rax
>     1052:       54                      push   %rsp
>     1053:       4c 8d 05 56 01 00 00    lea    0x156(%rip),%r8        # 11b0 <__libc_csu_fini>
>     105a:       48 8d 0d df 00 00 00    lea    0xdf(%rip),%rcx        # 1140 <__libc_csu_init>
>     1061:       48 8d 3d c1 00 00 00    lea    0xc1(%rip),%rdi        # 1129 <main>
>     1068:       ff 15 72 2f 00 00       callq  *0x2f72(%rip)        # 3fe0 <__libc_start_main@GLIBC_2.2.5>
>     106e:       f4                      hlt
>     106f:       90                      nop
> 
> The two buffers would occupy [0x1040, 0x1060).
> 
> I checked on Alpine, which uses the musl C library, the startup code
> looks like:
> 
> 0000000000001048 <_start>:
>     1048:       48 31 ed                xor    %rbp,%rbp
>     104b:       48 89 e7                mov    %rsp,%rdi
>     104e:       48 8d 35 e3 2d 00 00    lea    0x2de3(%rip),%rsi        # 3e38 <_DYNAMIC>
>     1055:       48 83 e4 f0             and    $0xfffffffffffffff0,%rsp
>     1059:       e8 00 00 00 00          callq  105e <_start_c>
> 
> 000000000000105e <_start_c>:
>     105e:       48 8b 37                mov    (%rdi),%rsi
>     1061:       48 8d 57 08             lea    0x8(%rdi),%rdx
>     1065:       45 31 c9                xor    %r9d,%r9d
>     1068:       4c 8d 05 47 01 00 00    lea    0x147(%rip),%r8        # 11b6 <_fini>
>     106f:       48 8d 0d 8a ff ff ff    lea    -0x76(%rip),%rcx        # 1000 <_init>
>     1076:       48 8d 3d 0c 01 00 00    lea    0x10c(%rip),%rdi        # 1189 <main>
>     107d:       e9 9e ff ff ff          jmpq   1020 <__libc_start_main@plt>
> 
> Even though there's a _start_c symbol, it all appears to be code that
> runs once at the very beginning of the program, so it looks fine if the
> two buffers occupy [0x1048, 0x1068).
> 
> One important thing I discovered while doing this is that when debugging
> a dynamically-linked executable, breakpoints in the shared library
> loader are hit before executing the _start code, and these breakpoints
> may be displaced-stepped.  So it's very important that the buffer bytes
> are restored properly after doing the displaced steps, otherwise the
> _start code will be corrupted once we try to execute it.
> 
> Another thing that made me think about is that library constructors (as
> in `__attribute__((constructor))`) run before _start.  And they are free
> to spawn threads.  What if one of these threads executes a displaced
> step, therefore changing the bytes at _start, while the main thread
> executes _start?  That doesn't sound good and I don't know how we could
> prevent it.  But this is a problem that predates the current patch.

Maybe we should put the buffer/buffers at the beginning of the
dynamic loader instead.
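
(For context on the restore requirement mentioned in the commit
message: one displaced step roughly follows the save/copy/step/restore
cycle sketched below.  The helper names are made up for illustration;
this is not GDB's actual code.)

/* Sketch of one displaced step over an instruction at ORIG_PC, using a
   scratch buffer at BUFFER_ADDR (e.g. the first bytes of _start).  All
   helpers below are hypothetical.  */

#define MAX_INSN_LEN 16

extern void read_inferior_memory (unsigned long addr, unsigned char *buf,
                                  int len);
extern void write_inferior_memory (unsigned long addr,
                                   const unsigned char *buf, int len);
extern void set_thread_pc (int thread, unsigned long pc);
extern void single_step_thread (int thread);
extern void fixup_after_step (int thread, unsigned long orig_pc,
                              unsigned long buffer_addr);

void
displaced_step_sketch (int thread, unsigned long orig_pc,
                       unsigned long buffer_addr,
                       const unsigned char *insn_copy, int insn_len)
{
  unsigned char saved[MAX_INSN_LEN];

  /* 1. Save whatever currently lives in the scratch buffer, e.g. the
     first bytes of _start.  */
  read_inferior_memory (buffer_addr, saved, MAX_INSN_LEN);

  /* 2. Copy the (possibly adjusted) instruction into the buffer and
     single-step it there, out of the breakpoint's way.  */
  write_inferior_memory (buffer_addr, insn_copy, insn_len);
  set_thread_pc (thread, buffer_addr);
  single_step_thread (thread);

  /* 3. Restore the original bytes.  If this restore is skipped, the
     _start code is left corrupted and crashes once it eventually
     runs.  */
  write_inferior_memory (buffer_addr, saved, MAX_INSN_LEN);

  /* 4. Fix up the PC and any PC-relative state as if the instruction
     had executed at ORIG_PC.  */
  fixup_after_step (thread, orig_pc, buffer_addr);
}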

> 
> Even when stress-testing the implementation, by making many threads do
> displaced steps over and over, I didn't see a significant performance
> improvement (I confirmed that the two buffers were used by checking
> the "set debug displaced" logs though).  However, this patch mostly
> helps make the feature testable by anybody with an AMD64/Linux
> machine, so I think it's useful.

It should speed up the use case of multiple threads hitting a conditional
breakpoint that evals false at the same time.  The more buffers, the less
time threads need to spend waiting for their turn in the displaced stepping
queue.  If that doesn't speed things up, then probably something isn't quite
right.

/me gives it a try.

Here's a testcase that can be used to observe speed and fairness.
It has 10 threads running a single tight loop for 3 seconds.  After
the 3 seconds, the threads exit and the number of iterations each
thread managed to do is printed.  The idea is to set a breakpoint on
the loop with a condition that evals false.  The higher the counts,
the better.
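
To double-check that the second buffer actually gets exercised, the
displaced stepping debug output mentioned in the commit message can be
enabled before the runs shown below:

 (gdb) set debug displaced on
 (gdb) b 16 if 0
 (gdb) r

The debug output should then show which scratch buffer each displaced
step used; I'm not including that output here.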

I compiled it with:

 $ gcc disp-step-buffers-test.c -o disp-step-buffers-test -g3 -O2 -pthread

Here's a run outside gdb:

 $ /tmp/disp-step-buffers-test
 thread 0, count 12785417966
 thread 1, count 12784090476
 thread 2, count 12773373753
 thread 3, count 12782542707
 thread 4, count 12767835404
 thread 5, count 12754382637
 thread 6, count 12783454775
 thread 7, count 12605966064
 thread 8, count 12635255271
 thread 9, count 12783436261

Here's the test:

$ cat /tmp/disp-step-buffers-test.c 
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

#define NUM_THREADS 10

static pthread_t child_thread[NUM_THREADS];
static unsigned long long counters[NUM_THREADS];
static volatile int done;

static void *
child_function (void *arg)
{
  while (!done)
    counters[(long) arg]++;   // set breakpoint here
  return NULL;
}

int
main (void)
{
  long i;

  for (i = 0; i < NUM_THREADS; i++)
    pthread_create (&child_thread[i], NULL, child_function, (void *) i);

  sleep (3);

  done = 1;

  for (i = 0; i < NUM_THREADS; i++)
    pthread_join (child_thread[i], NULL);

  for (i = 0; i < NUM_THREADS; i++)
    printf ("thread %ld, count %llu\n", i, counters[i]);
  return 0;
}

And here are the results under gdb.

Test #1, pristine master (119e99bb7f50):

 (gdb) b 16 if 0
 Breakpoint 1 at 0x1290: file disp-step-buffers-test.c, line 16.
 (gdb) r
 Starting program: /tmp/disp-step-buffers-test 
 thread 0, count 1661
 thread 1, count 1663
 thread 2, count 1646
 thread 3, count 1663
 thread 4, count 1622
 thread 5, count 1661
 thread 6, count 1659
 thread 7, count 1662
 thread 8, count 1660
 thread 9, count 1660
 [Inferior 1 (process 18852) exited normally]
 (gdb) 

Test #2, patches 1 through 11 applied, i.e., 1 buffer:

 (gdb) b 16 if 0
 Breakpoint 1 at 0x1290: file disp-step-buffers-test.c, line 16.
 (gdb) r
 Starting program: /tmp/disp-step-buffers-test 
 ...
 thread 0, count 966
 thread 1, count 950
 thread 2, count 951
 thread 3, count 950
 thread 4, count 946
 thread 5, count 948
 thread 6, count 946
 thread 7, count 979
 thread 8, count 951
 thread 9, count 966
 [Inferior 1 (process 16099) exited normally]
 (gdb)

Test #3, patches 1 through 12 applied, i.e., 2 buffers:

 (gdb) r
 Starting program: /tmp/disp-step-buffers-test 
 ...
 thread 0, count 1124
 thread 1, count 1128
 thread 2, count 1127
 thread 3, count 1123
 thread 4, count 1121
 thread 5, count 1125
 thread 6, count 1126
 thread 7, count 1126
 thread 8, count 1122
 thread 9, count 1122
 [Inferior 1 (process 14983) exited normally]

Wow, test #2 was surprising: it's roughly twice as slow as current
master.  That was unexpected.

Test #3 with the two buffers improves the speed a bit, but it's still
well behind current master.  I think this should be sorted out, or at
least understood.

I checked, and it's patch #9 that introduces the slowdown:

 gdb: move displaced stepping logic to gdbarch, allow starting concurrent displaced steps

So I'd suspect some new linear thread walks are causing this?
E.g., displaced_step_in_progress.  Maybe we need to keep a counter
of the number of threads displaced stepping instead of walking all
threads, for example.  I didn't expect this to be observable with
only 10 threads.
I'd expect syscall time / ptrace, etc. to dominate.  Maybe it's
something else?
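
The counter idea would look something like this (just a sketch with
made-up names, not the actual GDB data structures):

/* Keep a per-process count of in-flight displaced steps so that
   "is any thread displaced stepping?" doesn't need to walk all
   threads.  All names here are hypothetical.  */

#define MAX_THREADS 1024

struct sketch_thread
{
  int displaced_step_in_progress;
};

struct sketch_process
{
  struct sketch_thread threads[MAX_THREADS];
  int nthreads;

  /* Incremented when a displaced step starts, decremented when it
     finishes.  */
  int displaced_steps_in_flight;
};

/* What a linear walk looks like: O(nthreads) per query.  */
int
any_displaced_step_in_progress_walk (const struct sketch_process *proc)
{
  for (int i = 0; i < proc->nthreads; i++)
    if (proc->threads[i].displaced_step_in_progress)
      return 1;
  return 0;
}

/* With a counter, starting/finishing a step updates the count ...  */
void
displaced_step_started (struct sketch_process *proc,
                        struct sketch_thread *thr)
{
  thr->displaced_step_in_progress = 1;
  proc->displaced_steps_in_flight++;
}

void
displaced_step_finished (struct sketch_process *proc,
                         struct sketch_thread *thr)
{
  thr->displaced_step_in_progress = 0;
  proc->displaced_steps_in_flight--;
}

/* ... and the query becomes a constant-time comparison.  */
int
any_displaced_step_in_progress_counter (const struct sketch_process *proc)
{
  return proc->displaced_steps_in_flight > 0;
}

Of course, that only helps if these walks really are what's dominating;
it would need profiling to confirm.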

The above numbers were with an -O0 gdb.  I tried again against a gdb built
at -O2, and the absolute numbers were of course better, but the
slowdown is still observed.  I also tried with different numbers
of threads.  Here's what I see:

| threads | master | patch #11 | patch #12 |
|---------|--------|-----------|-----------|
|      10 |   3959 |      2847 |      3055 |
|      20 |   1806 |      1055 |      1134 |
|      50 |    690 |       258 |       258 |

