Bug 19675

Summary: GDB doesn't set PC correctly with displaced stepping over clone syscall
Product: gdb    Reporter: Yao Qi <qiyao>
Component: gdb    Assignee: Pedro Alves <pedro>
Status: RESOLVED FIXED    
Severity: normal CC: pedro, vries
Priority: P2    
Version: HEAD   
Target Milestone: 15.1   
Host: Target:
Build: Last reconfirmed:

Description Yao Qi 2016-02-19 12:06:16 UTC
When GDB displaced-steps a syscall instruction that creates a new thread, such as fork, vfork, or clone, it needs to adjust the PC of the child.  GDB already does that for fork and vfork; see this snippet in infrun.c:

	    /* GDB has got TARGET_WAITKIND_FORKED or TARGET_WAITKIND_VFORKED,
	       indicating that the displaced stepping of syscall instruction
	       has been done.  Perform cleanup for parent process here.  Note
	       that this operation also cleans up the child process for vfork,
	       because their pages are shared.  */

but it doesn't for clone, so the child's PC is left pointing into the scratch pad, which is wrong.  See the reproducer below:

(gdb) b clone
Breakpoint 1 at 0x400490
(gdb) run
Starting program: /tmp/2.exe 

Breakpoint 1, 0x00007ffff7b0f410 in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) disassemble 
Dump of assembler code for function clone:
=> 0x00007ffff7b0f410 <+0>:	mov    $0xffffffffffffffea,%rax
   0x00007ffff7b0f417 <+7>:	test   %rdi,%rdi
...
   0x00007ffff7b0f43a <+42>:	mov    $0x38,%eax
   0x00007ffff7b0f43f <+47>:	syscall      // the syscall call insn doing clone
   0x00007ffff7b0f441 <+49>:	test   %rax,%rax
   0x00007ffff7b0f444 <+52>:	jl     0x7ffff7b0f485 <clone+117>
...
   0x00007ffff7b0f495 <+133>:	retq   
End of assembler dump.
(gdb) b *0x00007ffff7b0f43f
Breakpoint 2 at 0x7ffff7b0f43f
(gdb) c
Continuing.

Breakpoint 2, 0x00007ffff7b0f43f in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) set displaced-stepping on
(gdb) si
[New LWP 13375]

Thread 2 received signal SIGSEGV, Segmentation fault.
[Switching to LWP 13375]
0x00000000004004e5 in _start ()
(gdb) info threads 
  Id   Target Id         Frame 
  1    LWP 13370 "2.exe" 0x00007ffff7b0f441 in clone () from /lib/x86_64-linux-gnu/libc.so.6
* 2    LWP 13375         0x00000000004004e5 in _start ()

$ cat /tmp/2.c 
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>

static void
marker () {}

#define STACK_SIZE 0x1000

static int
clone_fn (void *unused)
{
  return 0;
}

int
main (void)
{
  int i, pid;
  unsigned char *stack;

  stack = malloc (STACK_SIZE);

  pid = clone (clone_fn, stack + STACK_SIZE, CLONE_FILES | CLONE_VM,
	       NULL);

  free (stack);
  marker ();
}
-------------------------------
Comment 1 Sourceware Commits 2021-08-05 09:54:20 UTC
The master branch has been updated by Andrew Burgess <aburgess@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=99ba4b64d3636533f3659a9c7d3e8504ca93c770

commit 99ba4b64d3636533f3659a9c7d3e8504ca93c770
Author: Andrew Burgess <andrew.burgess@embecosm.com>
Date:   Tue Jun 8 12:49:04 2021 +0100

    gdb/testsuite: update test gdb.base/step-over-syscall.exp
    
    I was looking at PR gdb/19675 and the related test
    gdb.base/step-over-syscall.exp.  This test includes a call to kfail
    when we are testing a displaced step over a clone syscall.
    
    While looking at the test I removed the call to kfail and ran the
    test, and was surprised that the test passed.
    
    I ran the test a few times and it does sometimes fail, but mostly it
    passed fine.
    
    PR gdb/19675 describes how, when we displaced step over a clone, the
    new thread is created with a $pc in the displaced step buffer.  GDB
    then fails to "fix" this $pc (for the new thread), and the thread will
    be set running with its current $pc value.  This means that the new
    thread will just start executing from whatever happens to be after the
    displaced stepping buffer.
    
    In the original PR gdb/19675 bug report Yao Qi was seeing the new
    thread cause a segfault, but what actually happens is totally
    undefined.
    
    On my machine, I'm seeing the new thread reenter main and then start
    trying to run the test again (in the new thread).  This just happens
    to be safe enough (in this simple test) that most of the time the
    inferior doesn't crash.
    
    In this commit I try to make the test slightly more likely to fail by
    doing a couple of things.
    
    First, I added a static variable to main, this is set true when the
    first thread enters main, if a second thread ever enters main then I
    force an abort.
    
    Second, when the test is finishing I want to ensure that the new
    threads have had a chance to do "something bad" if they are going to.
    So I added a global counter, as each thread starts successfully it
    decrements the counter.  The main thread does not proceed to the final
    marker function (where GDB has placed a breakpoint) until all threads
    have started successfully.  This means that if the newly created
    thread doesn't successfully enter clone_fn then the counter will never
    reach zero and the test will time out.
    
    With these two changes my hope is that the test should fail more
    reliably, and so, I have also changed the test to call setup_kfail
    before the specific steps that we expect to misbehave instead of just
    calling kfail and skipping parts of the test completely.  The benefit
    of this is that if/when we fix GDB this test will start to KPASS and
    we'll know to update this test to remove the setup_kfail call.
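
As a rough illustration of the two guards described in this commit message, a minimal C sketch might look like the following.  All names here (note_main_entry, expect_threads, thread_started, all_threads_started) are hypothetical and are not taken from the actual step-over-syscall test source.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical names; the real test source may differ.  */

/* Set once the first thread has entered main.  */
static atomic_int main_entered;

/* Number of threads that have not yet reached their entry function.  */
static atomic_int threads_pending;

/* Call first thing in main.  Returns 1 on the first entry and 0 on any
   later entry, which would mean a stray thread has re-entered main.  */
static int
note_main_entry (void)
{
  return atomic_exchange (&main_entered, 1) == 0;
}

/* Record how many new threads the test is about to create.  */
static void
expect_threads (int n)
{
  assert (n >= 0);
  atomic_store (&threads_pending, n);
}

/* Call at the top of each thread's entry function.  */
static void
thread_started (void)
{
  atomic_fetch_sub (&threads_pending, 1);
}

/* Nonzero once every expected thread has started successfully.  */
static int
all_threads_started (void)
{
  return atomic_load (&threads_pending) == 0;
}
```

In a test program, main would call abort () when note_main_entry () returns 0, and would wait until all_threads_started () returns nonzero before calling the marker function.  A thread that never reaches its entry function then shows up as a timeout instead of a silent pass.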
Comment 2 Tom de Vries 2021-10-20 15:32:21 UTC
FTR, I'm also seeing:
...
(gdb) PASS: gdb.base/step-over-syscall.exp: clone: displaced=on: break marker
continue
Continuing.
[New Thread 0x7ffff7fe7700 (LWP 9850)]
../../gdb/linux-nat.c:1919: internal-error: wait returned unexpected status 0xb
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) KFAIL: gdb.base/step-over-syscall.exp: clone: displaced=on: continue to marker (clone) (GDB internal error) (PRMS: gdb/19675)
...
Comment 3 Pedro Alves 2022-06-21 11:25:18 UTC
I've posted a series that fixes this, here:
 [PATCH 00/25] Step over thread clone and thread exit
 https://sourceware.org/pipermail/gdb-patches/2022-June/190181.html
Comment 4 Sourceware Commits 2023-11-13 14:25:28 UTC
The master branch has been updated by Pedro Alves <palves@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=0d36baa9af0d9929c96b89a184a469c432c68b0d

commit 0d36baa9af0d9929c96b89a184a469c432c68b0d
Author: Pedro Alves <pedro@palves.net>
Date:   Fri Nov 12 20:50:29 2021 +0000

    Step over clone syscall w/ breakpoint, TARGET_WAITKIND_THREAD_CLONED
    
    (A good chunk of the problem statement in the commit log below is
    Andrew's, adjusted for a different solution, and for covering
    displaced stepping too.  The testcase is mostly Andrew's too.)
    
    This commit addresses bugs gdb/19675 and gdb/27830, which are about
    stepping over a breakpoint set at a clone syscall instruction, one is
    about displaced stepping, and the other about in-line stepping.
    
    Currently, when a new thread is created through a clone syscall, GDB
    sets the new thread running.  With 'continue' this makes sense
    (assuming no schedlock):
    
     - all-stop mode, user issues 'continue', all threads are set running,
       a newly created thread should also be set running.
    
     - non-stop mode, user issues 'continue', other pre-existing threads
       are not affected, but as the new thread is (sort-of) a child of the
       thread the user asked to run, it makes sense that the new threads
       should be created in the running state.
    
    Similarly, if we are stopped at the clone syscall, and there's no
    software breakpoint at this address, then the current behaviour is
    fine:
    
     - all-stop mode, user issues 'stepi', stepping will be done in place
       (as there's no breakpoint to step over).  While stepping the thread
       of interest all the other threads will be allowed to continue.  A
       newly created thread will be set running, and then stopped once the
       thread of interest has completed its step.
    
     - non-stop mode, user issues 'stepi', stepping will be done in place
       (as there's no breakpoint to step over).  Other threads might be
       running or stopped, but as with the continue case above, the new
       thread will be created running.  The only possible issue here is
       that the new thread will be left running after the initial thread
       has completed its stepi.  The user would need to manually select
       the thread and interrupt it, which might not be what the user
       expects.  However, this is not something this commit tries to
       change.
    
    The problem then is what happens when we try to step over a clone
    syscall if there is a breakpoint at the syscall address.
    
    - For both all-stop and non-stop modes, with in-line stepping:
    
       + user issues 'stepi',
       + [non-stop mode only] GDB stops all threads.  In all-stop mode all
         threads are already stopped.
       + GDB removes s/w breakpoint at syscall address,
       + GDB single steps just the thread of interest, all other threads
         are left stopped,
       + New thread is created running,
       + Initial thread completes its step,
       + [non-stop mode only] GDB resumes all threads that it previously
         stopped.
    
    There are two problems in the in-line stepping scenario above:
    
      1. The new thread might pass through the same code that the initial
         thread is in (i.e. the clone syscall code), in which case it will
         fail to hit the breakpoint in clone as this was removed so the
         first thread can single step,
    
      2. The new thread might trigger some other stop event before the
         initial thread reports its step completion.  If this happens we
         end up triggering an assertion as GDB assumes that only the
         thread being stepped should stop.  The assert looks like this:
    
         infrun.c:5899: internal-error: int finish_step_over(execution_control_state*): Assertion `ecs->event_thread->control.trap_expected' failed.
    
    - For both all-stop and non-stop modes, with displaced stepping:
    
       + user issues 'stepi',
       + GDB starts the displaced step, moves thread's PC to the
         out-of-line scratch pad, maybe adjusts registers,
       + GDB single steps the thread of interest, [non-stop mode only] all
         other threads are left as they were, either running or stopped.
         In all-stop, all other threads are left stopped.
       + New thread is created running,
       + Initial thread completes its step, GDB re-adjusts its PC,
         restores/releases scratchpad,
       + [non-stop mode only] GDB resumes the thread, now past its
         breakpoint.
       + [all-stop mode only] GDB resumes all threads.
    
    There is one problem with the displaced stepping scenario above:
    
      3. When the parent thread completed its step, GDB adjusted its PC,
         but did not adjust the child's PC, thus that new child thread
         will continue execution in the scratch pad, invoking undefined
         behavior.  If you're lucky, you see a crash.  If unlucky, the
         inferior gets silently corrupted.
    
    What is needed is for GDB to have more control over whether the new
    thread is created running or not.  Issue #1 above requires that the
    new thread not be allowed to run until the breakpoint has been
    reinserted.  The only way to guarantee this is if the new thread is
    held in a stopped state until the single step has completed.  Issue #3
    above requires that GDB is informed of when a thread clones itself,
    and of what is the child's ptid, so that GDB can fixup both the parent
    and the child.
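
To make issue #3 concrete, here is a minimal sketch of the PC arithmetic only; it is not GDB's actual code.  The struct, the function name fixup_pc, and any scratch pad address used with it are assumptions for illustration.  Real displaced-step fixups are architecture-specific and may adjust more than the PC.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t addr_t;

/* Hypothetical record of one displaced step.  */
struct displaced_step
{
  addr_t original_pc;	/* Address of the instruction being stepped over.  */
  addr_t scratch_pc;	/* Address of its copy in the scratch pad.  */
};

/* Map a PC at or after the scratch pad copy back to the same offset
   from the original instruction.  */
static addr_t
fixup_pc (const struct displaced_step *ds, addr_t pc)
{
  assert (pc >= ds->scratch_pc);
  return ds->original_pc + (pc - ds->scratch_pc);
}
```

Taking the addresses from the bug description, the syscall instruction sits at 0x7ffff7b0f43f and is 2 bytes long.  With its copy at some scratch address S, both the parent and the new child return from the syscall at S + 2, and fixup_pc maps that to 0x7ffff7b0f441, the instruction after the syscall.  Before this fix only the parent received that adjustment.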
    
    When looking for solutions to this problem I considered how GDB
    handles fork/vfork as these have some of the same issues.  The main
    difference between fork/vfork and clone is that the clone events are
    not reported back to core GDB.  Instead, the clone event is handled
    automatically in the target code and the child thread is immediately
    set running.
    
    Note we have support for requesting thread creation events out of the
    target (TARGET_WAITKIND_THREAD_CREATED).  However, those are reported
    for the new/child thread.  That would be sufficient to address in-line
    stepping (issue #1), but not for displaced-stepping (issue #3).  To
    handle displaced-stepping, we need an event that is reported to the
    _parent_ of the clone, as the information about the displaced step is
    associated with the clone parent.  TARGET_WAITKIND_THREAD_CREATED
    includes no indication of which thread is the parent that spawned the
    new child.  In fact, for some targets, like e.g., Windows, it would be
    impossible to know which thread that was, as thread creation there
    doesn't work by "cloning".
    
    The solution implemented here is to model clone on fork/vfork, and
    introduce a new TARGET_WAITKIND_THREAD_CLONED event.  This event is
    similar to TARGET_WAITKIND_FORKED and TARGET_WAITKIND_VFORKED, except
    that we end up with a new thread in the same process, instead of a new
    thread of a new process.  Like FORKED and VFORKED, THREAD_CLONED
    waitstatuses have a child_ptid property, and the child is held stopped
    until GDB explicitly resumes it.  This addresses the in-line stepping
    case (issues #1 and #2).
    
    The infrun code that handles displaced stepping fixup for the child
    after a fork/vfork event is thus reused for THREAD_CLONED, with some
    minimal conditions added, addressing the displaced stepping case
    (issue #3).
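
The shape of the new event can be pictured with a small model.  This is a sketch under assumptions: the enum, struct, and function names below are simplified stand-ins invented for illustration, not GDB's real waitstatus types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the event kinds discussed above.  */
enum waitkind
{
  WAITKIND_STOPPED,
  WAITKIND_FORKED,
  WAITKIND_VFORKED,
  WAITKIND_THREAD_CLONED
};

/* Simplified thread identifier: process id plus thread (LWP) id.  */
struct ptid
{
  int pid;
  long lwp;
};

struct waitstatus
{
  enum waitkind kind;
  struct ptid child;	/* Meaningful only for the three "child" kinds.  */
};

/* True if the event reports a new child that is held stopped.  */
static bool
has_child (const struct waitstatus *ws)
{
  assert (ws != NULL);
  return (ws->kind == WAITKIND_FORKED
	  || ws->kind == WAITKIND_VFORKED
	  || ws->kind == WAITKIND_THREAD_CLONED);
}

/* True if the reported child is a new thread of the parent's process,
   which is what distinguishes a clone event from fork/vfork here.  */
static bool
child_in_same_process (const struct ptid *parent,
		       const struct waitstatus *ws)
{
  return has_child (ws) && ws->child.pid == parent->pid;
}
```

Because the event is reported for the parent and carries the child's id, code that already handles the fork/vfork child can be handed the clone child in the same way.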
    
    The native Linux backend is adjusted to unconditionally report
    TARGET_WAITKIND_THREAD_CLONED events to the core.
    
    Following the follow_fork model in core GDB, we introduce a
    target_follow_clone target method, which is responsible for making the
    new clone child visible to the rest of GDB.
    
    Subsequent patches will add clone events support to the remote
    protocol and gdbserver.
    
    displaced_step_in_progress_thread becomes unused with this patch, but
    a new use will reappear later in the series.  To avoid deleting it and
    re-adding it, this patch marks it with attribute unused, and the
    later patch removes the attribute again.  We need to do this because
    the function is static, and with no callers, the compiler would warn,
    (error with -Werror), breaking the build.
    
    This adds a new gdb.threads/stepi-over-clone.exp testcase, which
    exercises stepping over a clone syscall, with displaced stepping vs
    inline stepping, and all-stop vs non-stop.  We already test stepping
    over clone syscalls with gdb.base/step-over-syscall.exp, but this test
    uses pthreads, while the other test uses raw clone, and this one is
    more thorough.  The testcase passes on native GNU/Linux, but fails
    against GDBserver.  GDBserver will be fixed by a later patch in the
    series.
    
    Co-authored-by: Andrew Burgess <aburgess@redhat.com>
    Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=19675
    Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=27830
    Change-Id: I95c06024736384ae8542a67ed9fdf6534c325c8e
    Reviewed-By: Andrew Burgess <aburgess@redhat.com>
Comment 5 Sourceware Commits 2023-11-13 14:25:33 UTC
The master branch has been updated by Pedro Alves <palves@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=53de5394f7bf11995b1d9cb6885a8490b2ebc9da

commit 53de5394f7bf11995b1d9cb6885a8490b2ebc9da
Author: Pedro Alves <pedro@palves.net>
Date:   Tue Nov 23 20:35:12 2021 +0000

    Support clone events in the remote protocol
    
    The previous patch taught GDB about a new
    TARGET_WAITKIND_THREAD_CLONED event kind, and made the Linux target
    report clone events.
    
    A following patch will teach Linux GDBserver to do the same thing.
    
    But before we get there, we need to teach the remote protocol about
    TARGET_WAITKIND_THREAD_CLONED.  That's what this patch does.  Clone is
    very similar to vfork and fork, and the new stop reply is likewise
    handled similarly.  The stub reports "T05clone:...".
    
    GDBserver core is taught to handle TARGET_WAITKIND_THREAD_CLONED and
    forward it to GDB in this patch, but no backend actually emits it yet.
    That will be done in a following patch.
    
    Documentation for this new remote protocol feature is included in a
    documentation patch later in the series.
    
    Reviewed-By: Andrew Burgess <aburgess@redhat.com>
    Change-Id: If271f20320d864f074d8ac0d531cc1a323da847f
    Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=19675
    Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=27830
Comment 6 Sourceware Commits 2023-11-13 14:25:44 UTC
The master branch has been updated by Pedro Alves <palves@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=65c459abebf70bd5a64dcee11d4d7d4a8498465f

commit 65c459abebf70bd5a64dcee11d4d7d4a8498465f
Author: Pedro Alves <pedro@palves.net>
Date:   Tue Nov 23 20:35:12 2021 +0000

    Thread options & clone events (core + remote)
    
    A previous patch taught GDB about a new TARGET_WAITKIND_THREAD_CLONED
    event kind, and made the Linux target report clone events.
    
    A following patch will teach Linux GDBserver to do the same thing.
    
    However, for remote debugging, it wouldn't be ideal for GDBserver to
    report every clone event to GDB, when GDB only cares about such events
    in some specific situations.  Reporting clone events all the time
    would be potentially chatty.  We don't enable thread create/exit
    events all the time for the same reason.  Instead we have the
    QThreadEvents packet.  QThreadEvents is target-wide, though.
    
    This patch makes GDB instead explicitly request that the target
    reports clone events or not, on a per-thread basis.
    
    In order to be able to do that with GDBserver, we need a new remote
    protocol feature.  Since a following patch will want to enable thread
    exit events on per-thread basis too, the packet introduced here is
    more generic than just for clone events.  It lets you enable/disable a
    set of options at once, modelled on Linux ptrace's PTRACE_SETOPTIONS.
    
    IOW, this commit introduces a new QThreadOptions packet, that lets you
    specify a set of per-thread event options you want to enable.  The
    packet accepts a list of options/thread-id pairs, similarly to vCont,
    processed left to right, with the options field being a number
    interpreted as a bit mask of options.  The only option defined in this
    commit is GDB_THREAD_OPTION_CLONE (0x1), which asks the remote target
    to report clone events.  Another patch later in the series will
    introduce another option.
    
    For example, this packet sets option "1" (clone events) on thread
    p1000.2345:
    
      QThreadOptions;1:p1000.2345
    
    and this clears options for all threads of process 1000, and then sets
    option "1" (clone events) on thread p1000.2345:
    
      QThreadOptions;0:p1000.-1;1:p1000.2345
    
    This clears options of all threads of all processes:
    
      QThreadOptions;0
    
    The target reports the set of supported options by including
    "QThreadOptions=<supported options>" in its qSupported response.
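
For illustration, a packet of the first form above could be assembled as in the sketch below.  The helper name append_pair is hypothetical, and printing the options, process id, and thread id in hexadecimal is an assumption based on how the remote protocol usually writes numbers.  The wildcard (-1) and bare-options forms are not covered.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define GDB_THREAD_OPTION_CLONE 0x1u

/* Append one ";OPTIONS:pPID.TID" pair to the NUL-terminated packet in
   PKT, which has room for SIZE bytes.  Returns 0 on success, or -1 if
   the pair does not fit.  Hypothetical helper, for illustration.  */
static int
append_pair (char *pkt, size_t size, unsigned options,
	     unsigned pid, unsigned tid)
{
  size_t used = strlen (pkt);
  int n;

  assert (used < size);
  n = snprintf (pkt + used, size - used, ";%x:p%x.%x", options, pid, tid);
  return (n >= 0 && (size_t) n < size - used) ? 0 : -1;
}
```

Starting from a buffer holding "QThreadOptions" and appending GDB_THREAD_OPTION_CLONE for process 0x1000, thread 0x2345 yields the first example packet shown above.  Further pairs can be appended the same way, and the target processes them left to right.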
    
    infrun is then tweaked to enable GDB_THREAD_OPTION_CLONE when stepping
    over a breakpoint.
    
    Unlike PTRACE_SETOPTIONS, fork/vfork/clone children do NOT inherit
    their parent's thread options.  This is so that GDB can send e.g.,
    "QThreadOptions;0;1:TID" without worrying about threads it doesn't
    know about yet.
    
    Documentation for this new remote protocol feature is included in a
    documentation patch later in the series.
    
    Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=19675
    Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=27830
    Reviewed-By: Andrew Burgess <aburgess@redhat.com>
    Change-Id: Ie41e5093b2573f14cf6ac41b0b5804eba75be37e
Comment 7 Sourceware Commits 2023-11-13 14:25:49 UTC
The master branch has been updated by Pedro Alves <palves@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=25b16bc9e791d53028c3c180125a80f345b97d94

commit 25b16bc9e791d53028c3c180125a80f345b97d94
Author: Pedro Alves <pedro@palves.net>
Date:   Tue Nov 23 20:35:12 2021 +0000

    Thread options & clone events (native Linux)
    
    This commit teaches the native Linux target about the
    GDB_THREAD_OPTION_CLONE thread option.  It's actually simpler to just
    continue reporting all clone events unconditionally to the core.
    There's never any harm in reporting a clone event when the option is
    disabled.  All we need to do is to report support for the option,
    otherwise GDB falls back to use target_thread_events().
    
    Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=19675
    Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=27830
    Reviewed-By: Andrew Burgess <aburgess@redhat.com>
    Change-Id: If90316e2dcd0c61d0fefa0d463c046011698acf9
Comment 8 Sourceware Commits 2023-11-13 14:25:54 UTC
The master branch has been updated by Pedro Alves <palves@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=393a6b5947d037a55ce1b57474e1ffb3074f544e

commit 393a6b5947d037a55ce1b57474e1ffb3074f544e
Author: Pedro Alves <pedro@palves.net>
Date:   Tue Nov 23 20:35:12 2021 +0000

    Thread options & clone events (Linux GDBserver)
    
    This patch teaches the Linux GDBserver backend to report clone events
    to GDB, when GDB has requested them with the GDB_THREAD_OPTION_CLONE
    thread option, via the new QThreadOptions packet.
    
    This shuffles code in linux_process_target::handle_extended_wait
    around to a more logical order when we now have to handle and
    potentially report all of fork/vfork/clone.
    
    Rename lwp_info::fork_relative -> lwp_info::relative as the field is
    no longer only about (v)fork.
    
    With this, gdb.threads/stepi-over-clone.exp now cleanly passes against
    GDBserver, so remove the native-target-only requirement from that
    testcase.
    
    Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=19675
    Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=27830
    Reviewed-By: Andrew Burgess <aburgess@redhat.com>
    Change-Id: I3a19bc98801ec31e5c6fdbe1ebe17df855142bb2
Comment 9 Sourceware Commits 2023-11-13 14:26:04 UTC
The master branch has been updated by Pedro Alves <palves@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=6bd50ebd29883ac003fc936a5730ca55364f34e7

commit 6bd50ebd29883ac003fc936a5730ca55364f34e7
Author: Pedro Alves <pedro@palves.net>
Date:   Mon Jun 13 17:51:00 2022 +0100

    Remove gdb/19675 kfails (displaced stepping + clone)
    
    Now that gdb/19675 is fixed for both native and gdbserver GNU/Linux,
    remove the gdb/19675 kfails.
    
    Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=19675
    Reviewed-By: Andrew Burgess <aburgess@redhat.com>
    Change-Id: I95c1c38ca370100675d303cd3c8995860bef465d
Comment 10 Pedro Alves 2023-11-13 15:01:33 UTC
Fixed in master.