[PATCH] gdbserver/linux: set lwp !stopped when failing to resume

Tue Jan 18 19:33:57 GMT 2022

I see some failures, at least in gdb.multi/multi-re-run.exp and
gdb.threads/interrupted-hand-call.exp.  Running `stress -C $(nproc)` at
the same time as the test makes those tests relatively frequent.

Let's take gdb.multi/multi-re-run.exp as an example.  The failure looks
like this, an unexpected "no resumed":

    continue
    Continuing.
    No unwaited-for children left.
    (gdb) FAIL: gdb.multi/multi-re-run.exp: re_run_inf=2: iter=1: continue until exit

The situation is:

 - Inferior 1 is stopped somewhere, it won't really play a role here.
 - Inferior 2 has 2 threads, both stopped.
 - We resume inferior 2, the leader thread is expected to exit, making
   the process exit.

>From GDB's perspective, a failing run looks like this:

    [infrun] fetch_inferior_event: enter
      [infrun] scoped_disable_commit_resumed: reason=handling event
      [infrun] do_target_wait: Found 2 inferiors, starting at #1
      [infrun] random_pending_event_thread: None found.
      [remote] wait: enter
        [remote] Packet received: T0506:20dcffffff7f0000;07:20dcffffff7f0000;10:9551555555550000;thread:pae4cd.ae4cd;core:e;
      [remote] wait: exit
      [infrun] print_target_wait_results: target_wait (-1.0.0 [process -1], status) =
      [infrun] print_target_wait_results:   713933.713933.0 [Thread 713933.713933],
      [infrun] print_target_wait_results:   status->kind = STOPPED, sig = GDB_SIGNAL_TRAP
      [infrun] handle_inferior_event: status->kind = STOPPED, sig = GDB_SIGNAL_TRAP
      [infrun] clear_step_over_info: clearing step over info
      [infrun] context_switch: Switching context from 0.0.0 to 713933.713933.0
      [infrun] handle_signal_stop: stop_pc=0x555555555195
      [infrun] start_step_over: enter
        [infrun] start_step_over: stealing global queue of threads to step, length = 0
        [infrun] operator(): step-over queue now empty
      [infrun] start_step_over: exit
      [infrun] process_event_stop_test: no stepping, continue
      [remote] Sending packet: $Z0,555555555194,1#8e
      [remote] Packet received: OK
      [infrun] resume_1: step=0, signal=GDB_SIGNAL_0, trap_expected=0, current thread [713933.713933.0] at 0x555555555195
      [remote] Sending packet: $QPassSignals:e;10;14;17;1a;1b;1c;21;24;25;2c;4c;97;#0a
      [remote] Packet received: OK
      [remote] Sending packet: $vCont;c:pae4cd.-1#9f
      [infrun] prepare_to_wait: prepare_to_wait
      [infrun] reset: reason=handling event
      [infrun] maybe_set_commit_resumed_all_targets: enabling commit-resumed for target extended-remote
      [infrun] maybe_call_commit_resumed_all_targets: calling commit_resumed for target extended-remote
      [infrun] maybe_call_commit_resumed_all_targets: calling commit_resumed for target extended-remote
    [infrun] fetch_inferior_event: exit
    [infrun] fetch_inferior_event: enter
      [infrun] scoped_disable_commit_resumed: reason=handling event
      [infrun] do_target_wait: Found 2 inferiors, starting at #0
      [infrun] random_pending_event_thread: None found.
      [remote] wait: enter
        [remote] Packet received: N
      [remote] wait: exit
      [infrun] print_target_wait_results: target_wait (-1.0.0 [process -1], status) =
      [infrun] print_target_wait_results:   -1.0.0 [process -1],
      [infrun] print_target_wait_results:   status->kind = NO_RESUMED
      [infrun] handle_inferior_event: status->kind = NO_RESUMED
      [remote] Sending packet: $Hgp0.0#ad
      [remote] Packet received: OK
      [remote] Sending packet: $qXfer:threads:read::0,1000#92
      [remote] Packet received: l<threads>\n<thread id="pae4cb.ae4cb" core="3" name="multi-re-run-1" handle="40c7c6f7ff7f0000"/>\n<thread id="pae4cb.ae4cc" core="2" name="multi-re-run-1" handle="40b6c6f7ff7f0000"/>\n<thread id="pae4cd.ae4ce" core="1" name="multi-re-run-2" handle="40b6c6f7ff7f0000"/>\n</threads>\n
      [infrun] stop_waiting: stop_waiting
      [remote] Sending packet: $qXfer:threads:read::0,1000#92
      [remote] Packet received: l<threads>\n<thread id="pae4cb.ae4cb" core="3" name="multi-re-run-1" handle="40c7c6f7ff7f0000"/>\n<thread id="pae4cb.ae4cc" core="2" name="multi-re-run-1" handle="40b6c6f7ff7f0000"/>\n<thread id="pae4cd.ae4ce" core="1" name="multi-re-run-2" handle="40b6c6f7ff7f0000"/>\n</threads>\n
      [infrun] infrun_async: enable=0
      [infrun] reset: reason=handling event
      [infrun] maybe_set_commit_resumed_all_targets: enabling commit-resumed for target extended-remote
      [infrun] maybe_call_commit_resumed_all_targets: calling commit_resumed for target extended-remote
      [infrun] maybe_call_commit_resumed_all_targets: calling commit_resumed for target extended-remote
    [infrun] fetch_inferior_event: exit

We can see that we resume the inferior with vCont;c, but got NO_RESUMED.
When the test passes, we get an EXITED status to indicate the process
has exited.

>From GDBserver's point of view, it looks like this.  The logs contain
some logging I added and that are part of this patch.

    [remote] getpkt: getpkt ("vCont;c:pae4cf.-1");  [no ack sent]
    [threads] resume: enter
      [threads] thread_needs_step_over: Need step over [LWP 713931]? Ignoring, should remain stopped
      [threads] thread_needs_step_over: Need step over [LWP 713932]? Ignoring, should remain stopped
      [threads] get_pc: pc is 0x555555555195
      [threads] thread_needs_step_over: Need step over [LWP 713935]? No, no breakpoint found at 0x555555555195
      [threads] get_pc: pc is 0x7ffff7d35a95
      [threads] thread_needs_step_over: Need step over [LWP 713936]? No, no breakpoint found at 0x7ffff7d35a95
      [threads] resume: Resuming, no pending status or step over needed
      [threads] resume_one_thread: resuming LWP 713935
      [threads] proceed_one_lwp: lwp 713935
      [threads] resume_one_lwp_throw:   continue from pc 0x555555555195
      [threads] resume_one_lwp_throw: Resuming lwp 713935 (continue, signal 0, stop not expected)
      [threads] resume_one_lwp_throw: NOW ptid=713935.713935.0 stopped=0 resumed=0
      [threads] resume_one_thread: resuming LWP 713936
      [threads] proceed_one_lwp: lwp 713936
      [threads] resume_one_lwp_throw:   continue from pc 0x7ffff7d35a95
      [threads] resume_one_lwp_throw: Resuming lwp 713936 (continue, signal 0, stop not expected)
      [threads] resume_one_lwp_throw: ptrace errno = 3 (No such process)
    [threads] resume: exit
    [threads] wait_1: enter
      [threads] wait_1: [<all threads>]
      [threads] wait_for_event_filtered: waitpid(-1, ...) returned 0, ERRNO-OK
      [threads] resume_stopped_resumed_lwps: resuming stopped-resumed LWP LWP 713935.713936 at 7ffff7d35a95: step=0
      [threads] resume_one_lwp_throw:   continue from pc 0x7ffff7d35a95
      [threads] resume_one_lwp_throw: Resuming lwp 713936 (continue, signal 0, stop not expected)
      [threads] resume_one_lwp_throw: ptrace errno = 3 (No such process)
      [threads] operator(): check_zombie_leaders: leader_pid=713931, leader_lp!=NULL=1, num_lwps=2, zombie=0
      [threads] operator(): check_zombie_leaders: leader_pid=713935, leader_lp!=NULL=1, num_lwps=2, zombie=1
      [threads] operator(): Thread group leader 713935 zombie (it exited, or another thread execd).
      [threads] delete_lwp: deleting 713935
      [threads] wait_for_event_filtered: exit (no unwaited-for LWP)
    sigchld_handler
      [threads] wait_1: ret = null_ptid, TARGET_WAITKIND_NO_RESUMED
    [threads] wait_1: exit

What happens is:

 - We resume the leader (713935) successfully.
 - The leader exits.
 - We resume the secondary thread (713936), we get ESRCH.  This is
   expected this the leader has exited.
 - resume_one_lwp_throw throws, it's caught by resume_one_lwp.
 - resume_one_lwp checks with check_ptrace_stopped_lwp_gone that the
   failure can be explained by the LWP becoming zombie, and swallows the
   error.
 - Note that this means that the secondary lwp still has stopped==1.
 - wait_1 is called, probably because linux_process_target::resume marks
   the async pipe at the end.
 - The exit event isn't ready yet, probably because the machine is under
   load, so waitpid returns nothing.
 - check_zombie_leaders detects that the leader is zombie and deletes
 - We try to find a resumed (non-stopped) LWP to get an event from,
   there's none since the leader (that was resumed) is now deleted, and
   the secondary thread is still marked stopped.
   wait_for_event_filtered returns -1, causing wait_1 to return
   NO_RESUMED.

What I notice here is that there is some kind of race between the
availability of the process' exit notification and the call to wait_1
that results from marking the async pipe at the end of resume.

I think what we want from this wait_1 invocation is to keep waiting, as
we will eventually get thread exit notifications for both of our
threads.

The fix I came up with is to mark the secondary thread as !stopped (or
resumed) when we fail to resume it.  This makes wait_1 see that there is
at least one resume lwp, so it won't return NO_RESUMED.  I think this
makes sense to consider it resumed, because we are going to receive an
exit event for it.  Here's the GDBserver logs with the fix applied:

    [threads] resume: enter
      [threads] thread_needs_step_over: Need step over [LWP 724595]? Ignoring, should remain stopped
      [threads] thread_needs_step_over: Need step over [LWP 724596]? Ignoring, should remain stopped
      [threads] get_pc: pc is 0x555555555195
      [threads] thread_needs_step_over: Need step over [LWP 724597]? No, no breakpoint found at 0x555555555195
      [threads] get_pc: pc is 0x7ffff7d35a95
      [threads] thread_needs_step_over: Need step over [LWP 724598]? No, no breakpoint found at 0x7ffff7d35a95
      [threads] resume: Resuming, no pending status or step over needed
      [threads] resume_one_thread: resuming LWP 724597
      [threads] proceed_one_lwp: lwp 724597
      [threads] resume_one_lwp_throw:   continue from pc 0x555555555195
      [threads] resume_one_lwp_throw: Resuming lwp 724597 (continue, signal 0, stop not expected)
      [threads] resume_one_lwp_throw: NOW ptid=724597.724597.0 stopped=0 resumed=0
      [threads] resume_one_thread: resuming LWP 724598
      [threads] proceed_one_lwp: lwp 724598
      [threads] resume_one_lwp_throw:   continue from pc 0x7ffff7d35a95
      [threads] resume_one_lwp_throw: Resuming lwp 724598 (continue, signal 0, stop not expected)
      [threads] resume_one_lwp_throw: ptrace errno = 3 (No such process)
    [threads] resume: exit
    [threads] wait_1: enter
      [threads] wait_1: [<all threads>]
    sigchld_handler
      [threads] wait_for_event_filtered: waitpid(-1, ...) returned 0, ERRNO-OK
      [threads] operator(): check_zombie_leaders: leader_pid=724595, leader_lp!=NULL=1, num_lwps=2, zombie=0
      [threads] operator(): check_zombie_leaders: leader_pid=724597, leader_lp!=NULL=1, num_lwps=2, zombie=1
      [threads] operator(): Thread group leader 724597 zombie (it exited, or another thread execd).
      [threads] delete_lwp: deleting 724597
      [threads] wait_for_event_filtered: sigsuspend'ing
    sigchld_handler
      [threads] wait_for_event_filtered: waitpid(-1, ...) returned 724598, ERRNO-OK
      [threads] wait_for_event_filtered: waitpid 724598 received 0 (exited)
      [threads] filter_event: 724598 exited
      [threads] wait_for_event_filtered: waitpid(-1, ...) returned 724597, ERRNO-OK
      [threads] wait_for_event_filtered: waitpid 724597 received 0 (exited)
      [threads] wait_for_event_filtered: waitpid(-1, ...) returned 0, ERRNO-OK
    sigchld_handler
      [threads] wait_1: ret = LWP 724597.724598, exited with retcode 0
    [threads] wait_1: exit

Change-Id: Idf0bdb4cb0313f1b49e4864071650cc83fb3c100
---
 gdbserver/linux-low.cc | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/gdbserver/linux-low.cc b/gdbserver/linux-low.cc
index 9e571a4d771f..9fbdcaa7a96d 100644
--- a/gdbserver/linux-low.cc
+++ b/gdbserver/linux-low.cc
@@ -4057,7 +4057,13 @@ linux_process_target::resume_one_lwp_throw (lwp_info *lwp, int step,
 	  (PTRACE_TYPE_ARG4) (uintptr_t) signal);
 
   if (errno)
-    perror_with_name ("resuming thread");
+    {
+      int saved_errno = errno;
+
+      threads_debug_printf ("ptrace errno = %d (%s)",
+			    saved_errno, strerror (saved_errno));
+      perror_with_name ("resuming thread");
+    }
 
   /* Successfully resumed.  Clear state that no longer makes sense,
      and mark the LWP as running.  Must not do this before resuming
@@ -4118,7 +4124,15 @@ linux_process_target::resume_one_lwp (lwp_info *lwp, int step, int signal,
     }
   catch (const gdb_exception_error &ex)
     {
-      if (!check_ptrace_stopped_lwp_gone (lwp))
+      if (check_ptrace_stopped_lwp_gone (lwp))
+	{
+	  /* This could because we tried to resume an LWP after its leader
+	     exited.  Mark it as resumed, so we can collect an exit event
+	     from it.  */
+	  lwp->stopped = 0;
+	  lwp->stop_reason = TARGET_STOPPED_BY_NO_REASON;
+	}
+      else
 	throw;
     }
 }

base-commit: 91f94053dd72ea88a4d282f94836bc4c0d6c0785
-- 
2.34.1