[PATCH] Don't stop all threads prematurely after first step of "step N"

Tue Jul 19 19:33:09 GMT 2022

On 2022-07-18 14:54, Pedro Alves wrote:
> In all-stop mode, when the target is itself in non-stop mode (like
> GNU/Linux), if you use "step N" (or "stepi/next/nexti N") to step
> a thread a number of times:
> 
>  (gdb) help step
>  step, s
>  Step program until it reaches a different source line.
>  Usage: step [N]
>  Argument N means step N times (or till program stops for another reason).
> 
> ... GDB prematurely stops all threads after the first step, and
> doesn't re-resume them for the subsequent N-1 steps.  It's as if for
> the 2nd and subsequent steps, the command was running with
> scheduler-locking enabled.
> 
> This can be observed with the testcase added by this commit, which
> looks like this:
> 
>  static pthread_barrier_t barrier;
> 
>  static void *
>  thread_func (void *arg)
>  {
>    pthread_barrier_wait (&barrier);
>    return NULL;
>  }
> 
>  int
>  main ()
>  {
>    pthread_t thread;
>    int ret;
> 
>    pthread_barrier_init (&barrier, NULL, 2);
> 
>    /* We run to this line below, and then issue "next 3".  That should
>       step over the 3 lines below and land on the return statement.  If
>       GDB prematurely stops the thread_func thread after the first of
>       the 3 nexts (and never resumes it again), then the join won't
>       ever return.  */
>    pthread_create (&thread, NULL, thread_func, NULL); /* set break here */
>    pthread_barrier_wait (&barrier);
>    pthread_join (thread, NULL);
> 
>    return 0;
>  }
> 
> The test hangs and times out without the GDB fix:
> 
>  (gdb) next 3
>  [New Thread 0x7ffff7d89700 (LWP 525772)]
>  FAIL: gdb.threads/step-N-all-progress.exp: non-stop=off: target-non-stop=on: next 3 (timeout)
> 
> The problem is a core GDB bug.
> 
> When you do "step/stepi/next/nexti N", GDB internally creates a
> thread_fsm object and associates it with the stepping thread.  For the
> stepping commands, the FSM's class is step_command_fsm.  That object
> is what keeps track of how many steps are left to make.  When one step
> finishes, handle_inferior_event calls stop_waiting and returns, and
> then fetch_inferior_event calls the "should_stop" method of the event
> thread's FSM.  The implementation of that method decrements the
> steps-left counter.  If the counter is 0, it returns true and we
> proceed to presenting the stop to the user.  If it isn't 0 yet, then
> the method returns false, indicating to fetch_inferior_event to "keep
> going".
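
For readers following along, the counter logic described above can be sketched roughly as below.  This is only an illustration in plain C, not GDB's actual code: the names step_fsm_sketch and step_fsm_sketch_should_stop are hypothetical stand-ins for step_command_fsm and its should_stop method.

```c
#include <stdbool.h>

/* Hypothetical stand-in for GDB's step_command_fsm.  It only tracks
   how many steps are left to make.  */
struct step_fsm_sketch
{
  int steps_left;
};

/* Called once each time a step finishes.  Decrements the steps-left
   counter and returns true when the stop should be presented to the
   user, or false when the caller should keep going.  */
bool
step_fsm_sketch_should_stop (struct step_fsm_sketch *fsm)
{
  if (fsm->steps_left > 0)
    fsm->steps_left--;
  return fsm->steps_left == 0;
}
```

With steps_left initialized to 3 (as for "next 3"), the first two calls return false and the third returns true.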
> 
> Focusing now on when the first step finishes -- we're in "all-stop"
> mode, with the target in non-stop mode.  When a step finishes,
> handle_inferior_event calls stop_waiting, which itself calls
> stop_all_threads to stop everything.  I.e., after the first step
> completes, all threads are stopped, before handle_inferior_event
> returns.  And after that, now in fetch_inferior_event, we consult the
> thread's thread_fsm::should_stop, which as we've seen, for the first
> step returns false -- i.e., we need to keep_going for another step.
> However, since the target is in non-stop mode, keep_going resumes
> _only_ the current thread.  All the other threads remain stopped,
> inadvertently.
> 
> If the target is in non-stop mode, we don't actually need to stop all
> threads right after each step finishes, and then re-resume them
> again.  We can instead defer stopping all threads until all the steps
> are completed.
> 
> So fix this by delaying the stopping of all threads until after we
> have called the FSM's "should_stop" method.  I.e., move it from
> stop_waiting to handle_inferior_event's callers,
> fetch_inferior_event and wait_for_inferior.
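
The effect of the reordering can be seen in a toy model like the one below.  It is a sketch under simplifying assumptions, not the patch itself: there are just two threads (thread 0 steps, thread 1 is the other thread), the target is assumed to be in non-stop mode so that "keep going" resumes only the stepping thread, and every name in it is hypothetical.

```c
#include <stdbool.h>

#define NUM_THREADS 2	/* Thread 0 steps; thread 1 is the other thread.  */

struct model
{
  bool running[NUM_THREADS];
  int steps_left;
};

static void
model_stop_all (struct model *m)
{
  for (int i = 0; i < NUM_THREADS; i++)
    m->running[i] = false;
}

/* Simulate "step N" and return the number of steps during which the
   other thread was running.  If STOP_BEFORE_SHOULD_STOP is true, all
   threads are stopped before the steps-left counter is consulted (the
   old ordering); otherwise only once the counter reaches 0.  */
int
other_thread_running_steps (int n, bool stop_before_should_stop)
{
  if (n < 1)
    n = 1;			/* The sketch assumes at least one step.  */

  struct model m = { { true, true }, n };
  int count = 0;

  for (;;)
    {
      /* A step is in progress; note whether the other thread runs.  */
      if (m.running[1])
	count++;

      /* The step finishes and the stepping thread reports a stop.  */
      m.running[0] = false;

      if (stop_before_should_stop)
	model_stop_all (&m);	/* Old ordering.  */

      if (--m.steps_left == 0)
	{
	  model_stop_all (&m);	/* New ordering: stop only when done.  */
	  break;
	}

      /* Keep going: resume only the stepping thread.  */
      m.running[0] = true;
    }

  return count;
}
```

In this model, other_thread_running_steps (3, true) yields 1, i.e. the other thread only runs during the first step, which matches the hang in the test.  With the second argument false it yields 3.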
> 
> New test included.  Tested on x86-64 GNU/Linux native and gdbserver.
> 
> Change-Id: Iaad50dcfea4464c84bdbac853a89df92ade6ae01

LGTM.

Simon