[PATCH] gdb: Don't drop SIGSTOP during stop_all_threads

Mon Jun 11 16:41:00 GMT 2018

On 05/18/2018 11:56 AM, Andrew Burgess wrote:
> This patch fixes an issues where GDB would sometimes hang when attaching
> to a multi-threaded process.  This issues was especially likely to
> trigger if the machine (running the inferior) was under load.

"issues" -> "issue" two times.

> The target_wait call ends up in linux_nat_wait_1, which first checks to
> see if any threads already have stashed stop events to report, and if
> there are non then we enter a loop fetching as many events as possible

"non" -> "none"

> out of the kernel.  This event fetching is non-blocking, and we give up
> once the kernel has no more events ready to give us.
> 
> All of the events from the kernel are passed through

> What we really want to do though is simulate the kernel being slow to
> report events through waitpid during the initial attach.  The solution I
> came up with was to write an LD_PRELOAD library that intercepts (some)
> waitpid calls and rate limits them to one per-second.  Any more than
> that simply return 0 indicating there's no event available.  Obviously
> this can only be applied to waitpid calls that have the WNOHANG flag
> set.
> 
> Unfortunately, this library does break the "rest" of GDB, I suspect that
> the issue is that in some places GDB knows that there's an event
> pending, calls non-blocking waitpid and when the event fails to arrive
> GDB moves into some unexpected/broken state from which it can't recover.
> Still, the preload library does, at the moment, trigger the bug during
> attach.

The fix sounds good to me.

Thanks so much for adding a testcase.  That adds a lot of value.

Some more comments below.

> 
> diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
> index 445b59fa4ad..d1d0ba66d2e 100644
> --- a/gdb/linux-nat.c
> +++ b/gdb/linux-nat.c
> @@ -2527,17 +2527,23 @@ stop_wait_callback (struct lwp_info *lp, void *data)
>  	}
>        else
>  	{
> -	  /* We caught the SIGSTOP that we intended to catch, so
> -	     there's no SIGSTOP pending.  */
> +	  /* We caught the SIGSTOP that we intended to catch.  */
>  
>  	  if (debug_linux_nat)
>  	    fprintf_unfiltered (gdb_stdlog,
>  				"SWC: Expected SIGSTOP caught for %s.\n",
>  				target_pid_to_str (lp->ptid));
>  
> -	  /* Reset SIGNALLED only after the stop_wait_callback call
> -	     above as it does gdb_assert on SIGNALLED.  */
>  	  lp->signalled = 0;
> +
> +	  /* If we are waiting for this stop so we can report the thread
> +	     stopped then we need to record that this status.  Otherwise,

Looks like on "that" too many in "that this"?

> +	     we can now discard this stop event.  */
> +	  if (lp->last_resume_kind == resume_stop)
> +	    {
> +	      lp->status = status;
> +	      save_stop_reason (lp);
> +	    }
>  	}
>      }

> +# This test script tries to expose a bug in some of the use of waitpid

"some of the uses" ?

> +# in the linux native support within GDB.  The problem was spotted on

"linux" -> "Linux".

> +# systems which were heavily loaded when attaching to threaded test
> +# programs.  What happened was that during the initial attach, the
> +# loop of waitpid calls that normally received the stop events from
> +# each of the threads in the inferior was not receiving a stop event
> +# for some threads (the kernel just hadn't sent the stop event yet).
> +#
> +# GDB would then trigger a call to stop_all_threads which would
> +# continue to wait for all of the outstanding threads to stop, when
> +# the outstanding stop events finally arrived GDB would then
> +# (incorrectly) discard the stop event, resume the thread, and
> +# continue to wait for the thread to stop.... which it now never
> +# would.
> +#
> +# In order to try and expose this issue reliably, this test preloads a
> +# library that intercepts waitpid calls.  And waitpid calls targeting
> +# pid -1, and with the WNOHANG flag are rate limited so that only 1
> +# per second can complete, any additional calls are forced to return 0
> +# indicating no event waiting.  

I find this last sentence starting with "And" hard to parse.  Maybe
you meant "All" instead of "And".  I'd also suggest splitting it in
two and dropping the comma after "-1".  Like:

All waitpid calls targeting pid -1 with the WNOHANG flag are rate
limited so that only 1 per second can complete.  Additional calls are
forced to return 0 indicating no event waiting.

> +
> +# Spawn GDB with LIB preloaded with LD_PRELOAD.  CMDLINE_OPTS are
> +# command line options passed to GDB.
> +
> +proc gdb_spawn_with_ld_preload {lib cmdline_opts} {
> +    global env
> +
> +    save_vars { env(LD_PRELOAD) } {
> +	if { ![info exists env(LD_PRELOAD) ]
> +	     || $env(LD_PRELOAD) == "" } {
> +	    set env(LD_PRELOAD) "$lib"
> +	} else {
> +	    append env(LD_PRELOAD) ":$lib"
> +	}
> +
> +	gdb_spawn_with_cmdline_opts $cmdline_opts
> +    }
> +}

The testcase currently fails with --target_board=native-extended-gdbserver:

 (gdb) attach 487
 Don't know how to attach.  Try "help target".
 (gdb) FAIL: gdb.threads/attach-slow-waitpid.exp: attach to target

It'd be really nice to make it work there too, considering that at
some point we may switch to using gdbserver's linux backend even
for native debugging.

The test fails there because that board overrides gdb_start to
automatically start gdbserver and have gdb connect to it, but not
gdb_spawn_with_cmdline_opts.  Can we simply use gdb_start instead?
It seems like no cmdline opts are passed down anyway.

In principle, the LD_PRELOAD will end up injecting the slow-waitpid
lib in gdbserver too.

> +static int
> +should_skip_waitpid (void)
> +{
> +  /* When we last allowed a waitpid to complete.  */
> +  static struct timeval last_waitpid_time = { 0, 0 };
> +
> +  struct timeval *tv = &last_waitpid_time;
> +  if (tv->tv_sec == 0)
> +    {
> +      if (gettimeofday (tv, NULL) < 0)
> +	error ("error: gettimeofday failed\n");
> +      return 0; /* Don't skip.  */
> +    }
> +  else
> +    {
> +      struct timeval new_tv;
> +
> +      if (gettimeofday (&new_tv, NULL) < 0)
> +	error ("error: gettimeofday failed\n");
> +
> +      if ((new_tv.tv_sec - tv->tv_sec) < 1)
> +	return 1; /* Skip.  */
> +
> +      memcpy (tv, &new_tv, sizeof (new_tv));

Write:

      *tv = new_tv;

?

Thanks,
Pedro Alves