[PATCH] gdb: Don't drop SIGSTOP during stop_all_threads

Andrew Burgess andrew.burgess@embecosm.com
Mon Jun 11 12:55:00 GMT 2018


Ping.

Thanks,
Andrewx

* Andrew Burgess <andrew.burgess@embecosm.com> [2018-05-18 11:56:54 +0100]:

> This patch fixes an issues where GDB would sometimes hang when attaching
> to a multi-threaded process.  This issues was especially likely to
> trigger if the machine (running the inferior) was under load.
> 
> In summary, the problem is an imbalance between two functions in
> linux-nat.c, stop_callback and stop_wait_callback.  In stop_callback we
> send SIGSTOP to a thread, but _only_ if the thread is not already
> stopped, and if it is not signalled, which means it should stop soon.
> In stop_wait_callback we wait for the SIGSTOP to arrive, however, we are
> aware that the thread might have been signalled for some other reason,
> and so if a signal other than SIGSTOP causes the thread to stop then we
> stash that signal away so it can be reported back later.  If we get a
> SIGSTOP then this is discarded, after all, this signal was sent from
> stop_callback.  Except that this might not be the case, it could be that
> SIGSTOP was sent to a thread from elsewhere in GDB, in which case we
> would not have sent another SIGSTOP from stop_callback and the SIGSTOP
> received in stop_wait_callback should not be ignored.
> 
> Below I've laid out the exact sequence of events that I saw that lead me
> to track down the above diagnosis.
> 
> After attaching to the inferior GDB sends a SIGSTOP to all of the
> threads and then returns to the event loop waiting for interesting
> things to happen.
> 
> Eventually the first target event is detected (this will be the first
> SIGSTOP arriving) and GDB calls inferior_event_handler which calls
> fetch_inferior_event.  Inside fetch_inferior_event GDB calls
> do_target_wait which calls target_wait to find a thread with an event.
> 
> The target_wait call ends up in linux_nat_wait_1, which first checks to
> see if any threads already have stashed stop events to report, and if
> there are non then we enter a loop fetching as many events as possible
> out of the kernel.  This event fetching is non-blocking, and we give up
> once the kernel has no more events ready to give us.
> 
> All of the events from the kernel are passed through
> linux_nat_filter_event which stashes the wait status for all of the
> threads that reported a SIGSTOP, these will be returned by future calls
> to linux_nat_wait_1.
> 
> Lets assume for a moment that we've attached to a multi-threaded
> inferior, and that all but one thread has reported its stop during the
> initial wait call in linux_nat_wait_1.  The other thread will be
> reporting a SIGSTOP, but the kernel has not yet managed to deliver that
> signal to GDB before GDB gave up waiting and continued handling the
> events it already had.  GDB selects one of the threads that has reported
> a SIGSTOP and passes this thread ID back to fetch_inferior_event.
> 
> To handle the thread's SIGSTOP, GDB calls handle_signal_stop, which
> calls stop_all_threads, this calls wait_one, which in turn calls
> target_wait.
> 
> The first call to target_wait at this point will result in a stashed
> wait status being returned, at which point we call setup_inferior.  The
> call to setup_inferior leads to a call into try_thread_db_load_1 which
> results in a call to linux_stop_and_wait_all_lwps.  This in turn calls
> stop_callback on each thread followed by stop_wait_callback on each
> thread.
> 
> We're now ready to make the mistake.  In stop_callback we see that our
> problem thread is not stopped, but is signalled, so it should stop soon.
> As a result we don't send another SIGSTOP.
> 
> We then enter stop_wait_callback, eventually the problem thread stops
> with SIGSTOP which we _incorrectly_ assume came from stop_callback, and
> we discard.
> 
> Once stop_wait_callback has done its damage we return from
> linux_stop_and_wait_all_lwps, finish in try_thread_db_load_1, and
> eventually unwind back to the call to setup_inferior in
> stop_all_threads.  GDB now loops around, and performs another
> target_wait to get the next event from the inferior.
> 
> The target_wait calls causes us to once again reach linux_nat_wait_1,
> and we pass through some code that calls resume_stopped_resumed_lwps.
> This allows GDB to resume threads that are physically stopped, but which
> GDB doesn't see any good reason for the thread to remain stopped.  In
> our case, the problem thread which had its SIGSTOP discarded is stopped,
> but doesn't have a stashed wait status to report, and so GDB sets the
> thread going again.
> 
> We are now stuck waiting for an event on the problem thread that might
> never arrive.
> 
> When considering how to write a test for this bug I struggled.  The
> issue was only spotted _randomly_ when a machine was heavily loaded with
> many multi-threaded applications, and GDB was being attached (by script)
> to all of these applications in parallel.  In one reproducer I required
> around 5 applications each of 5 threads per machine core in order to
> reproduce the bug 2 out of 3 times.
> 
> What we really want to do though is simulate the kernel being slow to
> report events through waitpid during the initial attach.  The solution I
> came up with was to write an LD_PRELOAD library that intercepts (some)
> waitpid calls and rate limits them to one per-second.  Any more than
> that simply return 0 indicating there's no event available.  Obviously
> this can only be applied to waitpid calls that have the WNOHANG flag
> set.
> 
> Unfortunately, this library does break the "rest" of GDB, I suspect that
> the issue is that in some places GDB knows that there's an event
> pending, calls non-blocking waitpid and when the event fails to arrive
> GDB moves into some unexpected/broken state from which it can't recover.
> Still, the preload library does, at the moment, trigger the bug during
> attach.
> 
> gdb/ChangeLog:
> 
> 	* linux-nat.c (stop_wait_callback): Don't discard SIGSTOP if it
> 	was requested by GDB.
> 
> gdb/testsuite/ChangeLog:
> 
> 	* gdb.threads/attach-slow-waitpid.c: New file.
> 	* gdb.threads/attach-slow-waitpid.exp: New file.
> 	* gdb.threads/slow-waitpid.c: New file.
> ---
>  gdb/ChangeLog                                     |   6 ++
>  gdb/linux-nat.c                                   |  14 ++-
>  gdb/testsuite/ChangeLog                           |   7 ++
>  gdb/testsuite/gdb.threads/attach-slow-waitpid.c   |  77 ++++++++++++++
>  gdb/testsuite/gdb.threads/attach-slow-waitpid.exp |  95 ++++++++++++++++++
>  gdb/testsuite/gdb.threads/slow-waitpid.c          | 117 ++++++++++++++++++++++
>  6 files changed, 312 insertions(+), 4 deletions(-)
>  create mode 100644 gdb/testsuite/gdb.threads/attach-slow-waitpid.c
>  create mode 100644 gdb/testsuite/gdb.threads/attach-slow-waitpid.exp
>  create mode 100644 gdb/testsuite/gdb.threads/slow-waitpid.c
> 
> diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
> index 445b59fa4ad..d1d0ba66d2e 100644
> --- a/gdb/linux-nat.c
> +++ b/gdb/linux-nat.c
> @@ -2527,17 +2527,23 @@ stop_wait_callback (struct lwp_info *lp, void *data)
>  	}
>        else
>  	{
> -	  /* We caught the SIGSTOP that we intended to catch, so
> -	     there's no SIGSTOP pending.  */
> +	  /* We caught the SIGSTOP that we intended to catch.  */
>  
>  	  if (debug_linux_nat)
>  	    fprintf_unfiltered (gdb_stdlog,
>  				"SWC: Expected SIGSTOP caught for %s.\n",
>  				target_pid_to_str (lp->ptid));
>  
> -	  /* Reset SIGNALLED only after the stop_wait_callback call
> -	     above as it does gdb_assert on SIGNALLED.  */
>  	  lp->signalled = 0;
> +
> +	  /* If we are waiting for this stop so we can report the thread
> +	     stopped then we need to record that this status.  Otherwise,
> +	     we can now discard this stop event.  */
> +	  if (lp->last_resume_kind == resume_stop)
> +	    {
> +	      lp->status = status;
> +	      save_stop_reason (lp);
> +	    }
>  	}
>      }
>  
> diff --git a/gdb/testsuite/gdb.threads/attach-slow-waitpid.c b/gdb/testsuite/gdb.threads/attach-slow-waitpid.c
> new file mode 100644
> index 00000000000..06e99ab22d1
> --- /dev/null
> +++ b/gdb/testsuite/gdb.threads/attach-slow-waitpid.c
> @@ -0,0 +1,77 @@
> +/* This testcase is part of GDB, the GNU debugger.
> +
> +   Copyright 2018 Free Software Foundation, Inc.
> +
> +   This program is free software; you can redistribute it and/or modify
> +   it under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3 of the License, or
> +   (at your option) any later version.
> +
> +   This program is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +   GNU General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
> +
> +#include <pthread.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <assert.h>
> +#define NUM_THREADS 4
> +
> +/* Crude spin lock.  Threads all spin until this is set to 0.  */
> +int go = 1;
> +
> +/* Thread function, just spin until GO is set to 0.  */
> +void *
> +perform_work (void *argument)
> +{
> +  /* Cast to volatile to ensure that ARGUMENT is loaded each time around
> +     the loop.  */
> +  while (*((volatile int*) argument))
> +    {
> +      /* Nothing.  */
> +    }
> +  return NULL;
> +}
> +
> +/* The spin loop for the main thread.  */
> +void
> +function (void)
> +{
> +  (void) perform_work (&go);
> +  printf ("Finished from function\n");
> +}
> +
> +/* Main program, create some threads which all spin waiting for GO to be
> +   set to 0.  */
> +int
> +main (void)
> +{
> +  pthread_t threads[NUM_THREADS];
> +  int result_code;
> +  unsigned index;
> +
> +  /* Create some threads.  */
> +  for (index = 0; index < NUM_THREADS; ++index)
> +    {
> +      printf ("In main: creating thread %d\n", index);
> +      result_code = pthread_create (&threads[index], NULL, perform_work, &go);
> +      assert (!result_code);
> +    }
> +
> +  function ();
> +
> +  /* Wait for each thread to complete.  */
> +  for (index = 0; index < NUM_THREADS; ++index)
> +    {
> +      /* Block until thread INDEX completes.  */
> +      result_code = pthread_join (threads[index], NULL);
> +      assert (!result_code);
> +      printf ("In main: thread %d has completed\n", index);
> +    }
> +  printf ("In main: All threads completed successfully\n");
> +  return 0;
> +}
> diff --git a/gdb/testsuite/gdb.threads/attach-slow-waitpid.exp b/gdb/testsuite/gdb.threads/attach-slow-waitpid.exp
> new file mode 100644
> index 00000000000..d4b12727842
> --- /dev/null
> +++ b/gdb/testsuite/gdb.threads/attach-slow-waitpid.exp
> @@ -0,0 +1,95 @@
> +# Copyright 2018 Free Software Foundation, Inc.
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 3 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see <http://www.gnu.org/licenses/>.
> +
> +# This test script tries to expose a bug in some of the use of waitpid
> +# in the linux native support within GDB.  The problem was spotted on
> +# systems which were heavily loaded when attaching to threaded test
> +# programs.  What happened was that during the initial attach, the
> +# loop of waitpid calls that normally received the stop events from
> +# each of the threads in the inferior was not receiving a stop event
> +# for some threads (the kernel just hadn't sent the stop event yet).
> +#
> +# GDB would then trigger a call to stop_all_threads which would
> +# continue to wait for all of the outstanding threads to stop, when
> +# the outstanding stop events finally arrived GDB would then
> +# (incorrectly) discard the stop event, resume the thread, and
> +# continue to wait for the thread to stop.... which it now never
> +# would.
> +#
> +# In order to try and expose this issue reliably, this test preloads a
> +# library that intercepts waitpid calls.  And waitpid calls targeting
> +# pid -1, and with the WNOHANG flag are rate limited so that only 1
> +# per second can complete, any additional calls are forced to return 0
> +# indicating no event waiting.  This is enough to trigger the bug
> +# during the attach phase, however, it breaks the rest of GDB's
> +# behaviour, so we can't do more than just attach with this library in
> +# place.
> +
> +# This test only works on Linux
> +if { ![isnative] || [is_remote host] || [use_gdb_stub]
> +     || ![istarget *-linux*] } {
> +    continue
> +}
> +
> +standard_testfile
> +
> +set libfile slow-waitpid
> +set libsrc "${srcdir}/${subdir}/${libfile}.c"
> +set libobj [standard_output_file ${libfile}.so]
> +
> +# Compile the preload library.  We only get away with this as we limit
> +# this test to running when ISNATIVE is true.
> +if { [gdb_compile_shlib $libsrc $libobj {debug}] != "" } then {
> +    return -1
> +}
> +
> +# Compile the test program
> +if  { [gdb_compile_pthreads "${srcdir}/${subdir}/${srcfile}" "${binfile}" executable {debug}] != "" } {
> +    return -1
> +}
> +
> +# Spawn GDB with LIB preloaded with LD_PRELOAD.  CMDLINE_OPTS are
> +# command line options passed to GDB.
> +
> +proc gdb_spawn_with_ld_preload {lib cmdline_opts} {
> +    global env
> +
> +    save_vars { env(LD_PRELOAD) } {
> +	if { ![info exists env(LD_PRELOAD) ]
> +	     || $env(LD_PRELOAD) == "" } {
> +	    set env(LD_PRELOAD) "$lib"
> +	} else {
> +	    append env(LD_PRELOAD) ":$lib"
> +	}
> +
> +	gdb_spawn_with_cmdline_opts $cmdline_opts
> +    }
> +}
> +
> +# Run test program in the background.
> +set test_spawn_id [spawn_wait_for_attach $binfile]
> +set testpid [spawn_id_get_pid $test_spawn_id]
> +
> +# Start GDB with preload library in place.
> +gdb_spawn_with_ld_preload $libobj ""
> +
> +# Load binary, and attach to running program.
> +gdb_load ${binfile}
> +gdb_test "attach $testpid" "Attaching to program.*" "attach to target"
> +
> +gdb_exit
> +
> +# Kill of test program.
> +kill_wait_spawned_process $test_spawn_id
> diff --git a/gdb/testsuite/gdb.threads/slow-waitpid.c b/gdb/testsuite/gdb.threads/slow-waitpid.c
> new file mode 100644
> index 00000000000..40b5c958899
> --- /dev/null
> +++ b/gdb/testsuite/gdb.threads/slow-waitpid.c
> @@ -0,0 +1,117 @@
> +/* This testcase is part of GDB, the GNU debugger.
> +
> +   Copyright 2018 Free Software Foundation, Inc.
> +
> +   This program is free software; you can redistribute it and/or modify
> +   it under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3 of the License, or
> +   (at your option) any later version.
> +
> +   This program is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +   GNU General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
> +
> +#define _GNU_SOURCE
> +
> +#include <sys/types.h>
> +#include <sys/wait.h>
> +#include <sys/time.h>
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include <dlfcn.h>
> +#include <string.h>
> +#include <stdarg.h>
> +
> +/* Logging.  */
> +
> +static void
> +log_msg (const char *fmt, ...)
> +{
> +#ifdef LOGGING
> +  va_list ap;
> +
> +  va_start (ap, fmt);
> +  vfprintf (stderr, fmt, ap);
> +  va_end (ap);
> +#endif /* LOGGING */
> +}
> +
> +/* Error handling, message and exit.  */
> +
> +static void
> +error (const char *fmt, ...)
> +{
> +  va_list ap;
> +
> +  va_start (ap, fmt);
> +  vfprintf (stderr, fmt, ap);
> +  va_end (ap);
> +
> +  exit (EXIT_FAILURE);
> +}
> +
> +/* Return true (non-zero) if we should skip this call to waitpid, or false
> +   (zero) if this waitpid call should be handled with a call to the "real"
> +   waitpid library.  Allows 1 waitpid call per second.  */
> +
> +static int
> +should_skip_waitpid (void)
> +{
> +  /* When we last allowed a waitpid to complete.  */
> +  static struct timeval last_waitpid_time = { 0, 0 };
> +
> +  struct timeval *tv = &last_waitpid_time;
> +  if (tv->tv_sec == 0)
> +    {
> +      if (gettimeofday (tv, NULL) < 0)
> +	error ("error: gettimeofday failed\n");
> +      return 0; /* Don't skip.  */
> +    }
> +  else
> +    {
> +      struct timeval new_tv;
> +
> +      if (gettimeofday (&new_tv, NULL) < 0)
> +	error ("error: gettimeofday failed\n");
> +
> +      if ((new_tv.tv_sec - tv->tv_sec) < 1)
> +	return 1; /* Skip.  */
> +
> +      memcpy (tv, &new_tv, sizeof (new_tv));
> +    }
> +
> +  /* Don't skip.  */
> +  return 0;
> +}
> +
> +/* The waitpid entry point function.  */
> +
> +pid_t
> +waitpid (pid_t pid, int *wstatus, int options)
> +{
> +  typedef pid_t (*fptr_t) (pid_t, int *, int);
> +  static fptr_t real_func = NULL;
> +
> +  if (real_func == NULL)
> +    {
> +      real_func = dlsym (RTLD_NEXT, "waitpid");
> +      if (real_func == NULL)
> +	error ("error: failed to find real waitpid\n");
> +    }
> +
> +  log_msg ("waitpid (%d, %p, 0x%x)\n", pid, wstatus, options);
> +
> +  if ((options & WNOHANG) != 0
> +      && pid == -1
> +      && should_skip_waitpid ())
> +    {
> +      log_msg ("waitpid: skipping\n");
> +      return 0;
> +    }
> +
> +  return (*real_func) (pid, wstatus, options);
> +}
> -- 
> 2.14.3
> 



More information about the Gdb-patches mailing list