[PATCH] gdb: Don't drop SIGSTOP during stop_all_threads
Andrew Burgess
andrew.burgess@embecosm.com
Mon Jun 11 12:55:00 GMT 2018
Ping.
Thanks,
Andrewx
* Andrew Burgess <andrew.burgess@embecosm.com> [2018-05-18 11:56:54 +0100]:
> This patch fixes an issues where GDB would sometimes hang when attaching
> to a multi-threaded process. This issues was especially likely to
> trigger if the machine (running the inferior) was under load.
>
> In summary, the problem is an imbalance between two functions in
> linux-nat.c, stop_callback and stop_wait_callback. In stop_callback we
> send SIGSTOP to a thread, but _only_ if the thread is not already
> stopped, and if it is not signalled, which means it should stop soon.
> In stop_wait_callback we wait for the SIGSTOP to arrive, however, we are
> aware that the thread might have been signalled for some other reason,
> and so if a signal other than SIGSTOP causes the thread to stop then we
> stash that signal away so it can be reported back later. If we get a
> SIGSTOP then this is discarded, after all, this signal was sent from
> stop_callback. Except that this might not be the case, it could be that
> SIGSTOP was sent to a thread from elsewhere in GDB, in which case we
> would not have sent another SIGSTOP from stop_callback and the SIGSTOP
> received in stop_wait_callback should not be ignored.
>
> Below I've laid out the exact sequence of events that I saw that lead me
> to track down the above diagnosis.
>
> After attaching to the inferior GDB sends a SIGSTOP to all of the
> threads and then returns to the event loop waiting for interesting
> things to happen.
>
> Eventually the first target event is detected (this will be the first
> SIGSTOP arriving) and GDB calls inferior_event_handler which calls
> fetch_inferior_event. Inside fetch_inferior_event GDB calls
> do_target_wait which calls target_wait to find a thread with an event.
>
> The target_wait call ends up in linux_nat_wait_1, which first checks to
> see if any threads already have stashed stop events to report, and if
> there are non then we enter a loop fetching as many events as possible
> out of the kernel. This event fetching is non-blocking, and we give up
> once the kernel has no more events ready to give us.
>
> All of the events from the kernel are passed through
> linux_nat_filter_event which stashes the wait status for all of the
> threads that reported a SIGSTOP, these will be returned by future calls
> to linux_nat_wait_1.
>
> Lets assume for a moment that we've attached to a multi-threaded
> inferior, and that all but one thread has reported its stop during the
> initial wait call in linux_nat_wait_1. The other thread will be
> reporting a SIGSTOP, but the kernel has not yet managed to deliver that
> signal to GDB before GDB gave up waiting and continued handling the
> events it already had. GDB selects one of the threads that has reported
> a SIGSTOP and passes this thread ID back to fetch_inferior_event.
>
> To handle the thread's SIGSTOP, GDB calls handle_signal_stop, which
> calls stop_all_threads, this calls wait_one, which in turn calls
> target_wait.
>
> The first call to target_wait at this point will result in a stashed
> wait status being returned, at which point we call setup_inferior. The
> call to setup_inferior leads to a call into try_thread_db_load_1 which
> results in a call to linux_stop_and_wait_all_lwps. This in turn calls
> stop_callback on each thread followed by stop_wait_callback on each
> thread.
>
> We're now ready to make the mistake. In stop_callback we see that our
> problem thread is not stopped, but is signalled, so it should stop soon.
> As a result we don't send another SIGSTOP.
>
> We then enter stop_wait_callback, eventually the problem thread stops
> with SIGSTOP which we _incorrectly_ assume came from stop_callback, and
> we discard.
>
> Once stop_wait_callback has done its damage we return from
> linux_stop_and_wait_all_lwps, finish in try_thread_db_load_1, and
> eventually unwind back to the call to setup_inferior in
> stop_all_threads. GDB now loops around, and performs another
> target_wait to get the next event from the inferior.
>
> The target_wait calls causes us to once again reach linux_nat_wait_1,
> and we pass through some code that calls resume_stopped_resumed_lwps.
> This allows GDB to resume threads that are physically stopped, but which
> GDB doesn't see any good reason for the thread to remain stopped. In
> our case, the problem thread which had its SIGSTOP discarded is stopped,
> but doesn't have a stashed wait status to report, and so GDB sets the
> thread going again.
>
> We are now stuck waiting for an event on the problem thread that might
> never arrive.
>
> When considering how to write a test for this bug I struggled. The
> issue was only spotted _randomly_ when a machine was heavily loaded with
> many multi-threaded applications, and GDB was being attached (by script)
> to all of these applications in parallel. In one reproducer I required
> around 5 applications each of 5 threads per machine core in order to
> reproduce the bug 2 out of 3 times.
>
> What we really want to do though is simulate the kernel being slow to
> report events through waitpid during the initial attach. The solution I
> came up with was to write an LD_PRELOAD library that intercepts (some)
> waitpid calls and rate limits them to one per-second. Any more than
> that simply return 0 indicating there's no event available. Obviously
> this can only be applied to waitpid calls that have the WNOHANG flag
> set.
>
> Unfortunately, this library does break the "rest" of GDB, I suspect that
> the issue is that in some places GDB knows that there's an event
> pending, calls non-blocking waitpid and when the event fails to arrive
> GDB moves into some unexpected/broken state from which it can't recover.
> Still, the preload library does, at the moment, trigger the bug during
> attach.
>
> gdb/ChangeLog:
>
> * linux-nat.c (stop_wait_callback): Don't discard SIGSTOP if it
> was requested by GDB.
>
> gdb/testsuite/ChangeLog:
>
> * gdb.threads/attach-slow-waitpid.c: New file.
> * gdb.threads/attach-slow-waitpid.exp: New file.
> * gdb.threads/slow-waitpid.c: New file.
> ---
> gdb/ChangeLog | 6 ++
> gdb/linux-nat.c | 14 ++-
> gdb/testsuite/ChangeLog | 7 ++
> gdb/testsuite/gdb.threads/attach-slow-waitpid.c | 77 ++++++++++++++
> gdb/testsuite/gdb.threads/attach-slow-waitpid.exp | 95 ++++++++++++++++++
> gdb/testsuite/gdb.threads/slow-waitpid.c | 117 ++++++++++++++++++++++
> 6 files changed, 312 insertions(+), 4 deletions(-)
> create mode 100644 gdb/testsuite/gdb.threads/attach-slow-waitpid.c
> create mode 100644 gdb/testsuite/gdb.threads/attach-slow-waitpid.exp
> create mode 100644 gdb/testsuite/gdb.threads/slow-waitpid.c
>
> diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
> index 445b59fa4ad..d1d0ba66d2e 100644
> --- a/gdb/linux-nat.c
> +++ b/gdb/linux-nat.c
> @@ -2527,17 +2527,23 @@ stop_wait_callback (struct lwp_info *lp, void *data)
> }
> else
> {
> - /* We caught the SIGSTOP that we intended to catch, so
> - there's no SIGSTOP pending. */
> + /* We caught the SIGSTOP that we intended to catch. */
>
> if (debug_linux_nat)
> fprintf_unfiltered (gdb_stdlog,
> "SWC: Expected SIGSTOP caught for %s.\n",
> target_pid_to_str (lp->ptid));
>
> - /* Reset SIGNALLED only after the stop_wait_callback call
> - above as it does gdb_assert on SIGNALLED. */
> lp->signalled = 0;
> +
> + /* If we are waiting for this stop so we can report the thread
> + stopped then we need to record that this status. Otherwise,
> + we can now discard this stop event. */
> + if (lp->last_resume_kind == resume_stop)
> + {
> + lp->status = status;
> + save_stop_reason (lp);
> + }
> }
> }
>
> diff --git a/gdb/testsuite/gdb.threads/attach-slow-waitpid.c b/gdb/testsuite/gdb.threads/attach-slow-waitpid.c
> new file mode 100644
> index 00000000000..06e99ab22d1
> --- /dev/null
> +++ b/gdb/testsuite/gdb.threads/attach-slow-waitpid.c
> @@ -0,0 +1,77 @@
> +/* This testcase is part of GDB, the GNU debugger.
> +
> + Copyright 2018 Free Software Foundation, Inc.
> +
> + This program is free software; you can redistribute it and/or modify
> + it under the terms of the GNU General Public License as published by
> + the Free Software Foundation; either version 3 of the License, or
> + (at your option) any later version.
> +
> + This program is distributed in the hope that it will be useful,
> + but WITHOUT ANY WARRANTY; without even the implied warranty of
> + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + GNU General Public License for more details.
> +
> + You should have received a copy of the GNU General Public License
> + along with this program. If not, see <http://www.gnu.org/licenses/>. */
> +
> +#include <pthread.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <assert.h>
> +#define NUM_THREADS 4
> +
> +/* Crude spin lock. Threads all spin until this is set to 0. */
> +int go = 1;
> +
> +/* Thread function, just spin until GO is set to 0. */
> +void *
> +perform_work (void *argument)
> +{
> + /* Cast to volatile to ensure that ARGUMENT is loaded each time around
> + the loop. */
> + while (*((volatile int*) argument))
> + {
> + /* Nothing. */
> + }
> + return NULL;
> +}
> +
> +/* The spin loop for the main thread. */
> +void
> +function (void)
> +{
> + (void) perform_work (&go);
> + printf ("Finished from function\n");
> +}
> +
> +/* Main program, create some threads which all spin waiting for GO to be
> + set to 0. */
> +int
> +main (void)
> +{
> + pthread_t threads[NUM_THREADS];
> + int result_code;
> + unsigned index;
> +
> + /* Create some threads. */
> + for (index = 0; index < NUM_THREADS; ++index)
> + {
> + printf ("In main: creating thread %d\n", index);
> + result_code = pthread_create (&threads[index], NULL, perform_work, &go);
> + assert (!result_code);
> + }
> +
> + function ();
> +
> + /* Wait for each thread to complete. */
> + for (index = 0; index < NUM_THREADS; ++index)
> + {
> + /* Block until thread INDEX completes. */
> + result_code = pthread_join (threads[index], NULL);
> + assert (!result_code);
> + printf ("In main: thread %d has completed\n", index);
> + }
> + printf ("In main: All threads completed successfully\n");
> + return 0;
> +}
> diff --git a/gdb/testsuite/gdb.threads/attach-slow-waitpid.exp b/gdb/testsuite/gdb.threads/attach-slow-waitpid.exp
> new file mode 100644
> index 00000000000..d4b12727842
> --- /dev/null
> +++ b/gdb/testsuite/gdb.threads/attach-slow-waitpid.exp
> @@ -0,0 +1,95 @@
> +# Copyright 2018 Free Software Foundation, Inc.
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 3 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program. If not, see <http://www.gnu.org/licenses/>.
> +
> +# This test script tries to expose a bug in some of the use of waitpid
> +# in the linux native support within GDB. The problem was spotted on
> +# systems which were heavily loaded when attaching to threaded test
> +# programs. What happened was that during the initial attach, the
> +# loop of waitpid calls that normally received the stop events from
> +# each of the threads in the inferior was not receiving a stop event
> +# for some threads (the kernel just hadn't sent the stop event yet).
> +#
> +# GDB would then trigger a call to stop_all_threads which would
> +# continue to wait for all of the outstanding threads to stop, when
> +# the outstanding stop events finally arrived GDB would then
> +# (incorrectly) discard the stop event, resume the thread, and
> +# continue to wait for the thread to stop.... which it now never
> +# would.
> +#
> +# In order to try and expose this issue reliably, this test preloads a
> +# library that intercepts waitpid calls. And waitpid calls targeting
> +# pid -1, and with the WNOHANG flag are rate limited so that only 1
> +# per second can complete, any additional calls are forced to return 0
> +# indicating no event waiting. This is enough to trigger the bug
> +# during the attach phase, however, it breaks the rest of GDB's
> +# behaviour, so we can't do more than just attach with this library in
> +# place.
> +
> +# This test only works on Linux
> +if { ![isnative] || [is_remote host] || [use_gdb_stub]
> + || ![istarget *-linux*] } {
> + continue
> +}
> +
> +standard_testfile
> +
> +set libfile slow-waitpid
> +set libsrc "${srcdir}/${subdir}/${libfile}.c"
> +set libobj [standard_output_file ${libfile}.so]
> +
> +# Compile the preload library. We only get away with this as we limit
> +# this test to running when ISNATIVE is true.
> +if { [gdb_compile_shlib $libsrc $libobj {debug}] != "" } then {
> + return -1
> +}
> +
> +# Compile the test program
> +if { [gdb_compile_pthreads "${srcdir}/${subdir}/${srcfile}" "${binfile}" executable {debug}] != "" } {
> + return -1
> +}
> +
> +# Spawn GDB with LIB preloaded with LD_PRELOAD. CMDLINE_OPTS are
> +# command line options passed to GDB.
> +
> +proc gdb_spawn_with_ld_preload {lib cmdline_opts} {
> + global env
> +
> + save_vars { env(LD_PRELOAD) } {
> + if { ![info exists env(LD_PRELOAD) ]
> + || $env(LD_PRELOAD) == "" } {
> + set env(LD_PRELOAD) "$lib"
> + } else {
> + append env(LD_PRELOAD) ":$lib"
> + }
> +
> + gdb_spawn_with_cmdline_opts $cmdline_opts
> + }
> +}
> +
> +# Run test program in the background.
> +set test_spawn_id [spawn_wait_for_attach $binfile]
> +set testpid [spawn_id_get_pid $test_spawn_id]
> +
> +# Start GDB with preload library in place.
> +gdb_spawn_with_ld_preload $libobj ""
> +
> +# Load binary, and attach to running program.
> +gdb_load ${binfile}
> +gdb_test "attach $testpid" "Attaching to program.*" "attach to target"
> +
> +gdb_exit
> +
> +# Kill of test program.
> +kill_wait_spawned_process $test_spawn_id
> diff --git a/gdb/testsuite/gdb.threads/slow-waitpid.c b/gdb/testsuite/gdb.threads/slow-waitpid.c
> new file mode 100644
> index 00000000000..40b5c958899
> --- /dev/null
> +++ b/gdb/testsuite/gdb.threads/slow-waitpid.c
> @@ -0,0 +1,117 @@
> +/* This testcase is part of GDB, the GNU debugger.
> +
> + Copyright 2018 Free Software Foundation, Inc.
> +
> + This program is free software; you can redistribute it and/or modify
> + it under the terms of the GNU General Public License as published by
> + the Free Software Foundation; either version 3 of the License, or
> + (at your option) any later version.
> +
> + This program is distributed in the hope that it will be useful,
> + but WITHOUT ANY WARRANTY; without even the implied warranty of
> + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + GNU General Public License for more details.
> +
> + You should have received a copy of the GNU General Public License
> + along with this program. If not, see <http://www.gnu.org/licenses/>. */
> +
> +#define _GNU_SOURCE
> +
> +#include <sys/types.h>
> +#include <sys/wait.h>
> +#include <sys/time.h>
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include <dlfcn.h>
> +#include <string.h>
> +#include <stdarg.h>
> +
> +/* Logging. */
> +
> +static void
> +log_msg (const char *fmt, ...)
> +{
> +#ifdef LOGGING
> + va_list ap;
> +
> + va_start (ap, fmt);
> + vfprintf (stderr, fmt, ap);
> + va_end (ap);
> +#endif /* LOGGING */
> +}
> +
> +/* Error handling, message and exit. */
> +
> +static void
> +error (const char *fmt, ...)
> +{
> + va_list ap;
> +
> + va_start (ap, fmt);
> + vfprintf (stderr, fmt, ap);
> + va_end (ap);
> +
> + exit (EXIT_FAILURE);
> +}
> +
> +/* Return true (non-zero) if we should skip this call to waitpid, or false
> + (zero) if this waitpid call should be handled with a call to the "real"
> + waitpid library. Allows 1 waitpid call per second. */
> +
> +static int
> +should_skip_waitpid (void)
> +{
> + /* When we last allowed a waitpid to complete. */
> + static struct timeval last_waitpid_time = { 0, 0 };
> +
> + struct timeval *tv = &last_waitpid_time;
> + if (tv->tv_sec == 0)
> + {
> + if (gettimeofday (tv, NULL) < 0)
> + error ("error: gettimeofday failed\n");
> + return 0; /* Don't skip. */
> + }
> + else
> + {
> + struct timeval new_tv;
> +
> + if (gettimeofday (&new_tv, NULL) < 0)
> + error ("error: gettimeofday failed\n");
> +
> + if ((new_tv.tv_sec - tv->tv_sec) < 1)
> + return 1; /* Skip. */
> +
> + memcpy (tv, &new_tv, sizeof (new_tv));
> + }
> +
> + /* Don't skip. */
> + return 0;
> +}
> +
> +/* The waitpid entry point function. */
> +
> +pid_t
> +waitpid (pid_t pid, int *wstatus, int options)
> +{
> + typedef pid_t (*fptr_t) (pid_t, int *, int);
> + static fptr_t real_func = NULL;
> +
> + if (real_func == NULL)
> + {
> + real_func = dlsym (RTLD_NEXT, "waitpid");
> + if (real_func == NULL)
> + error ("error: failed to find real waitpid\n");
> + }
> +
> + log_msg ("waitpid (%d, %p, 0x%x)\n", pid, wstatus, options);
> +
> + if ((options & WNOHANG) != 0
> + && pid == -1
> + && should_skip_waitpid ())
> + {
> + log_msg ("waitpid: skipping\n");
> + return 0;
> + }
> +
> + return (*real_func) (pid, wstatus, options);
> +}
> --
> 2.14.3
>
More information about the Gdb-patches
mailing list