This is the mail archive of the gdb-patches@sourceware.org mailing list for the GDB project.
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]
[PATCH 3/6] Fix race exposed by gdb.threads/killed.exp

From: Pedro Alves <palves at redhat dot com>
To: gdb-patches at sourceware dot org
Date: Fri, 6 Mar 2015 19:58:03 +0000
Subject: [PATCH 3/6] Fix race exposed by gdb.threads/killed.exp
Authentication-results: sourceware.org; auth=none
References: <1425671886-7798-1-git-send-email-palves at redhat dot com>
On GNU/Linux, this test sometimes FAILs like this:

 (gdb) run
 Starting program: /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/killed
 [Thread debugging using libthread_db enabled]
 Using host libthread_db library "/lib64/libthread_db.so.1".
 ptrace: No such process.
 (gdb)
 Program terminated with signal SIGKILL, Killed.
 The program no longer exists.
 FAIL: gdb.threads/killed.exp: run program to completion (timeout)

Note the suspicious "No such process" line (that's errno==ESRCH).
Adding debug output we see:

  linux_nat_wait: [process -1], [TARGET_WNOHANG]
  LLW: enter
  LNW: waitpid(-1, ...) returned 18465, ERRNO-OK
  LLW: waitpid 18465 received Stopped (signal) (stopped)
  LNW: waitpid(-1, ...) returned 18461, ERRNO-OK
  LLW: waitpid 18461 received Trace/breakpoint trap (stopped)
  LLW: Handling extended status 0x03057f
  LHEW: Got clone event from LWP 18461, new child is LWP 18465
  LNW: waitpid(-1, ...) returned 0, ERRNO-OK
  RSRL: resuming stopped-resumed LWP LWP 18465 at 0x3b36af4b51: step=0
  RSRL: resuming stopped-resumed LWP LWP 18461 at 0x3b36af4b51: step=0
  sigchld
  ptrace: No such process.
  (gdb) linux_nat_wait: [process -1], [TARGET_WNOHANG]
  LLW: enter
  LNW: waitpid(-1, ...) returned 18465, ERRNO-OK
  LLW: waitpid 18465 received Killed (terminated)
  LLW: LWP 18465 exited.
  LNW: waitpid(-1, ...) returned 18461, No child processes
  LLW: waitpid 18461 received Killed (terminated)
  Process 18461 exited
  LNW: waitpid(-1, ...) returned -1, No child processes
  LLW: exit
  sigchld
  infrun: target_wait (-1, status) =
  infrun:   18461 [process 18461],
  infrun:   status->kind = signalled, signal = GDB_SIGNAL_KILL
  infrun: TARGET_WAITKIND_SIGNALLED

  Program terminated with signal SIGKILL, Killed.
  The program no longer exists.
  infrun: stop_waiting
  FAIL: gdb.threads/killed.exp: run program to completion (timeout)

The ptrace call that fails was a PTRACE_CONTINUE.

The issue is that here:

  RSRL: resuming stopped-resumed LWP LWP 18465 at 0x3b36af4b51: step=0
  RSRL: resuming stopped-resumed LWP LWP 18461 at 0x3b36af4b51: step=0

The first line shows we had just resumed LWP 18465, which does:

 void *
 child_func (void *dummy)
 {
   kill (pid, SIGKILL);
   exit (1);
 }

So if the kernel manages to schedule that thread fast enough, the
process may be killed before GDB has a chance to resume LWP 18461.

GDBserver has code at the tail end of linux_resume_one_lwp to cope
with this:

~~~
    ptrace (step ? PTRACE_SINGLESTEP : PTRACE_CONT, lwpid_of (thread),
	    (PTRACE_TYPE_ARG3) 0,
	    /* Coerce to a uintptr_t first to avoid potential gcc warning
	       of coercing an 8 byte integer to a 4 byte pointer.  */
	    (PTRACE_TYPE_ARG4) (uintptr_t) signal);

    current_thread = saved_thread;
    if (errno)
      {
	/* ESRCH from ptrace either means that the thread was already
	   running (an error) or that it is gone (a race condition).  If
	   it's gone, we will get a notification the next time we wait,
	   so we can ignore the error.  We could differentiate these
	   two, but it's tricky without waiting; the thread still exists
	   as a zombie, so sending it signal 0 would succeed.  So just
	   ignore ESRCH.  */
	if (errno == ESRCH)
	  return;

	perror_with_name ("ptrace");
      }
~~~

However, that's not a complete fix, because between starting to handle
the resume request and getting that PTRACE_CONTINUE, we run other
ptrace calls that can also fail with ESRCH, and that end up throwing
an error.

So instead of ignoring ESRCH locally at each ptrace call, let ptrace
errors throw a new THREAD_NOT_FOUND_ERROR, and wrap the resume request
in a TRY_CATCH that swallows it.

For now, nothing in the core ever sees or handles this error, but I
envision that we'll find uses for it there too.

gdb/gdbserver/ChangeLog:
2015-03-06  Pedro Alves  <palves@redhat.com>

	* linux-low.c (linux_resume_one_lwp): Rename to ...
	(linux_resume_one_lwp_throw): ... this.  Don't handle ESRCH here,
	instead call throw_ptrace_error.
	(linux_resume_one_lwp): Reimplement as wrapper around
	linux_resume_one_lwp_throw that swallows THREAD_NOT_FOUND_ERROR.

gdb/ChangeLog:
2015-03-06  Pedro Alves  <palves@redhat.com>

	* linux-nat.c (linux_resume_one_lwp): Rename to ...
	(linux_resume_one_lwp_throw): ... this.  Don't handle ESRCH here,
	instead call throw_ptrace_error.
	(linux_resume_one_lwp): Reimplement as wrapper around
	linux_resume_one_lwp_throw that swallows THREAD_NOT_FOUND_ERROR.
---
 gdb/common/common-exceptions.h |  3 +++
 gdb/gdbserver/linux-low.c      | 54 +++++++++++++++++++++++++++++-------------
 gdb/linux-nat.c                | 32 ++++++++++++++++++++++++-
 gdb/nat/ptrace-utils.c         | 22 ++++++++++++++++-
 4 files changed, 93 insertions(+), 18 deletions(-)

diff --git a/gdb/common/common-exceptions.h b/gdb/common/common-exceptions.h
index a32e6f9..1a13f18 100644
--- a/gdb/common/common-exceptions.h
+++ b/gdb/common/common-exceptions.h
@@ -93,6 +93,9 @@ enum errors {
      aborted as the inferior state is no longer valid.  */
   TARGET_CLOSE_ERROR,
 
+  /* The thread is gone.  */
+  THREAD_NOT_FOUND_ERROR,
+
   /* An undefined command was executed.  */
   UNDEFINED_COMMAND_ERROR,
 
diff --git a/gdb/gdbserver/linux-low.c b/gdb/gdbserver/linux-low.c
index 48d905b..f9c4079 100644
--- a/gdb/gdbserver/linux-low.c
+++ b/gdb/gdbserver/linux-low.c
@@ -28,6 +28,7 @@
 #include "nat/linux-ptrace.h"
 #include "nat/linux-procfs.h"
 #include "nat/linux-personality.h"
+#include "nat/ptrace-utils.h"
 #include <signal.h>
 #include <sys/ioctl.h>
 #include <fcntl.h>
@@ -3378,13 +3379,12 @@ stop_all_lwps (int suspend, struct lwp_info *except)
     }
 }
 
-/* Resume execution of the inferior process.
-   If STEP is nonzero, single-step it.
-   If SIGNAL is nonzero, give it that signal.  */
+/* Like linux_resume_one_lwp but throws THREAD_NOT_FOUND_ERROR if the
+   LWP has disappeared (ptrace returns ESRCH).  */
 
 static void
-linux_resume_one_lwp (struct lwp_info *lwp,
-		      int step, int signal, siginfo_t *info)
+linux_resume_one_lwp_throw (struct lwp_info *lwp,
+			    int step, int signal, siginfo_t *info)
 {
   struct thread_info *thread = get_lwp_thread (lwp);
   struct thread_info *saved_thread;
@@ -3576,18 +3576,40 @@ linux_resume_one_lwp (struct lwp_info *lwp,
 
   current_thread = saved_thread;
   if (errno)
-    {
-      /* ESRCH from ptrace either means that the thread was already
-	 running (an error) or that it is gone (a race condition).  If
-	 it's gone, we will get a notification the next time we wait,
-	 so we can ignore the error.  We could differentiate these
-	 two, but it's tricky without waiting; the thread still exists
-	 as a zombie, so sending it signal 0 would succeed.  So just
-	 ignore ESRCH.  */
-      if (errno == ESRCH)
-	return;
+    throw_ptrace_error ("resuming thread");
+}
 
-      perror_with_name ("ptrace");
+/* Resume execution of LWP.  If STEP is nonzero, single-step it.  If
+   SIGNAL is nonzero, give it that signal.  No error is throw if the
+   LWP disappears while we try to resume it, no error is thrown.  */
+
+static void
+linux_resume_one_lwp (struct lwp_info *lwp,
+		      int step, int signal, siginfo_t *info)
+{
+  struct gdb_exception ex;
+
+  TRY_CATCH (ex, RETURN_MASK_ERROR)
+    {
+      linux_resume_one_lwp_throw (lwp, step, signal, info);
+    }
+  if (ex.reason < 0)
+    {
+      if (ex.error == THREAD_NOT_FOUND_ERROR)
+	{
+	  /* We got an ESRCH while resuming the LWP.  ESRCH from
+	     ptrace either means that the thread was already running
+	     (an error/bug) or that it is gone (a race condition, e.g,
+	     another thread exited the whole process.), or it was
+	     killed by SIGKILL.  If it's gone, we will get a
+	     notification the next time we wait, so we can ignore the
+	     error.  We could differentiate these, but it's tricky
+	     without waiting; the thread still exists as a zombie, so
+	     sending it signal 0 would succeed.  So just ignore
+	     ESRCH.  */
+	}
+      else
+	throw_exception (ex);
     }
 }
 
diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
index 20fe533..6bb62fd 100644
--- a/gdb/linux-nat.c
+++ b/gdb/linux-nat.c
@@ -1504,7 +1504,7 @@ linux_nat_detach (struct target_ops *ops, const char *args, int from_tty)
    single-step it.  If SIGNAL is nonzero, give it that signal.  */
 
 static void
-linux_resume_one_lwp (struct lwp_info *lp, int step, enum gdb_signal signo)
+linux_resume_one_lwp_throw (struct lwp_info *lp, int step, enum gdb_signal signo)
 {
   lp->step = step;
 
@@ -1528,6 +1528,36 @@ linux_resume_one_lwp (struct lwp_info *lp, int step, enum gdb_signal signo)
   registers_changed_ptid (lp->ptid);
 }
 
+static void
+linux_resume_one_lwp (struct lwp_info *lp, int step, enum gdb_signal signo)
+{
+  struct gdb_exception ex;
+
+  TRY_CATCH (ex, RETURN_MASK_ERROR)
+    {
+      linux_resume_one_lwp_throw (lp, step, signo);
+    }
+  if (ex.reason < 0)
+    {
+      if (ex.error == THREAD_NOT_FOUND_ERROR)
+	{
+	  /* We got an ESRCH while resuming the LWP.  ESRCH from
+	     ptrace either means that the thread was already running
+	     (an error/bug) or that it is gone (a race condition, e.g,
+	     another thread exited the whole process.), or it was
+	     killed by SIGKILL.  If it's gone, we will get a
+	     notification the next time we wait, so we can ignore the
+	     error.  We could differentiate these, but it's tricky
+	     without waiting; the thread still exists as a zombie, so
+	     sending it signal 0 would succeed.  So just ignore
+	     ESRCH.  */
+	}
+      else
+	throw_exception (ex);
+    }
+}
+
+
 /* Resume LP.  */
 
 static void
diff --git a/gdb/nat/ptrace-utils.c b/gdb/nat/ptrace-utils.c
index a3dc9b5..ca59454 100644
--- a/gdb/nat/ptrace-utils.c
+++ b/gdb/nat/ptrace-utils.c
@@ -25,5 +25,25 @@
 void
 throw_ptrace_error (const char *string)
 {
-  perror_with_name (string);
+  enum errors errcode;
+
+  switch (errno)
+    {
+    case ESRCH:
+      /* This either means that:
+
+	 - the thread was already running (a GDB error/bug).
+
+	 - the thread is gone, a race condition.  E.g, another thread
+	   was set running and it exited the whole process, or the
+	   process was killed by a SIGKILL.
+
+	 Assume the latter.  */
+      errcode = THREAD_NOT_FOUND_ERROR;
+      break;
+    default:
+      errcode = GENERIC_ERROR;
+      break;
+    }
+  throw_perror_with_name (errcode, string);
 }
-- 
1.9.3
Follow-Ups:
- Re: [PATCH 3/6] Fix race exposed by gdb.threads/killed.exp
  - From: Pedro Alves
References:
- [PATCH 0/6] Fix problems if inferior disappears while being debugged
  - From: Pedro Alves
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]