This is the mail archive of the gdb-patches@sourceware.org mailing list for the GDB project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH] Fix PR 12702 - gdb can hang waiting for thread group leader (gdbserver)

From: Pedro Alves <palves at redhat dot com>
To: Hui Zhu <teawater at gmail dot com>
Cc: gdb-patches ml <gdb-patches at sourceware dot org>, "Gustavo, Luis" <luis_gustavo at mentor dot com>
Date: Thu, 23 Jan 2014 14:00:21 +0000
Subject: Re: [PATCH] Fix PR 12702 - gdb can hang waiting for thread group leader (gdbserver)
Authentication-results: sourceware.org; auth=none
References: <CANFwon2cWyxzWi+Sqfk6mUK0ZCWQRisMmGWe6mxPMgDLTV_j2g at mail dot gmail dot com> <51F01C13 dot 3080402 at redhat dot com> <CANFwon2jf-d8RNN3wScXZDe7ETm=rQ5CyOY723XLegpgFVb3XQ at mail dot gmail dot com>

On 11/19/2013 06:15 AM, Hui Zhu wrote:
> On Thu, Jul 25, 2013 at 2:25 AM, Pedro Alves <palves@redhat.com> wrote:
>> > On 07/20/2013 10:59 AM, Hui Zhu wrote:
>>> >> Hi,
>>> >>
>>> >> I got a issue with leader-exit.exp will hang with target_board=native-gdbserver.
>>> >> I found the reason is gdbserver still have this issue.  So I post a
>>> >> patch for gdbserver.
>> >
>> > leader-exit.exp is part of a series of tests for a set of related
>> > problems.  See:
>> >
>> >   http://www.sourceware.org/ml/gdb-patches/2011-10/msg00704.html
>> >
>> > Several are currently masked because gdbserver doesn't support
>> > fork/exec yet.
>> >
>> > We should fix this by implementing TARGET_WAITKIND_NO_RESUMED
>> > on gdbserver, like the patch above did for native.
>> >
>> > --
>> > Pedro Alves
>> >
> Hi Pedro,
> 
> According to your comments, I make a new patch to add
> TARGET_WAITKIND_NO_RESUMED to gdbserver.
> It can pass the leader-exit.exp and pass the regression test.
> 
> Please help me review it.

Hi Hui.  Sorry for the time it takes to get through patches
like this.  This is hardly trivial code...

> --- a/gdb/gdbserver/linux-low.h
> +++ b/gdb/gdbserver/linux-low.h
> @@ -253,6 +253,9 @@ struct lwp_info
>       event already received in a wait()).  */
>    int stopped;
>
> +  /* If this flag is set, the lwp has a wait status pending.  */
> +  int pending_stop;
> +

You didn't explain it, so it took me a bit to grok this, but I
figured out that essentially, you're adding yet another layer of
pending events (a lower one).  Hmm.  I'd really like to avoid that.

> +      /* If there are no resumed children left, bail.  We'd be stuck
> +	 forever in the sigsuspend call below otherwise.
> +	 OR no interesting event to report to the core.*/
> +      if ((find_inferior (&all_lwps, not_suspended_callback, &ptid) == NULL)
> +	  || (options & WNOHANG))
> +	{

Not clear to me why checking the suspended flag would be the
right check when determining whether there are no resumed LWPs
left.

> +/* Wait for LP to stop.  Returns the wait status, or 0 if the LWP has
> +   exited.  */
> +
> +static int
> +wait_lwp (struct lwp_info *lwp)
> +{
> +  pid_t pid;
> +  int status = 0;
> +  int thread_dead = 0;
> +  sigset_t prev_mask;
> +
> +  /* Make sure SIGCHLD is blocked for sigsuspend avoiding a race below.  */
> +  sigprocmask (SIG_BLOCK, &sigchld_mask, &prev_mask);
> +
> +  for (;;)
> +    {
> +      sigset_t suspend_mask;
> +
> +      /* If my_waitpid returns 0 it means the __WCLONE vs. non-__WCLONE kind
> +	 was right and we should just call sigsuspend.  */
> +
> +      pid = my_waitpid (ptid_get_lwp (lwp->head.id), &status, WNOHANG);
> +      if (pid == -1 && errno == ECHILD)
> +	pid = my_waitpid (ptid_get_lwp (lwp->head.id), &status, __WCLONE | WNOHANG);
> +      if (pid == -1 && errno == ECHILD)
> +	{
> +	  /* The thread has previously exited.  We need to delete it
> +	     now because, for some vendor 2.4 kernels with NPTL
> +	     support backported, there won't be an exit event unless
> +	     it is the main thread.  2.6 kernels will report an exit
> +	     event for each thread that exits, as expected.  */
> +	  thread_dead = 1;
> +	  if (debug_threads)
> +	    fprintf (stderr, "WL: %s vanished.\n",
> +		     target_pid_to_str (lwp->head.id));
> +	}
> +      if (pid != 0)
> +	break;
> +
> +      /* Bugs 10970, 12702.
> +	 Thread group leader may have exited in which case we'll lock up in
> +	 waitpid if there are other threads, even if they are all zombies too.
> +	 Basically, we're not supposed to use waitpid this way.
> +	 __WCLONE is not applicable for the leader so we can't use that.
> +	 LINUX_NAT_THREAD_ALIVE cannot be used here as it requires a STOPPED
> +	 process; it gets ESRCH both for the zombie and for running processes.
> +
> +	 As a workaround, check if we're waiting for the thread group leader and
> +	 if it's a zombie, and avoid calling waitpid if it is.
> +
> +	 This is racy, what if the tgl becomes a zombie right after we check?
> +	 Therefore always use WNOHANG with sigsuspend - it is equivalent to
> +	 waiting waitpid but linux_proc_pid_is_zombie is safe this way.  */
> +
> +      if (ptid_get_pid (lwp->head.id) == ptid_get_lwp (lwp->head.id)
> +	  && linux_proc_pid_is_zombie (ptid_get_lwp (lwp->head.id)))
> +	{
> +	  thread_dead = 1;
> +	  if (debug_threads)
> +	    fprintf (stderr, "WL: Thread group leader %s vanished.\n",
> +		     target_pid_to_str (lwp->head.id));
> +	  break;
> +	}
> +
> +      /* Wait for next SIGCHLD and try again.  This may let SIGCHLD handlers
> +	 get invoked despite our caller had them intentionally blocked by
> +	 block_child_signals.  This is sensitive only to the loop of
> +	 linux_nat_wait_1 and there if we get called my_waitpid gets called
> +	 again before it gets to sigsuspend so we can safely let the handlers
> +	 get executed here.  */
> +
> +      /* Block until we get an event reported with SIGCHLD.  */
> +      sigprocmask (SIG_SETMASK, NULL, &suspend_mask);
> +      sigdelset (&suspend_mask, SIGCHLD);
> +      sigsuspend (&suspend_mask);
> +    }
> +
> +  sigprocmask (SIG_SETMASK, &prev_mask, NULL);
> +
> +  if (!thread_dead)
> +    {
> +      gdb_assert (pid == ptid_get_lwp (lwp->head.id));
> +
> +      if (debug_threads)
> +	fprintf (stderr, "WL: waitpid %s received %x\n",
> +		 target_pid_to_str (lwp->head.id), status);
> +
> +      /* Check if the thread has exited.  */
> +      if (WIFEXITED (status) || WIFSIGNALED (status))
> +	{
> +	  thread_dead = 1;
> +	  if (debug_threads)
> +	    fprintf (stderr, "WL: %s exited.\n",
> +		     target_pid_to_str (lwp->head.id));
> +	}
> +    }
> +
> +  if (thread_dead)
> +    {
> +      delete_lwp (lwp);
> +      return 0;
> +    }
> +
> +  gdb_assert (WIFSTOPPED (status));
> +
> +  lwp->stopped = 1;
> +  lwp->last_status = status;
> +  if (WIFSTOPPED (status))
> +    {
> +      lwp->stop_pc = get_stop_pc (lwp);
> +      if (WSTOPSIG (status) == SIGSTOP && lwp->stop_expected)
> +	lwp->stop_expected = 0;
> +    }
> +
> +  if (WIFSTOPPED (status) && WSTOPSIG (status) == SIGTRAP)
> +    {
> +      if (the_low_target.stopped_by_watchpoint == NULL)
> +	{
> +	  lwp->stopped_by_watchpoint = 0;
> +	}
> +      else
> +	{
> +	  struct thread_info *saved_inferior;
> +
> +	  saved_inferior = current_inferior;
> +	  current_inferior = get_lwp_thread (lwp);
> +
> +	  lwp->stopped_by_watchpoint
> +	    = the_low_target.stopped_by_watchpoint ();
> +
> +	  if (lwp->stopped_by_watchpoint)
> +	    {
> +	      if (the_low_target.stopped_data_address != NULL)
> +		lwp->stopped_data_address
> +		  = the_low_target.stopped_data_address ();
> +	      else
> +		lwp->stopped_data_address = 0;
> +	    }
> +
> +	  current_inferior = saved_inferior;
> +	}
> +    }
> +
> +  /* Handle GNU/Linux's syscall SIGTRAPs.  */
> +  if (WIFSTOPPED (status) && WSTOPSIG (status) == SYSCALL_SIGTRAP)
> +    {
> +      if (debug_threads)
> +	fprintf (stderr, "WL: Handling syscall SIGTRAPs\n");
> +      /* No longer need the sysgood bit.  The ptrace event ends up
> +	 recorded in lp->waitstatus if we care for it.  We can carry
> +	 on handling the event like a regular SIGTRAP from here
> +	 on.  */
> +      status = W_STOPCODE (SIGTRAP);
> +      ptrace (PTRACE_CONT, ptid_get_lwp (lwp->head.id), 0, 0);
> +      return wait_lwp (lwp);
> +    }
> +
> +  /* Handle GNU/Linux's extended waitstatus for trace events.  */
> +  if (WIFSTOPPED (status) && WSTOPSIG (status) == SIGTRAP && status >> 16 != 0)
> +    {
> +      if (debug_threads)
> +	fprintf (stderr, "WL: Handling extended status 0x%06x\n", status);
> +      handle_extended_wait (lwp, status);
> +      return wait_lwp (lwp);
> +    }
> +
> +  return status;
> +}

This is main gripe with this patch.  I'd really like to avoid
bringing in more of this broken use of waitpid(PID) into gdbserver
(I realize most of this is copied from GDB), and this duplication
of low level wait status handling.  I think we can do
better in gdbserver.  There's really no need for wait_for_sigstop
to wait for each LWP in turn.   I think that with some
refactoring, we can make it reuse linux_wait_for_event, and
only end up with having to handle pending statuses in one place.

So I've been working on this most of this week, and I got this
almost done but I'm afraid I'll need to give attention to other
work, so I'm not sure I'll be able to be back at this before
February.  Anyway, I'll post a WIP series in response to this email.
In addition, it goes a step forward and starts adding support for
TARGET_WAITKIND_NO_RESUMED to the RSP as well, which is needed for
fully fixing no-unwaited-for-left.exp.

I've put the series in the my github as well:

 git@github.com:palves/gdb.git TARGET_WAITKIND_NO_RESUMED_gdbserver

-- 
Pedro Alves

Follow-Ups:
- [PATCH 1/5] Move ptid_match to common/ptid.c.
  - From: Pedro Alves

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]