This is the mail archive of the gdb-patches@sourceware.org mailing list for the GDB project.



RE: RFH: failed assert debugging threaded+fork program over gdbserver



> -----Original Message-----
> From: Joel Brobecker [mailto:brobecker@adacore.com]
> Sent: Thursday, June 23, 2016 4:00 PM
> To: Breazeal, Don
> Cc: gdb-patches@sourceware.org
> Subject: Re: RFH: failed assert debugging threaded+fork program over
> gdbserver
> 
> Hello,
> 
> I finally had some time to look into this again. I think I have a
> better understanding.
> 
> On Thu, May 12, 2016 at 10:42:04AM -0700, Don Breazeal wrote:
> > On 5/12/2016 10:16 AM, Joel Brobecker wrote:
> > > Hello,
> > >
> > > I have noticed the following problem when debugging a program
> > > which uses both threads and fork. The program is attached, and
> > > it was compiled by simply doing:
> > >     % gnatmake -g a_test
> > >
> > > The issue appears only randomly, but it seems to show up fairly
> > > reliably when using certain versions of GNU/Linux such as RHES7
> > > or WRSLinux. I also see it on Ubuntu, but less reliably. Here is
> > > what I have found, debugging on WRSLinux (we set it up as a
> > > cross, but it should be the same with native GNU/Linux distros):
> > >
> > >     % gdb a_test
> > >     (gdb) break a_test.adb:30
> > >     (gdb) break a_test.adb:39
> > >     (gdb) target remote my_board:4444
> > >     (gdb) continue
> > >     Continuing.
> > >     [...]
> > >     [New Thread 866.868]
> > >     [New Thread 866.869]
> > >     [New Thread 870.870]
> > >     /[...]/gdb/thread.c:89: internal-error: thread_info*
> > >     inferior_thread(): Assertion `tp' failed.
> > >     A problem internal to GDB has been detected,
> > >     further debugging may prove unreliable.
> > >     Quit this debugging session? (y or n)
> > >
> > > The error happens because GDBserver returns a list of threads to
> > > GDB in which a new thread has a different PID (870 in the case
> > > above, instead of 866).
> >
> > This is not supposed to happen.  In
> > remote.c:remote_update_thread_list there is a call to
> > remove_new_fork_children that is explicitly supposed to prevent
> > this scenario.  The new fork child thread should be deleted from
> > the thread list ("context") before we call
> > remote_notice_new_inferior.
> >
> > We don't want the remote to report new threads related to the fork
> > child until after we have handled the fork using
> > infrun.c:follow_fork.  (Note: it looks like the function comment
> > for remove_new_fork_children is stale, since there are scenarios
> > other than being stopped at a fork catchpoint where this can
> > occur.)
> >
> > > Any advice on how I should be fixing the issue?
> >
> > It looks like there is a case that remove_new_fork_children isn't
> > handling correctly with your test+target(s). I'd start there to find
> > out how the new thread is getting through.
> 
> I started by looking at remove_new_fork_children, which, in essence,
> removes threads whose target_waitstatus->kind is forked or vforked.
> For that to happen, I think the remote (gdbserver) has to send GDB a
> stop:fork packet, something like this:
> 
>     Packet received:
>     T05fork:p4bc4.4bc4;06:0100000000000000;07:d0ddffffff7f0000;10:6a4f83f7ff7f0000;thread:p4ba5.4ba5;core:5;
> 
> However, this is not what happens in our case. When continuing, we
> just receive a SIGTRAP event, at which point GDB asks for the thread
> list; by then, the thread for the forked process is already in the
> list, leading to the situation described above.
> 
> Deciphering the gdbserver logs, I think I understand why this is
> happening. After we're done single-stepping out of our breakpoint,
> and we have resumed our program, we go into a wait loop in
> linux_wait_for_event_filtered, which, as I understand it, pulls
> events from the kernel without doing a blocking wait (WNOHANG), and
> as long as events are found, we keep waiting for more:
> 
>   /* Always pull all events out of the kernel.  We'll randomly select
>      an event LWP out of all that have events, to prevent
>      starvation.  */
>   while (event_child == NULL)
>     {
>       [...]
>       ret = my_waitpid (-1, wstatp, options | WNOHANG);
>       if (ret > 0)
>         {
>           [...]
>           continue;
>         }
> 
> After resuming, we go through a couple of loops where I think we
> discover the new threads that are getting started. Each time, we
> resume the threads via:
> 
>       /* Now that we've pulled all events out of the kernel, resume
>          LWPs that don't have an interesting event to report.  */
>       if (stopping_threads == NOT_STOPPING_THREADS)
>         for_each_inferior (&all_threads, resume_stopped_resumed_lwps);
> 
> But shortly after, we get the following batch of events:
> 
> | leader_pid=2178, leader_lp!=NULL=1, num_lwps=3, zombie=0
> | sigsuspend'ing
> | sigchld_handler
> | my_waitpid (-1, 0x40000001)
> | my_waitpid (-1, 0x40000001): status(137f), 2190
> | LWFE: waitpid(-1, ...) returned 2190, ERRNO-OK
> | LLW: waitpid 2190 received Stopped (signal) (stopped)
> | my_waitpid (-1, 0x40000001)
> | my_waitpid (-1, 0x40000001): status(57f), 2189
> | LWFE: waitpid(-1, ...) returned 2189, ERRNO-OK
> | LLW: waitpid 2189 received Trace/breakpoint trap (stopped)
> | pc is 0x1000336c
> | CSBB: LWP 2178.2189 stopped by software breakpoint
> | my_waitpid (-1, 0x40000001)
> | my_waitpid (-1, 0x40000001): status(57f), 2188
> | LWFE: waitpid(-1, ...) returned 2188, ERRNO-OK
> | LLW: waitpid 2188 received Trace/breakpoint trap (stopped)
> | pc is 0x10003218
> | CSBB: LWP 2178.2188 stopped by software breakpoint
> | my_waitpid (-1, 0x40000001)
> | my_waitpid (-1, 0x40000001): status(1057f), 2178
> | LWFE: waitpid(-1, ...) returned 2178, ERRNO-OK
> | LLW: waitpid 2178 received Trace/breakpoint trap (stopped)
> | pc is 0xfe87158
> | HEW: Got fork event from LWP 2178, new child is 2190  <<<<--- fork!
> |
> | pc is 0xfe87158
> | my_waitpid (-1, 0x40000001)
> | my_waitpid (-1, 0x40000001): status(ffffffff), 0
> | LWFE: waitpid(-1, ...) returned 0, ERRNO-OK
> 
> This tells me that we get 3 SIGTRAP events from our program's threads
> hitting a breakpoint, *and* the fork event. The fork event is
> confirmed by some logs printed right after:
> 
> | LWP 2178: extended event with waitstatus status->kind = forked
> | Hit a non-gdbserver trap event.
> 
> Now that we have interesting stuff to report, we decide to stop all
> LWPs:
> 
> | >>>> entering stop_all_lwps
> | stop_all_lwps (stop, except=none)
> | wait_for_sigstop: pulling events
> | my_waitpid (-1, 0x40000001)
> | my_waitpid (-1, 0x40000001): status(ffffffff), 0
> | LWFE: waitpid(-1, ...) returned 0, ERRNO-OK
> | leader_pid=2178, leader_lp!=NULL=1, num_lwps=3, zombie=0
> | leader_pid=2190, leader_lp!=NULL=1, num_lwps=1, zombie=0
> | LLW: exit (no unwaited-for LWP)
> | stop_all_lwps done, setting stopping_threads back to !stopping
> | <<<< exiting stop_all_lwps
> 
> Because all our threads had hit a breakpoint, we can see above that
> none of the LWPs got sent a SIGSTOP signal, and therefore waitpid
> returned zero as expected - confirmed by "(no unwaited-for LWP)".
> 
> And this is where I think things go wrong - instead of reporting the
> fork event to GDB, we look at all our threads, find that none of them
> was being single-stepped, and so select one of the threads that got a
> SIGTRAP at random, reporting the event of that thread:
> 
> | SEL: Found 3 SIGTRAP events, selecting #1
> | Checking whether LWP 2178 needs to move out of the jump pad...no
> | Checking whether LWP 2190 needs to move out of the jump pad...no
> | linux_wait_1 ret = LWP 2178.2188, 1, 5
> | <<<< exiting linux_wait_1
> | Writing resume reply for LWP 2178.2188:1
> | putpkt ("$T05swbreak:;01:482009e0;40:10003218;thread:p882.88c;core:0;#75");
> | [noack mode]
> 
> This is in linux-low.c:select_event_lwp.
> 
> I'm thinking the right thing to do here is to enhance
> select_event_lwp to look for threads that received a fork/vfork
> event first, and report that event to GDB ahead of all the other
> kinds of events.
> 
> But I ran out of time to try this today, so I decided to send my
> notes so far. I hope I am on the right track!
> 
> --
> Joel

Hi Joel,
Thanks for following up on this.  Your diagnosis makes sense to me.
As for modifications to select_event_lwp, Pedro would be better suited
to comment on potential side effects.
Thanks
--Don

