Bug 14382

Summary: gdb hangs after plotting with matplotlib
Product: gdb
Reporter: joeneeman
Component: python
Assignee: Not yet assigned to anyone <unassigned>
Status: NEW
Severity: normal
CC: aegges, dje, palves, tromey, tromey, xdje42
Priority: P2
Version: 7.4
Target Milestone: ---
Host:
Target:
Build:
Last reconfirmed:
Attachments:
  Python command that hangs gdb.
  Qt command that hangs gdb
  Simple multithreading that hangs gdb
  make linux-nat.c use signalfd

Description joeneeman 2012-07-21 15:27:48 UTC
Created attachment 6542 [details]
Python command that hangs gdb.

I'm attempting to write a gdb command in python that will display some data in a matplotlib plot. The plotting itself works fine: the plot is shown and I can interact with it. While the plot is open, the (gdb) prompt is not shown, as mp.show() is blocking.

After I close the plot, the problems begin. I am returned to the (gdb) prompt and I can issue certain commands (for example, "list" works as expected). However, if I attempt to continue running the program being debugged (e.g. with "next" or "continue"), then gdb becomes unresponsive, including to ^C.

You can reproduce the problem with the file I have attached (requires matplotlib to be installed):

echo "int main() { return 0; }" > gdbtest.c
gcc -g -o gdbtest gdbtest.c
gdb gdbtest

(at the (gdb) prompt)
source gdbtest2.py
plot
(close the plot window)
run
(gdb hangs)

I am using gdb 7.4.1 on Arch Linux.
Comment 1 Tom Tromey 2012-07-31 15:53:52 UTC
I tried this with 7.4 (from cvs) on Fedora 16.
I'm using:

barimba. rpm -q python-matplotlib
python-matplotlib-1.0.1-12.fc16.x86_64

I ran gdb on itself, tried "plot", then "run".
It all seems to work for me.

I do think something could go wrong if your Python library
tries to install new signal handlers, or if it calls
one of the wait family of functions without specifying a PID.
That doesn't seem to be happening for me, but maybe we
have different versions or something.
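
A minimal sketch of the second failure mode mentioned above, written in Python rather than gdb's C: a wait call with no PID reaps whichever child exits first, so the code that owns the child can no longer find it. The function name `wait_without_pid_steals_child` and the exit code 7 are made up for illustration, and the sketch assumes a POSIX system where the calling process has no other children.

```python
import os


def wait_without_pid_steals_child():
    """Show a PID-less wait reaping a child that another caller owns."""
    pid = os.fork()
    if pid == 0:
        os._exit(7)  # child: exit at once with a recognisable status

    # Stand-in for library code: os.wait() accepts any child of this process.
    reaped_pid, status = os.wait()

    # Stand-in for the owner (gdb, in this bug): wait for the specific PID.
    try:
        os.waitpid(pid, 0)
        owner_found_child = True
    except ChildProcessError:
        # ECHILD: the child was already reaped by the PID-less wait.
        owner_found_child = False

    return reaped_pid == pid, os.WEXITSTATUS(status), owner_found_child


if __name__ == "__main__":
    print(wait_without_pid_steals_child())
```

The returned tuple reports whether the PID-less wait got the child, the exit status it saw, and whether the owner's later `waitpid` still found anything.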
Comment 2 joeneeman 2012-08-07 00:35:05 UTC
Created attachment 6567 [details]
Qt command that hangs gdb
Comment 3 joeneeman 2012-08-07 02:15:29 UTC
Thanks for following this up. I can confirm that matplotlib-1.0.1 doesn't cause
a problem (on either 7.4.1 or latest CVS), but I'm still having a problem with
matplotlib-1.1.1. Actually, I can reproduce the problem with a simple Qt
application (see the new attachment). I can sometimes reproduce the problem
with the equivalent GTK application, but not reliably (i.e. I can sometimes go
on debugging for a minute or so before it hangs).
Comment 4 Tom Tromey 2012-08-15 21:33:26 UTC
strace catches this code changing the SIGCHLD handler:

rt_sigaction(SIGCHLD, {0x3716d53d70, [], SA_RESTORER|SA_NOCLDSTOP, 0x36d5c0f500}, {0x5e28c3, [], SA_RESTORER|SA_RESTART, 0x36d5c0f500}, 8) = 0

I suspect this is what breaks gdb.
Comment 5 Tom Tromey 2012-08-16 14:24:00 UTC
In particular it is the use of SA_NOCLDSTOP.
If I disable this (by breaking on __sigaction and
tweaking the flags), then gdb works again.

I am not sure what to suggest here.  You could file
a bug with Qt, I suppose.  That is probably a pretty
long road to getting a fix.

If you wanted to do a hack you could arrange to load
some other .so after loading Qt, and have this .so
reset the signal flags.

I don't know of a good way for gdb to work around
this problem.
Comment 6 joeneeman 2012-08-17 00:51:01 UTC
I'm working around it by having the GUI in a separate process, which works just fine. I guess I just had too high expectations after reading (what I now see is) your blog: http://tromey.com/blog/?p=550
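
A rough sketch of that kind of workaround, not the reporter's actual code: the GUI script runs in a separate interpreter, so whatever SIGCHLD handler the toolkit installs belongs to the child process. The helper name `run_in_child` and the plotting script are hypothetical. The default `interpreter=sys.executable` is an assumption that may not hold inside gdb's embedded Python, where it may be safer to pass an explicit path.

```python
import subprocess
import sys


def run_in_child(script_source, interpreter=sys.executable):
    """Run script_source in a separate interpreter and return its exit code.

    Any signal handlers the script installs (for example by importing a GUI
    toolkit) belong to the child process, not to the caller.
    """
    completed = subprocess.run([interpreter, "-c", script_source])
    return completed.returncode


if __name__ == "__main__":
    # Hypothetical plotting script; requires matplotlib in the child.
    plot_script = (
        "import matplotlib.pyplot as plt\n"
        "plt.plot([1, 2, 3])\n"
        "plt.show()\n"
    )
    print(run_in_child(plot_script))
```

The call blocks until the plot window is closed, which matches the blocking behaviour described in the original report.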

Thanks for your help!
Comment 7 Pedro Alves 2012-08-22 15:31:43 UTC
> In particular it is the use of SA_NOCLDSTOP.
> If I disable this (by breaking on __sigaction and
> tweaking the flags), then gdb works again.

> I am not sure what to suggest here.

Yeah.  Such is the nature of signal based interfaces.  Not much GDB could
do, other than moving all ptrace processing to a separate process (hmm, like, ..., always spawning gdbserver).
Comment 8 Tom Tromey 2012-08-22 16:58:32 UTC
I filed it upstream

https://bugreports.qt-project.org/browse/QTBUG-26947

This is yet another argument, or maybe just the same one :),
for some kind of "ptrace fd" though.
Comment 9 Marc Brünink 2013-03-07 09:16:41 UTC
Created attachment 6919 [details]
Simple multithreading that hangs gdb

Another way to reproduce similar behaviour. Run gdb with a multi-threaded program and import matplotlib.pyplot
(version 1.1.1)

gdb  ./testc_multithreading
(gdb) python import matplotlib.pyplot
(gdb) r 128
Comment 10 Tom Tromey 2015-05-24 19:34:18 UTC
(In reply to Tom Tromey from comment #5)

> I don't know of a good way for gdb to work around
> this problem.

I imagine signalfd could be used now.
Comment 11 Pedro Alves 2015-05-25 08:36:33 UTC
Not sure how.

It is worth pointing out that the Qt and kernel folks are working on an alternative that allows waiting for child exits without using signals at all:

  https://lwn.net/Articles/638613/
Comment 12 Pedro Alves 2015-05-25 08:39:13 UTC
I believe debugging against gdbserver instead of native gdb would work around this (simply because then it is gdbserver that wants SIGCHLD, not gdb), but I haven't tried it.
Comment 13 Tom Tromey 2015-05-25 20:52:04 UTC
(In reply to Pedro Alves from comment #11)
> Not sure how.

I think if gdb uses signalfd to get SIGCHLD, and blocks it (what the
man page says to do) then the various sigaction calls that libraries
may do will just be ignored.

This is kind of gross though.

> It is worth pointing out that the Qt and kernel folks working on a
> alternative that allows waiting for child exits without using signals at all:
> 
>   https://lwn.net/Articles/638613/

That does look promising.
Comment 14 Pedro Alves 2015-05-26 08:29:25 UTC
(In reply to Tom Tromey from comment #13)
> I think if gdb uses signalfd to get SIGCHLD, and blocks it (what the
> man page says to do) then the various sigaction calls that libraries
> may do will just be ignored.

Ah, yes, but then the libraries were messing with SIGCHLD's sigaction because they want to be notified of child exits, which would end up broken for them.  Seems like a "what's the better breakage" decision case, and that may indeed be better breakage.
Comment 15 Pedro Alves 2015-05-26 08:56:20 UTC
(In reply to Tom Tromey from comment #13)
> I think if gdb uses signalfd to get SIGCHLD, and blocks it (what the
> man page says to do) then the various sigaction calls that libraries
> may do will just be ignored.

Hmm, doesn't look like the sigactions are actually ignored, according to:

 http://stackoverflow.com/questions/20228092/can-there-be-a-race-between-signalfd-and-sigaction

If the handler set up by the library is called and gdb's signalfd descriptor is also woken, then it might be just what we need.
Comment 16 Pedro Alves 2015-05-26 10:26:12 UTC
Created attachment 8333 [details]
make linux-nat.c use signalfd

I tried making linux-nat.c use signalfd now.  Unless I missed something, it doesn't seem to actually work for protecting against a library stealing SIGCHLDs though, so it ends up useless.  :-/

See the #if 0 in the patch:

~~~
#if 0
      /* Hmm, if we don't do this, then gdb hangs in wait_for_sigchld.
	 That seems to mean that if a SIGCHLD signal handler is
	 called, then the signalfd file ends up with nothing to read,
	 and thus 'select' blocks forever.  Test with:

	 gdb PROGRAM -ex "maint set target-async off" -ex "set debug lin-lwp 1"

	 Which renders this approach worthless to protect against a
	 library GDB links against from stealing our SIGCHLD
	 handler... :-/
      */
~~~

If I leave that on, things work, but if I disable it, I get:

~~~
 $ gdb ~/gdb/tests/threads  -ex "maint set target-async off" -ex "set debug lin-lwp 1"
 (gdb) start
 (...)
 LLW: exit
 LLR: Preparing to resume process 15243, 0, inferior_ptid process 15243
 LLR: PTRACE_CONT process 15243, 0 (resume event thread)
 linux_nat_wait: [process 15243], []
 LLW: enter
 LNW: waitpid(-1, ...) returned 0, No child processes
 LNW: about to sigsuspend
 sigchld
 ^C^C^C^C^C^C^C^C *gdb hang*
~~~

Adding more logging in wait_for_sigchld, I could observe that
sometimes the select returns 1 and then the read fails with EAGAIN.
Treating that as "got SIGCHLD, but some handler consumed it" makes
"(gdb) start" work sometimes, but not always.  It seems that the select
only wakes up / returns 1 if the signal arrives while gdb is already
blocked inside select.  If the signal arrives and is handled before gdb
reaches select, then select hangs forever.  This kind of makes sense if
the signalfd's "data" is generated on the fly from the pending signals,
not really queued in the kernel, which is probably how this works on
the kernel side...
Comment 17 Pedro Alves 2015-05-26 10:30:17 UTC
(Note that "maint set target-async off" is not really necessary to reach the sigsuspend paths; it was just a way to make them more frequent, for quicker testing.)
Comment 18 Pedro Alves 2015-05-26 10:32:21 UTC
(... and also that if we got past inferior startup (which is always synchronous), gdb's select/poll in the main event loop hangs in the same way, when target-async is left on.)
Comment 19 Tom Tromey 2015-05-27 03:06:45 UTC
Thanks for trying that!
It's a bummer that it didn't work.

I have a workaround that works for my case -- it's a way to block SIGCHLD
in the new thread just in my application -- so I guess a better
fix can wait for the kernel to catch up.

This area is one of the main reasons I wanted PTRACE_FD...

Also I wonder whether gdb should be using pthread_sigmask
instead of sigprocmask now.
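
For illustration only, here is what a per-thread block looks like from Python, which exposes `pthread_sigmask` in its `signal` module. This is not the workaround referred to above (its code is not shown in this report), and the helper name `run_with_sigchld_blocked` is made up. The sketch only shows that the mask change stays inside the new thread. It does not stop a library from changing the process-wide SIGCHLD disposition.

```python
import signal
import threading


def run_with_sigchld_blocked(target):
    """Run target() in a new thread that blocks SIGCHLD for itself only."""
    result = {}

    def worker():
        # pthread_sigmask changes the calling thread's mask, nothing else.
        signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
        # Blocking an empty set is a no-op that returns the current mask.
        result["worker_mask"] = signal.pthread_sigmask(signal.SIG_BLOCK, set())
        result["value"] = target()

    main_before = signal.pthread_sigmask(signal.SIG_BLOCK, set())
    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    main_after = signal.pthread_sigmask(signal.SIG_BLOCK, set())

    result["main_unchanged"] = main_before == main_after
    return result


if __name__ == "__main__":
    print(run_with_sigchld_blocked(lambda: "done"))
```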
Comment 20 Doug Evans 2015-05-27 16:36:01 UTC
For reference's sake:
I think o11c suggested on #gdb having gdb start a thread to take SIGCHLDs while leaving SIGCHLD blocked in gdb, the goal being that any new thread will have SIGCHLD blocked, at least initially.
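
A small Python sketch of that arrangement, under stated assumptions: SIGCHLD is blocked before any thread is created so that new threads inherit the block, and one dedicated thread accepts the signal synchronously. The function name `take_sigchld_in_dedicated_thread` is hypothetical, and `signal.sigtimedwait` is used here as one possible way for the dedicated thread to "take" the signal, which the suggestion itself does not specify.

```python
import signal
import subprocess
import sys
import threading


def take_sigchld_in_dedicated_thread(timeout=5.0):
    """Block SIGCHLD everywhere and accept it in one dedicated thread."""
    # Block in the creating thread first: threads started later inherit it.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    seen = {}

    def sigchld_taker():
        current = signal.pthread_sigmask(signal.SIG_BLOCK, set())
        seen["inherited_block"] = signal.SIGCHLD in current
        # Accept one pending or incoming SIGCHLD synchronously.
        info = signal.sigtimedwait({signal.SIGCHLD}, timeout)
        seen["child_pid"] = None if info is None else info.si_pid

    try:
        taker = threading.Thread(target=sigchld_taker)
        taker.start()
        child = subprocess.Popen([sys.executable, "-c", "pass"])
        taker.join()   # let the dedicated thread take the signal first
        child.wait()   # then reap the child
        seen["expected_pid"] = child.pid
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return seen


if __name__ == "__main__":
    print(take_sigchld_in_dedicated_thread())
```

As with any mask-based approach, this only controls which thread receives the signal. A library that changes the SIGCHLD disposition with sigaction still affects the whole process.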
Comment 21 Pedro Alves 2015-05-27 18:36:58 UTC
Hmm, I guess I'm confused on how blocking the signal on a thread can help.  AFAICS, the main issue is with libraries changing the SIGCHLD sigaction, which is process-wide, not per-thread.  So if something sets SA_NOCLDSTOP or SIG_IGN on SIGCHLD, that applies to the whole process.

I just confirmed now that with SIGCHLD set to SA_NOCLDSTOP or SIG_IGN, nothing comes out of the signalfd either.

So the workarounds seem to me to be:

 - move ptrace handling to a separate process (either always using gdbserver, 
   or a thinner ptrace wrapper/helper)

 - or perhaps, an evil hack that reimplements sigaction+signal and 
   ignores requests to ignore SIGCHLD.  References to those functions in   
   libraries would resolve to gdb's versions, assuming gdb is not itself 
   a library...
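
The SIG_IGN half of that observation can be sketched in Python, assuming Linux semantics. This is not the signalfd test described above: `signal.sigtimedwait` stands in for reading a signalfd, since both consume pending signals, and the SA_NOCLDSTOP case is left out because Python's `signal` module offers no way to set sigaction flags. The function name `sigchld_with_sig_ign` is made up, and it must be called from the main thread.

```python
import os
import signal


def sigchld_with_sig_ign(timeout=0.5):
    """With SIGCHLD ignored, check whether a child exit is reported at all."""
    old_handler = signal.signal(signal.SIGCHLD, signal.SIG_IGN)
    if old_handler is None:
        old_handler = signal.SIG_DFL  # handler was not installed from Python
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(0)
        # A blocked signal would stay pending here, if one were generated.
        info = signal.sigtimedwait({signal.SIGCHLD}, timeout)
        try:
            os.waitpid(pid, 0)
            reaped_by_us = True
        except ChildProcessError:
            reaped_by_us = False  # the kernel reaped the child itself
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
        signal.signal(signal.SIGCHLD, old_handler)
    return info is not None, reaped_by_us


if __name__ == "__main__":
    print(sigchld_with_sig_ign())
```

The first element of the result says whether any SIGCHLD arrived, and the second whether the caller was still able to reap the child with `waitpid`.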
Comment 22 dje 2015-05-27 21:19:33 UTC
(In reply to Pedro Alves from comment #21)
> Hmm, I guess I'm confused on how blocking the signal on a thread can help. 
> AFAICS, the main issue is with libraries changing the SIGCHLD sigaction,
> which is process-wide, not per-thread.  So if something sets SA_NOCLDSTOP or
> SIG_IGN on SIGCHLD, that applies to the whole process.
> 
> I just confirmed now that with SIGCHLD set to SA_NOCLDSTOP or SIG_IGN,
> nothing comes out of the signalfd either.
> 
> So the workarounds seem to me to be:
> 
>  - move ptrace handling to a separate process (either always using
> gdbserver, 
>    or a thinner ptrace wrapper/helper)
> 
>  - or perhaps, an evil hack that reimplements sigaction+signal and 
>    ignores requests to ignore SIGCHLD.  References to those functions in   
>    libraries would resolve to gdb's versions, assuming gdb is not itself 
>    a library...

That this won't help with libraries changing SIGCHLD was left as a given, duh.

It was just offered for reference sake.
Comment 23 Pedro Alves 2015-05-28 11:45:37 UTC
> That this won't help with libraries changing SIGCHLD was left as a given, duh.
> It was just offered for reference sake.

Tromey's comment above yours suggested blocking as a workaround, and I'm left confused on how blocking is supposed to help.  I could do without the disparaging "duh"s though, thank you very much.