Created attachment 6542: Python command that hangs gdb

I'm attempting to write a gdb command in Python that displays some data in a matplotlib plot. The plotting itself works fine: the plot is shown and I can interact with it. While the plot is open, the (gdb) prompt is not shown, as mp.show() is blocking.

After I close the plot, the problems begin. I am returned to the (gdb) prompt and I can issue certain commands (for example, "list" works as expected). However, if I attempt to continue running the program being debugged (e.g. with "next" or "continue"), then gdb becomes unresponsive (including to ^C).

You can reproduce the problem with the file I have attached (requires matplotlib to be installed):

    echo "int main() { return 0; }" > gdbtest.c
    gcc -g -o gdbtest gdbtest.c
    gdb gdbtest
    (at the (gdb) prompt)
    source gdbtest2.py
    plot
    (close the plot window)
    run
    (gdb hangs)

I am using gdb 7.4.1 on Arch Linux.
I tried this with 7.4 (from CVS) on Fedora 16. I'm using:

    barimba. rpm -q python-matplotlib
    python-matplotlib-1.0.1-12.fc16.x86_64

I ran gdb on itself, tried "plot", then "run". It all seems to work for me.

I do think something could go wrong if your Python library tries to install new signal handlers, or if it calls one of the wait family of functions without specifying a PID. That doesn't seem to be happening for me, but maybe we have different versions or something.
Created attachment 6567: Qt command that hangs gdb
Thanks for following this up. I can confirm that matplotlib-1.0.1 doesn't cause a problem (on either 7.4.1 or latest CVS), but I'm still having a problem with matplotlib-1.1.1.

Actually, I can reproduce the problem with a simple Qt application (see the new attachment). I can sometimes reproduce the problem with the equivalent GTK application, but not reliably (i.e. I can sometimes go on debugging for a minute or so before it hangs).
strace catches this code changing the SIGCHLD handler:

    rt_sigaction(SIGCHLD, {0x3716d53d70, [], SA_RESTORER|SA_NOCLDSTOP, 0x36d5c0f500}, {0x5e28c3, [], SA_RESTORER|SA_RESTART, 0x36d5c0f500}, 8) = 0

I suspect this is what breaks gdb.
In particular it is the use of SA_NOCLDSTOP. If I disable this (by breaking on __sigaction and tweaking the flags), then gdb works again. I am not sure what to suggest here. You could file a bug with Qt, I suppose. That is probably a pretty long road to getting a fix. If you wanted to do a hack you could arrange to load some other .so after loading Qt, and have this .so reset the signal flags. I don't know of a good way for gdb to work around this problem.
I'm working around it by having the GUI in a separate process, which works just fine. I guess I just had too high expectations after reading (what I now see is) your blog: http://tromey.com/blog/?p=550 Thanks for your help!
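The separate-process workaround could look roughly like the sketch below. This is not the reporter's actual code: the `run_in_child` helper, the `PLOT_SCRIPT` text, and passing the data as JSON over stdin are all assumptions made for illustration. The point is only that the GUI toolkit is loaded in a child interpreter, so whatever it does to SIGCHLD never happens inside the gdb process.

```python
import json
import subprocess
import sys

# Hypothetical child-side script: reads a JSON list of numbers on stdin
# and shows it in a matplotlib window. It runs in its own process, so any
# signal handlers the GUI toolkit installs stay out of the caller.
PLOT_SCRIPT = """
import json, sys
import matplotlib.pyplot as plt
plt.plot(json.load(sys.stdin))
plt.show()
"""


def run_in_child(script, payload, interpreter=sys.executable, stdout=None):
    """Start `script` in a separate interpreter and send `payload` as JSON.

    Returns the Popen object without waiting, so the caller is not blocked
    while the child (for example a plot window) is still open.
    `interpreter` defaults to sys.executable, which is an assumption; inside
    an embedded interpreter an explicit path may be needed.
    """
    child = subprocess.Popen(
        [interpreter, "-c", script],
        stdin=subprocess.PIPE,
        stdout=stdout,
        text=True,
    )
    child.stdin.write(json.dumps(payload))
    child.stdin.close()
    return child
```

From a gdb Python command one would call something like `run_in_child(PLOT_SCRIPT, values)` and return to the prompt immediately.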
> In particular it is the use of SA_NOCLDSTOP.
> If I disable this (by breaking on __sigaction and
> tweaking the flags), then gdb works again.
> I am not sure what to suggest here.

Yeah. Such is the nature of signal-based interfaces. Not much GDB could do, other than moving all ptrace processing to a separate process (hmm, like, ..., always spawning gdbserver).
I filed it upstream:

https://bugreports.qt-project.org/browse/QTBUG-26947

This is yet another argument, or maybe just the same one :), for some kind of "ptrace fd" though.
Created attachment 6919: Simple multithreading that hangs gdb

Another way to reproduce similar behaviour: run gdb with a multi-threaded program and import matplotlib.pyplot (version 1.1.1).

    gdb ./testc_multithreading
    (gdb) python import matplotlib.pyplot
    (gdb) r 128
(In reply to Tom Tromey from comment #5)
> I don't know of a good way for gdb to work around
> this problem.

I imagine signalfd could be used now.
Not sure how.

It is worth pointing out that the Qt and kernel folks are working on an alternative that allows waiting for child exits without using signals at all:

https://lwn.net/Articles/638613/
I believe debugging against gdbserver instead of native gdb would work around this (simply because then it is gdbserver that wants SIGCHLD, not gdb), but I haven't tried it.
(In reply to Pedro Alves from comment #11)
> Not sure how.

I think if gdb uses signalfd to get SIGCHLD, and blocks it (what the man page says to do) then the various sigaction calls that libraries may do will just be ignored. This is kind of gross though.

> It is worth pointing out that the Qt and kernel folks are working on an
> alternative that allows waiting for child exits without using signals at all:
>
> https://lwn.net/Articles/638613/

That does look promising.
(In reply to Tom Tromey from comment #13)
> I think if gdb uses signalfd to get SIGCHLD, and blocks it (what the
> man page says to do) then the various sigaction calls that libraries
> may do will just be ignored.

Ah, yes, but then the libraries were messing with SIGCHLD's sigaction because they want to be notified of child exits, which would end up broken for them. Seems like a "what's the better breakage" decision case, and that may indeed be better breakage.
(In reply to Tom Tromey from comment #13)
> I think if gdb uses signalfd to get SIGCHLD, and blocks it (what the
> man page says to do) then the various sigaction calls that libraries
> may do will just be ignored.

Hmm, it doesn't look like the sigactions are actually ignored, according to:

http://stackoverflow.com/questions/20228092/can-there-be-a-race-between-signalfd-and-sigaction

If the handler set up by the library is called and gdb's signalfd descriptor is also woken, then it might be just what we need.
Created attachment 8333: make linux-nat.c use signalfd

I tried making linux-nat.c use signalfd now. Unless I missed something, it doesn't seem to actually work for protecting against a library stealing SIGCHLDs though, so it ends up useless. :-/

See the #if 0 in the patch:

~~~
#if 0
  /* Hmm, if we don't do this, then gdb hangs in wait_for_sigchld.
     That seems to mean that if a SIGCHLD signal handler is called,
     then the signalfd file ends up with nothing to read, and thus
     'select' blocks forever.  Test with:

     "gdb PROGRAM -ex "maint set target-async off" -ex "set debug lin-lwp 1"

     Which renders this approach worthless to protect against a
     library GDB links against from stealing out SIGCHLD
     handler... :-/ */
~~~

If I leave that on, things work, but if I disable it, I get:

~~~
$ gdb ~/gdb/tests/threads -ex "maint set target-async off" -ex "set debug lin-lwp 1"
(gdb) start
(...)
LLW: exit
LLR: Preparing to resume process 15243, 0, inferior_ptid process 15243
LLR: PTRACE_CONT process 15243, 0 (resume event thread)
linux_nat_wait: [process 15243], []
LLW: enter
LNW: waitpid(-1, ...) returned 0, No child processes
LNW: about to sigsuspend sigchld
^C^C^C^C^C^C^C^C
*gdb hang*
~~~

Adding more logging in wait_for_sigchld, I could observe that sometimes the select returns 1, and then the read fails with EAGAIN. Treating that as "got SIGCHLD, but some handler consumed it" makes "(gdb) start" sometimes work, but not always.

It seems that the select only wakes up / returns 1 if the signal arrives while gdb is already blocked inside select. If the signal arrives and is handled before gdb reaches select, then select hangs forever. This kind of makes sense if the signalfd's "data" is generated on the fly from the pending signals, not really queued in the kernel, which is probably how this works on the kernel side...
(Note that "maint set target-async off" is not really necessary to reach the sigsuspend paths; it was just a way to make them more frequent, for quicker testing.)
(... and also that if we got past inferior startup (which is always synchronous), gdb's select/poll in the main event loop hangs in the same way, when target-async is left on.)
Thanks for trying that! It's a bummer that it didn't work. I have a workaround that works for my case -- it's a way to block SIGCHLD in the new thread just in my application -- so I guess a better fix can wait for the kernel to catch up. This area is one of the main reasons I wanted PTRACE_FD... Also I wonder whether gdb should be using pthread_sigmask instead of sigprocmask now.
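For readers wondering what "block SIGCHLD in the new thread" can look like in practice, here is a minimal sketch using only the Python standard library. It is not the workaround actually used above, which is not shown in this thread; the helper name `start_thread_with_sigchld_blocked` is made up for illustration. It relies on the fact that a new thread inherits the signal mask of the thread that creates it. Note that this only touches the per-thread mask; it does nothing about a library changing the process-wide SIGCHLD disposition.

```python
import signal
import threading


def start_thread_with_sigchld_blocked(target, *args):
    """Start a thread that begins life with SIGCHLD blocked.

    The creating thread blocks SIGCHLD just long enough for the new
    thread to inherit that mask, then restores its own previous mask.
    """
    # pthread_sigmask returns the mask that was in effect before the call.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    try:
        worker = threading.Thread(target=target, args=args)
        worker.start()  # the mask is inherited when the thread is created
    finally:
        # Put the caller's mask back exactly as it was.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return worker
```

The worker can confirm its own mask with `signal.pthread_sigmask(signal.SIG_BLOCK, [])`, which returns the current mask without changing it.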
For reference's sake: I think o11c suggested on #gdb having gdb start a thread to take SIGCHLDs while leaving SIGCHLD blocked in gdb, the goal being that any new thread will have SIGCHLD blocked, at least initially.
Hmm, I guess I'm confused on how blocking the signal on a thread can help. AFAICS, the main issue is with libraries changing the SIGCHLD sigaction, which is process-wide, not per-thread. So if something sets SA_NOCLDSTOP or SIG_IGN on SIGCHLD, that applies to the whole process.

I just confirmed now that with SIGCHLD set to SA_NOCLDSTOP or SIG_IGN, nothing comes out of the signalfd either.

So the workarounds seem to me to be:

- move ptrace handling to a separate process (either always using gdbserver, or a thinner ptrace wrapper/helper)

- or perhaps, an evil hack that reimplements sigaction+signal and ignores requests to ignore SIGCHLD. References to those functions in libraries would resolve to gdb's versions, assuming gdb is not itself a library...
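The process-wide versus per-thread distinction can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not anything from gdb: the function name `disposition_vs_mask` is made up, it must be called from the main thread (Python only allows `signal.signal` there), and `signal.getsignal` reports the handler as recorded by Python rather than querying the kernel.

```python
import signal
import threading


def disposition_vs_mask():
    """Show that a handler is shared by all threads but a mask is not."""
    def handler(signum, frame):
        pass

    result = {}
    main_mask_before = signal.pthread_sigmask(signal.SIG_BLOCK, [])
    # Install a SIGCHLD handler from the main thread (process-wide).
    previous = signal.signal(signal.SIGCHLD, handler)

    def worker():
        # Blocking here changes only this thread's mask.
        signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
        mask = signal.pthread_sigmask(signal.SIG_BLOCK, [])
        result["worker_blocked"] = signal.SIGCHLD in mask
        # The handler installed by the main thread is visible here too.
        result["worker_sees_handler"] = (
            signal.getsignal(signal.SIGCHLD) is handler
        )

    try:
        thread = threading.Thread(target=worker)
        thread.start()
        thread.join()
        main_mask_after = signal.pthread_sigmask(signal.SIG_BLOCK, [])
        result["main_mask_unchanged"] = main_mask_before == main_mask_after
    finally:
        # Restore the earlier handler; None means it was not set from Python.
        if previous is not None:
            signal.signal(signal.SIGCHLD, previous)
    return result
```

The worker's mask change stays local to the worker, while the handler set in the main thread applies everywhere, which is why blocking in one thread cannot undo a disposition change made by a library.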
(In reply to Pedro Alves from comment #21)
> Hmm, I guess I'm confused on how blocking the signal on a thread can help.
> AFAICS, the main issue is with libraries changing the SIGCHLD sigaction,
> which is process-wide, not per-thread. So if something sets SA_NOCLDSTOP or
> SIG_IGN on SIGCHLD, that applies to the whole process.
>
> I just confirmed now that with SIGCHLD set to SA_NOCLDSTOP or SIG_IGN,
> nothing comes out of the signalfd either.
>
> So the workarounds seem to me to be:
>
> - move ptrace handling to a separate process (either always using
>   gdbserver, or a thinner ptrace wrapper/helper)
>
> - or perhaps, an evil hack that reimplements sigaction+signal and
>   ignores requests to ignore SIGCHLD. References to those functions in
>   libraries would resolve to gdb's versions, assuming gdb is not itself
>   a library...

That this won't help with libraries changing SIGCHLD was left as a given, duh. It was just offered for reference sake.
> That this won't help with libraries changing SIGCHLD was left as a given, duh.
> It was just offered for reference sake.

Tromey's comment above yours suggested blocking as a workaround, and I'm left confused on how blocking is supposed to help.

I could do without the disparaging "duh"s though, thank you very much.
FWIW we have gdb.block_signals and gdb.Thread now. This makes it a little easier to handle the threading case.

In the long run either:

1. we should always have a gdbserver; lldb does this, or
2. some kind of ptrace fd

gdb doesn't need kernel help in order to implement #1, so maybe that should be preferred.