Bug 16168 - Signal heavy execution + repeated breakpoint locks up gdbserver
Summary: Signal heavy execution + repeated breakpoint locks up gdbserver
Status: NEW
Alias: None
Product: gdb
Classification: Unclassified
Component: server
Version: HEAD
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-11-13 20:28 UTC by Sterling Augustine
Modified: 2014-11-23 15:33 UTC
CC List: 3 users

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
files to reproduce. (2.50 KB, application/x-tar)
2013-11-13 20:28 UTC, Sterling Augustine
More elaborate test case (1.56 KB, application/x-tar)
2013-11-14 01:06 UTC, Sterling Augustine

Description Sterling Augustine 2013-11-13 20:28:32 UTC
Created attachment 7276
files to reproduce.

The attached tar file includes a source file, a bash script, and a gdb script that together expose a bug in gdbserver's signal handling.

You can simply run sh doit.sh to reproduce the problem.

gdbserver attaches to a multi-threaded application that is also receiving SIGPROF signals.

The program repeatedly hits a breakpoint in some of the threads, and continues.

At some point, a SIGPROF leaves one thread with a pending signal, so gdbserver elects not to restart all threads.

In the included gdbserver.log file, this is the line: "Not resuming, all-stop and found an LWP with pending status."

The only thread ever restarted is the pending one. Eventually, this thread runs out of work and the system locks up.

There is a race involved, so you may need to run it a couple of times. Sometimes it happens very early, and these are the easiest logs to study.
Comment 1 Sterling Augustine 2013-11-14 01:06:39 UTC
Created attachment 7279
More elaborate test case

This newly uploaded file is a test case for the patch proposed at:

https://sourceware.org/ml/gdb-patches/2013-11/msg00361.html
Comment 2 dje 2013-12-04 19:35:59 UTC
What happens here is this:

1) This is all-stop, SIGPROF is active, and a thread hits a breakpoint.
gdbserver stops all threads, and while stopping all threads one thread gets a SIGPROF.

2) gdb then advances the breakpointed thread past the breakpoint and then resumes all threads.

3) gdbserver gets the resume request and looks for a thread with a pending signal, finds it (the SIGPROF'd thread), and leaves all threads stopped knowing linux_wait_for_event will find the thread with status_pending_p (there could be more than one of course).

4) gdbserver then enters wait processing for all threads. linux_wait_for_event finds the SIGPROF'd thread, linux_wait_1 forwards the SIGPROF on to the inferior, and gdbserver goes back to waiting for all threads.

5) At this point only the SIGPROF'd thread is running and linux_wait_1 is waiting for an event worthy of reporting back to gdb.
gdbserver sees the SIGSTOP that was sent earlier to stop all threads, knows it no longer cares about it, resumes the thread, and goes back to waiting for all threads. The thread continues to receive SIGPROF signals, which are continually forwarded on, and eventually the thread exits.

6) At this point gdbserver is hung waiting for an event from some thread, but no threads are running.
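
The sequence above can be sketched as a toy model. This is only an illustration of steps 3 to 6, not gdbserver code: the Thread class, resume_all, wait_for_event, and the work_left counter are all hypothetical names invented for the sketch, and the real logic lives in C inside gdbserver.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Thread:
    """Hypothetical stand-in for an LWP as gdbserver tracks it."""
    name: str
    work_left: int                 # units of work before the thread exits
    pending: Optional[str] = None  # signal collected while stopping all threads
    running: bool = False
    exited: bool = False


def resume_all(threads: List[Thread]) -> bool:
    """Step 3: in all-stop, leave everything stopped if any status is pending."""
    if any(t.pending for t in threads):
        return False  # "Not resuming, all-stop and found an LWP with pending status."
    for t in threads:
        if not t.exited:
            t.running = True
    return True


def wait_for_event(threads: List[Thread], max_iterations: int = 100) -> str:
    """Steps 4-6: forward the pass-through signal and resume only that thread."""
    for _ in range(max_iterations):
        pending = next((t for t in threads if t.pending), None)
        if pending is not None:
            # The signal is not reported to gdb: it is forwarded to the
            # inferior and only this one thread is set running again.
            pending.pending = None
            pending.running = True
            continue
        runners = [t for t in threads if t.running]
        if not runners:
            # Nothing is running and nothing is pending, so a real wait
            # would block forever. The model reports that instead.
            return "hang"
        for t in runners:
            t.work_left -= 1
            if t.work_left <= 0:
                t.running = False
                t.exited = True
    return "still running"


def simulate() -> str:
    threads = [
        Thread("breakpointed", work_left=5),
        Thread("sigprofd", work_left=3, pending="SIGPROF"),
        Thread("other", work_left=5),
    ]
    resumed = resume_all(threads)
    assert not resumed  # the resume request is skipped
    return wait_for_event(threads)
```

In this model, simulate() ends in the "hang" state: the only thread ever restarted is the one that had the pending SIGPROF, and once it exits the two remaining threads are still stopped with work left to do.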

From a high-level perspective, if we want to keep the "any_pending" processing, a signal gdb doesn't care about is different from a signal gdb does care about, and the "any_pending" processing that gdbserver does applies only to the latter, not the former.  E.g., if there are 10 threads to be resumed, 1 of which is a "normal" resume after a SIGSTOP, and 9 have different signals all marked as "nostop noprint pass", then that is no different from having the same 10 threads all marked for "normal" resumption: resume them all in the way appropriate for each thread.
Thus, from a high-level perspective, it would be nice to distinguish signals this way.  Whether that's actually easy or possible in the implementation remains to be seen.
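
As a rough illustration of the distinction proposed here, the check could look something like the sketch below. It is not the patch posted to gdb-patches and not gdbserver's actual code; PASS_THROUGH and should_skip_resume are hypothetical names, and the set of pass-through signals is assumed to be known to the server.

```python
from typing import Iterable, Optional, Set

# Assumption for this sketch: signals the user has set to
# "nostop noprint pass", which gdb does not want reported.
PASS_THROUGH: Set[str] = {"SIGPROF"}


def should_skip_resume(
    pending_signals: Iterable[Optional[str]],
    pass_through: Set[str] = PASS_THROUGH,
) -> bool:
    """Return True only if some thread has a pending signal gdb cares about.

    Each entry is the pending signal for one thread, or None if the
    thread has nothing pending. Pass-through signals do not count.
    """
    return any(
        sig is not None and sig not in pass_through
        for sig in pending_signals
    )
```

With this check, one thread resuming normally plus nine threads holding only pass-through signals is treated the same as ten normal resumes, while a single pending SIGTRAP still keeps all threads stopped.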
Comment 3 dje 2013-12-04 19:37:17 UTC
(In reply to dje from comment #2)
> What happens here is this:

For completeness' sake, that's from analyzing the hang using thread-test-2 in the attached testcase.
Comment 4 eclipsehivernale 2014-11-23 15:33:53 UTC
I am a software developer working on a multi-threaded application (about 10 threads).
Recently we decided to use tcmalloc instead of the glibc malloc.
It is an open-source malloc from Google, optimized for multi-threaded allocation.

Since this change, it is impossible to use gdbserver.
SIGPROF signal management is automatic in the tcmalloc library.
After a few "next" operations, gdbserver hangs, waiting for a pending event from a thread that has received a SIGPROF signal, exactly as you describe in your comment.

It is still possible to use gdb directly on the remote target, but this is a waste of time.
I also once observed gdb hang in a native configuration, but I can't say for sure it is the same issue, as I just killed it and tried again.

I tested the patch you posted: https://sourceware.org/ml/gdb-patches/2013-11/msg00361.html and it seems to work fine on 7.8.50.20141107.

There are other freezes/hangs reported in the Bugzilla database that may be linked to this issue, since it can appear with any running operation (next, step, break...) and every gdb version so far is affected.

I think more and more people will face this issue (tcmalloc + multi-threaded application without control over SIGPROF), and I would like to push for a fix to be integrated in the next version of gdb.

Anyway, thanks a lot for the investigation and the suggested fix.
If no action is taken to fix gdb then I guess I will use your fix locally forever.