When a program forks and the fork childs start threads, some newly created threads are left stopped by gdb. It's easy to reproduce with the following example: https://github.com/simark/gdb-fork-threads-test/ It worked with gdb 7.9, I bisected it and found that this commit introduces the regression. commit 2db9a4275ceada4aad3443dc157b96dd2e23afc0 Author: Pedro Alves <palves@redhat.com> Date: Fri Feb 20 20:21:59 2015 +0000 GNU/Linux: Stop using libthread_db/td_ta_thr_iter
Relevant discussion thread: https://sourceware.org/ml/gdb-patches/2015-07/msg00153.html
The master branch has been updated by Pedro Alves <palves@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=4dd63d488a76482543517c4c4cde699ee6fa33ef commit 4dd63d488a76482543517c4c4cde699ee6fa33ef Author: Pedro Alves <palves@redhat.com> Date: Thu Jul 30 18:50:29 2015 +0100 PR threads/18600: Threads left stopped after fork+thread spawn When a program forks and another process start threads while gdb is handling the fork event, newly created threads are left stuck stopped by gdb, even though gdb presents them as "running", to the user. This can be seen with the test added by this patch. The test has the inferior fork a certain number of times and waits for all children to exit. Each fork child spawns a number of threads that do nothing and joins them immediately. Normally, the program should run unimpeded (from the point of view of the user) and exit very quickly. Without this fix, it doesn't because of some threads left stopped by gdb, so inferior 1 never exits. The program triggers when a new clone thread is found while inside the linux_stop_and_wait_all_lwps call in linux-thread-db.c: linux_stop_and_wait_all_lwps (); ALL_LWPS (lp) if (ptid_get_pid (lp->ptid) == pid) thread_from_lwp (lp->ptid); linux_unstop_all_lwps (); Within linux_stop_and_wait_all_lwps, we reach linux_handle_extended_wait with the "stopping" parameter set to 1, and because of that we don't mark the new lwp as resumed. As consequence, the subsequent resume_stopped_resumed_lwps, called from linux_unstop_all_lwps, never resumes the new LWP. There's lots of cruft in linux_handle_extended_wait that no longer makes sense. On systems with CLONE events support, we don't rely on libthread_db for thread listing anymore, so the code that preserves stop_requested and the handling of last_resume_kind is all dead. So the fix is to remove all that, and simply always mark the new LWP as resumed, so that resume_stopped_resumed_lwps re-resumes it. gdb/ChangeLog: 2015-07-30 Pedro Alves <palves@redhat.com> Simon Marchi <simon.marchi@ericsson.com> PR threads/18600 * linux-nat.c (linux_handle_extended_wait): On CLONE event, always mark the new thread as resumed. Remove STOPPING parameter. (wait_lwp): Adjust call to linux_handle_extended_wait. (linux_nat_filter_event): Adjust call to linux_handle_extended_wait. (resume_stopped_resumed_lwps): Add debug output. gdb/testsuite/ChangeLog: 2015-07-30 Simon Marchi <simon.marchi@ericsson.com> Pedro Alves <palves@redhat.com> PR threads/18600 * gdb.threads/fork-plus-threads.c: New file. * gdb.threads/fork-plus-threads.exp: New file.
The master branch has been updated by Pedro Alves <palves@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=69dde7dcb81f6baf2b823dcc03e040c29ee5de7d commit 69dde7dcb81f6baf2b823dcc03e040c29ee5de7d Author: Pedro Alves <palves@redhat.com> Date: Wed Jul 22 18:01:46 2015 +0100 PR threads/18600: Inferiors left around after fork+thread spawn The new gdb.threads/fork-plus-threads.exp test exposes one more problem. When one types "info inferiors" after running the program, one see's a couple inferior left still, while there should only be inferior #1 left. E.g.: (gdb) info inferiors Num Description Executable 4 process 8393 /home/pedro/bugs/src/test 2 process 8388 /home/pedro/bugs/src/test * 1 <null> /home/pedro/bugs/src/test (gdb) info threads Calling prune_inferiors() manually at this point (from a top gdb) does not remove them, because they still have inf->pid != 0 (while they shouldn't). This suggests that we never mourned those inferiors. Enabling logs (master + previous patch) we see: ... WL: waitpid Thread 0x7ffff7fc2740 (LWP 9513) received Trace/breakpoint trap (stopped) WL: Handling extended status 0x03057f LHEW: Got clone event from LWP 9513, new child is LWP 9579 [New Thread 0x7ffff37b8700 (LWP 9579)] WL: waitpid Thread 0x7ffff7fc2740 (LWP 9508) received 0 (exited) WL: Thread 0x7ffff7fc2740 (LWP 9508) exited. ^^^^^^^^ [Thread 0x7ffff7fc2740 (LWP 9508) exited] WL: waitpid Thread 0x7ffff7fc2740 (LWP 9499) received 0 (exited) WL: Thread 0x7ffff7fc2740 (LWP 9499) exited. [Thread 0x7ffff7fc2740 (LWP 9499) exited] RSRL: resuming stopped-resumed LWP Thread 0x7ffff37b8700 (LWP 9579) at 0x3615ef4ce1: step=0 ... (gdb) info inferiors Num Description Executable 5 process 9508 /home/pedro/bugs/src/test ^^^^ 4 process 9503 /home/pedro/bugs/src/test 3 process 9500 /home/pedro/bugs/src/test 2 process 9499 /home/pedro/bugs/src/test * 1 <null> /home/pedro/bugs/src/test (gdb) ... Note the "Thread 0x7ffff7fc2740 (LWP 9508) exited." line. That's this in wait_lwp: /* Check if the thread has exited. */ if (WIFEXITED (status) || WIFSIGNALED (status)) { thread_dead = 1; if (debug_linux_nat) fprintf_unfiltered (gdb_stdlog, "WL: %s exited.\n", target_pid_to_str (lp->ptid)); } } That was the leader thread reporting an exit, meaning the whole process is gone. So the problem is that this code doesn't understand that an WIFEXITED status of the leader LWP should be reported to infrun as process exit. gdb/ChangeLog: 2015-07-30 Pedro Alves <palves@redhat.com> PR threads/18600 * linux-nat.c (wait_lwp): Report to the core when thread group leader exits. gdb/testsuite/ChangeLog: 2015-07-30 Pedro Alves <palves@redhat.com> PR threads/18600 * gdb.threads/fork-plus-threads.exp: Test that "info inferiors" only shows inferior 1.
The gdb-7.10-branch branch has been updated by Pedro Alves <palves@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=b31f5e79aac67f61a684c3af281caebd2ac1eece commit b31f5e79aac67f61a684c3af281caebd2ac1eece Author: Pedro Alves <palves@redhat.com> Date: Thu Jul 30 18:55:36 2015 +0100 PR threads/18600: Threads left stopped after fork+thread spawn When a program forks and another process start threads while gdb is handling the fork event, newly created threads are left stuck stopped by gdb, even though gdb presents them as "running", to the user. This can be seen with the test added by this patch. The test has the inferior fork a certain number of times and waits for all children to exit. Each fork child spawns a number of threads that do nothing and joins them immediately. Normally, the program should run unimpeded (from the point of view of the user) and exit very quickly. Without this fix, it doesn't because of some threads left stopped by gdb, so inferior 1 never exits. The program triggers when a new clone thread is found while inside the linux_stop_and_wait_all_lwps call in linux-thread-db.c: linux_stop_and_wait_all_lwps (); ALL_LWPS (lp) if (ptid_get_pid (lp->ptid) == pid) thread_from_lwp (lp->ptid); linux_unstop_all_lwps (); Within linux_stop_and_wait_all_lwps, we reach linux_handle_extended_wait with the "stopping" parameter set to 1, and because of that we don't mark the new lwp as resumed. As consequence, the subsequent resume_stopped_resumed_lwps, called from linux_unstop_all_lwps, never resumes the new LWP. There's lots of cruft in linux_handle_extended_wait that no longer makes sense. On systems with CLONE events support, we don't rely on libthread_db for thread listing anymore, so the code that preserves stop_requested and the handling of last_resume_kind is all dead. So the fix is to remove all that, and simply always mark the new LWP as resumed, so that resume_stopped_resumed_lwps re-resumes it. gdb/ChangeLog: 2015-07-30 Pedro Alves <palves@redhat.com> Simon Marchi <simon.marchi@ericsson.com> PR threads/18600 * linux-nat.c (linux_handle_extended_wait): On CLONE event, always mark the new thread as resumed. Remove STOPPING parameter. (wait_lwp): Adjust call to linux_handle_extended_wait. (linux_nat_filter_event): Adjust call to linux_handle_extended_wait. (resume_stopped_resumed_lwps): Add debug output. gdb/testsuite/ChangeLog: 2015-07-30 Simon Marchi <simon.marchi@ericsson.com> Pedro Alves <palves@redhat.com> PR threads/18600 * gdb.threads/fork-plus-threads.c: New file. * gdb.threads/fork-plus-threads.exp: New file.
The gdb-7.10-branch branch has been updated by Pedro Alves <palves@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=7476be08b73fdfba4eb91d891b235d4cf2e70f3b commit 7476be08b73fdfba4eb91d891b235d4cf2e70f3b Author: Pedro Alves <palves@redhat.com> Date: Thu Jul 30 18:55:36 2015 +0100 PR threads/18600: Inferiors left around after fork+thread spawn The new gdb.threads/fork-plus-threads.exp test exposes one more problem. When one types "info inferiors" after running the program, one see's a couple inferior left still, while there should only be inferior #1 left. E.g.: (gdb) info inferiors Num Description Executable 4 process 8393 /home/pedro/bugs/src/test 2 process 8388 /home/pedro/bugs/src/test * 1 <null> /home/pedro/bugs/src/test (gdb) info threads Calling prune_inferiors() manually at this point (from a top gdb) does not remove them, because they still have inf->pid != 0 (while they shouldn't). This suggests that we never mourned those inferiors. Enabling logs (master + previous patch) we see: ... WL: waitpid Thread 0x7ffff7fc2740 (LWP 9513) received Trace/breakpoint trap (stopped) WL: Handling extended status 0x03057f LHEW: Got clone event from LWP 9513, new child is LWP 9579 [New Thread 0x7ffff37b8700 (LWP 9579)] WL: waitpid Thread 0x7ffff7fc2740 (LWP 9508) received 0 (exited) WL: Thread 0x7ffff7fc2740 (LWP 9508) exited. ^^^^^^^^ [Thread 0x7ffff7fc2740 (LWP 9508) exited] WL: waitpid Thread 0x7ffff7fc2740 (LWP 9499) received 0 (exited) WL: Thread 0x7ffff7fc2740 (LWP 9499) exited. [Thread 0x7ffff7fc2740 (LWP 9499) exited] RSRL: resuming stopped-resumed LWP Thread 0x7ffff37b8700 (LWP 9579) at 0x3615ef4ce1: step=0 ... (gdb) info inferiors Num Description Executable 5 process 9508 /home/pedro/bugs/src/test ^^^^ 4 process 9503 /home/pedro/bugs/src/test 3 process 9500 /home/pedro/bugs/src/test 2 process 9499 /home/pedro/bugs/src/test * 1 <null> /home/pedro/bugs/src/test (gdb) ... Note the "Thread 0x7ffff7fc2740 (LWP 9508) exited." line. That's this in wait_lwp: /* Check if the thread has exited. */ if (WIFEXITED (status) || WIFSIGNALED (status)) { thread_dead = 1; if (debug_linux_nat) fprintf_unfiltered (gdb_stdlog, "WL: %s exited.\n", target_pid_to_str (lp->ptid)); } } That was the leader thread reporting an exit, meaning the whole process is gone. So the problem is that this code doesn't understand that an WIFEXITED status of the leader LWP should be reported to infrun as process exit. gdb/ChangeLog: 2015-07-30 Pedro Alves <palves@redhat.com> PR threads/18600 * linux-nat.c (wait_lwp): Report to the core when thread group leader exits. gdb/testsuite/ChangeLog: 2015-07-30 Pedro Alves <palves@redhat.com> PR threads/18600 * gdb.threads/fork-plus-threads.exp: Test that "info inferiors" only shows inferior 1.
Fixed.
The master branch has been updated by Pedro Alves <palves@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=f0ce0d3a331129309a46a6a9ac85fce35acae72b commit f0ce0d3a331129309a46a6a9ac85fce35acae72b Author: Pedro Alves <palves@redhat.com> Date: Thu Jul 23 16:01:01 2015 +0100 gdbserver: move_out_of_jump_pad_callback misses switching current thread While hacking on the fix for PR threads/18600 (Threads left stopped after fork+thread spawn), I once saw its test (fork-plus-threads.exp) FAIL against gdbserver because move_out_of_jump_pad_callback has a gdb_breakpoint_here call, and the caller isn't making sure the current thread points to the right thread. In the case I saw, the current thread pointed to the wrong process, so gdb_breakpoint_here returned the wrong answer. Unfortunately I didn't save logs. Still, seems obvious enough and it should fix a potential occasional racy FAIL. Tested on x86_64 Fedora 20. gdb/gdbserver/ChangeLog: 2015-08-06 Pedro Alves <palves@redhat.com> * linux-low.c (move_out_of_jump_pad_callback): Temporarily switch the current thread.
The gdb-7.10-branch branch has been updated by Pedro Alves <palves@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=a01c46276aa3a16c2ac1dae249df14e39d1c281f commit a01c46276aa3a16c2ac1dae249df14e39d1c281f Author: Pedro Alves <palves@redhat.com> Date: Thu Aug 6 15:02:26 2015 +0100 gdbserver: move_out_of_jump_pad_callback misses switching current thread While hacking on the fix for PR threads/18600 (Threads left stopped after fork+thread spawn), I once saw its test (fork-plus-threads.exp) FAIL against gdbserver because move_out_of_jump_pad_callback has a gdb_breakpoint_here call, and the caller isn't making sure the current thread points to the right thread. In the case I saw, the current thread pointed to the wrong process, so gdb_breakpoint_here returned the wrong answer. Unfortunately I didn't save logs. Still, seems obvious enough and it should fix a potential occasional racy FAIL. Tested on x86_64 Fedora 20. gdb/gdbserver/ChangeLog: 2015-08-06 Pedro Alves <palves@redhat.com> * linux-low.c (move_out_of_jump_pad_callback): Temporarily switch the current thread.