Bug 18600 - After forking and threads spawning, gdb leaves newly created threads stopped
Summary: After forking and threads spawning, gdb leaves newly created threads stopped
Status: RESOLVED FIXED
Alias: None
Product: gdb
Classification: Unclassified
Component: threads
Version: HEAD
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2015-06-25 17:08 UTC by Simon Marchi
Modified: 2015-08-06 14:03 UTC
Description Simon Marchi 2015-06-25 17:08:03 UTC
When a program forks and the fork children start threads, some newly created threads are left stopped by gdb. It's easy to reproduce with the following example:

https://github.com/simark/gdb-fork-threads-test/

It worked with gdb 7.9. I bisected it and found that the following commit introduced the regression:

commit 2db9a4275ceada4aad3443dc157b96dd2e23afc0
Author: Pedro Alves <palves@redhat.com>
Date:   Fri Feb 20 20:21:59 2015 +0000

    GNU/Linux: Stop using libthread_db/td_ta_thr_iter
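
A reproducer of the shape described in this report (a parent that forks several children, each of which starts and joins threads) might look roughly like the sketch below. It is not the code from the linked repository or from the gdb testsuite; the names `fork_plus_threads`, `run_child` and `thread_func` are made up for illustration, and each child creates and joins its threads one at a time to keep the sketch short. It assumes a POSIX system and would typically be built with `-pthread`.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Thread body: does nothing and returns at once.  */
static void *
thread_func (void *arg)
{
  return arg;
}

/* Child side: create and join NTHREADS threads one after another,
   then exit.  Never returns.  */
static void
run_child (int nthreads)
{
  for (int i = 0; i < nthreads; i++)
    {
      pthread_t thread;

      if (pthread_create (&thread, NULL, thread_func, NULL) != 0)
        _exit (1);
      if (pthread_join (thread, NULL) != 0)
        _exit (1);
    }
  _exit (0);
}

/* Fork NFORKS children, each running run_child, then wait for all of
   them.  Returns the number of failures (0 means every child was
   forked and exited with status 0), or -1 if out of memory.  */
int
fork_plus_threads (int nforks, int nthreads)
{
  assert (nforks >= 0 && nthreads >= 0);
  if (nforks == 0)
    return 0;

  pid_t *pids = malloc (nforks * sizeof *pids);
  if (pids == NULL)
    return -1;

  int started = 0, failures = 0;
  for (; started < nforks; started++)
    {
      pid_t pid = fork ();

      if (pid == -1)
        {
          failures++;
          break;
        }
      if (pid == 0)
        run_child (nthreads);   /* Child: does not return.  */
      pids[started] = pid;
    }

  /* Parent: reap exactly the children forked above.  */
  for (int i = 0; i < started; i++)
    {
      int status;

      if (waitpid (pids[i], &status, 0) != pids[i]
          || !WIFEXITED (status) || WEXITSTATUS (status) != 0)
        failures++;
    }

  free (pids);
  return failures;
}
```

Calling `fork_plus_threads (10, 10)` from a `main` function and running the program under gdb is one way to exercise this pattern; the counts are arbitrary.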
Comment 1 Simon Marchi 2015-07-07 22:11:13 UTC
Relevant discussion thread: https://sourceware.org/ml/gdb-patches/2015-07/msg00153.html
Comment 2 cvs-commit@gcc.gnu.org 2015-07-30 17:55:16 UTC
The master branch has been updated by Pedro Alves <palves@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=4dd63d488a76482543517c4c4cde699ee6fa33ef

commit 4dd63d488a76482543517c4c4cde699ee6fa33ef
Author: Pedro Alves <palves@redhat.com>
Date:   Thu Jul 30 18:50:29 2015 +0100

    PR threads/18600: Threads left stopped after fork+thread spawn
    
    When a program forks and another process starts threads while gdb is
    handling the fork event, newly created threads are left stuck stopped
    by gdb, even though gdb presents them as "running" to the user.
    
    This can be seen with the test added by this patch.  The test has the
    inferior fork a certain number of times and waits for all children to
    exit.  Each fork child spawns a number of threads that do nothing and
    joins them immediately.  Normally, the program should run unimpeded
    (from the point of view of the user) and exit very quickly.  Without
    this fix, it doesn't, because gdb leaves some threads stopped, so
    inferior 1 never exits.
    
    The problem triggers when a new clone thread is found while inside the
    linux_stop_and_wait_all_lwps call in linux-thread-db.c:
    
          linux_stop_and_wait_all_lwps ();
    
          ALL_LWPS (lp)
    	if (ptid_get_pid (lp->ptid) == pid)
    	  thread_from_lwp (lp->ptid);
    
          linux_unstop_all_lwps ();
    
    Within linux_stop_and_wait_all_lwps, we reach
    linux_handle_extended_wait with the "stopping" parameter set to 1, and
    because of that we don't mark the new lwp as resumed.  As a consequence,
    the subsequent resume_stopped_resumed_lwps, called from
    linux_unstop_all_lwps, never resumes the new LWP.
    
    There's lots of cruft in linux_handle_extended_wait that no longer
    makes sense.  On systems with CLONE events support, we don't rely on
    libthread_db for thread listing anymore, so the code that preserves
    stop_requested and the handling of last_resume_kind is all dead.
    
    So the fix is to remove all that, and simply always mark the new LWP
    as resumed, so that resume_stopped_resumed_lwps re-resumes it.
    
    gdb/ChangeLog:
    2015-07-30  Pedro Alves  <palves@redhat.com>
    	    Simon Marchi  <simon.marchi@ericsson.com>
    
    	PR threads/18600
    	* linux-nat.c (linux_handle_extended_wait): On CLONE event, always
    	mark the new thread as resumed.  Remove STOPPING parameter.
    	(wait_lwp): Adjust call to linux_handle_extended_wait.
    	(linux_nat_filter_event): Adjust call to
    	linux_handle_extended_wait.
    	(resume_stopped_resumed_lwps): Add debug output.
    
    gdb/testsuite/ChangeLog:
    2015-07-30  Simon Marchi  <simon.marchi@ericsson.com>
    	    Pedro Alves  <palves@redhat.com>
    
    	PR threads/18600
    	* gdb.threads/fork-plus-threads.c: New file.
    	* gdb.threads/fork-plus-threads.exp: New file.
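
The mechanism described in the commit message above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not gdb's code: `struct toy_lwp` and the `toy_*` functions are hypothetical stand-ins that keep just two flags per LWP, and the old behaviour is reduced to "only mark the new LWP as resumed when not stopping".

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for per-LWP bookkeeping: only the two flags that
   matter for this bug.  */
struct toy_lwp
{
  int stopped;   /* The LWP is currently stopped.  */
  int resumed;   /* The LWP is supposed to be running.  */
};

/* Model of handling a clone event.  The new LWP starts out stopped.
   With ALWAYS_MARK_RESUMED == 0 (old behaviour in this model) it is
   only marked resumed when STOPPING is 0; with ALWAYS_MARK_RESUMED
   != 0 (the fix) it is always marked resumed.  */
void
toy_handle_clone (struct toy_lwp *new_lp, int stopping,
                  int always_mark_resumed)
{
  new_lp->stopped = 1;
  new_lp->resumed = always_mark_resumed || !stopping;
}

/* Model of the re-resume pass: set running every LWP that is both
   stopped and marked resumed.  Returns how many were set running.  */
size_t
toy_resume_stopped_resumed (struct toy_lwp *lwps, size_t n)
{
  size_t count = 0;

  for (size_t i = 0; i < n; i++)
    if (lwps[i].stopped && lwps[i].resumed)
      {
        lwps[i].stopped = 0;
        count++;
      }
  return count;
}

/* Scenario: a clone event arrives while all LWPs are being stopped,
   then the re-resume pass runs.  Returns 1 if the new LWP is left
   stopped, 0 otherwise.  */
int
toy_new_lwp_left_stopped (int always_mark_resumed)
{
  struct toy_lwp lwps[2] = { { 1, 1 }, { 0, 0 } };

  toy_handle_clone (&lwps[1], 1, always_mark_resumed);
  toy_resume_stopped_resumed (lwps, 2);
  assert (!lwps[0].stopped);    /* The pre-existing LWP always runs.  */
  return lwps[1].stopped;
}
```

In this model, `toy_new_lwp_left_stopped (0)` returns 1 (the new LWP stays stopped, as in the bug), while `toy_new_lwp_left_stopped (1)` returns 0.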
Comment 3 cvs-commit@gcc.gnu.org 2015-07-30 17:55:21 UTC
The master branch has been updated by Pedro Alves <palves@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=69dde7dcb81f6baf2b823dcc03e040c29ee5de7d

commit 69dde7dcb81f6baf2b823dcc03e040c29ee5de7d
Author: Pedro Alves <palves@redhat.com>
Date:   Wed Jul 22 18:01:46 2015 +0100

    PR threads/18600: Inferiors left around after fork+thread spawn
    
    The new gdb.threads/fork-plus-threads.exp test exposes one more
    problem.  When one types "info inferiors" after running the program,
    one sees a couple of inferiors still left, while there should only be
    inferior #1 left.  E.g.:
    
     (gdb) info inferiors
       Num  Description       Executable
       4    process 8393      /home/pedro/bugs/src/test
       2    process 8388      /home/pedro/bugs/src/test
     * 1    <null>            /home/pedro/bugs/src/test
     (gdb) info threads
    
    Calling prune_inferiors() manually at this point (from a top gdb) does
    not remove them, because they still have inf->pid != 0 (while they
    shouldn't).  This suggests that we never mourned those inferiors.
    
    Enabling logs (master + previous patch) we see:
    
     ...
     WL: waitpid Thread 0x7ffff7fc2740 (LWP 9513) received Trace/breakpoint trap (stopped)
     WL: Handling extended status 0x03057f
     LHEW: Got clone event from LWP 9513, new child is LWP 9579
     [New Thread 0x7ffff37b8700 (LWP 9579)]
     WL: waitpid Thread 0x7ffff7fc2740 (LWP 9508) received 0 (exited)
     WL: Thread 0x7ffff7fc2740 (LWP 9508) exited.
    			    ^^^^^^^^
     [Thread 0x7ffff7fc2740 (LWP 9508) exited]
     WL: waitpid Thread 0x7ffff7fc2740 (LWP 9499) received 0 (exited)
     WL: Thread 0x7ffff7fc2740 (LWP 9499) exited.
     [Thread 0x7ffff7fc2740 (LWP 9499) exited]
     RSRL: resuming stopped-resumed LWP Thread 0x7ffff37b8700 (LWP 9579) at 0x3615ef4ce1: step=0
     ...
     (gdb) info inferiors
       Num  Description       Executable
       5    process 9508      /home/pedro/bugs/src/test
    		^^^^
       4    process 9503      /home/pedro/bugs/src/test
       3    process 9500      /home/pedro/bugs/src/test
       2    process 9499      /home/pedro/bugs/src/test
     * 1    <null>            /home/pedro/bugs/src/test
     (gdb)
     ...
    
    Note the "Thread 0x7ffff7fc2740 (LWP 9508) exited." line.
    That's this in wait_lwp:
    
          /* Check if the thread has exited.  */
          if (WIFEXITED (status) || WIFSIGNALED (status))
    	{
    	  thread_dead = 1;
    	  if (debug_linux_nat)
    	    fprintf_unfiltered (gdb_stdlog, "WL: %s exited.\n",
    				target_pid_to_str (lp->ptid));
    	}
        }
    
    That was the leader thread reporting an exit, meaning the whole
    process is gone.  So the problem is that this code doesn't understand
    that a WIFEXITED status of the leader LWP should be reported to
    infrun as process exit.
    
    gdb/ChangeLog:
    2015-07-30  Pedro Alves  <palves@redhat.com>
    
    	PR threads/18600
    	* linux-nat.c (wait_lwp): Report to the core when thread group
    	leader exits.
    
    gdb/testsuite/ChangeLog:
    2015-07-30  Pedro Alves  <palves@redhat.com>
    
    	PR threads/18600
    	* gdb.threads/fork-plus-threads.exp: Test that "info inferiors"
    	only shows inferior 1.
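
The distinction drawn in the commit message above (an exit reported by the thread group leader means the whole process is gone) can be sketched as a small classifier. This is an illustration only, not the actual change to wait_lwp: `toy_classify_exit` and `enum toy_exit_kind` are hypothetical names, and the sketch relies on the Linux convention that the leader's LWP id equals the process id.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>

enum toy_exit_kind
{
  TOY_NOT_EXITED,       /* STATUS is not an exit status.  */
  TOY_THREAD_EXITED,    /* A non-leader thread is gone.  */
  TOY_PROCESS_EXITED    /* The leader is gone: report a process exit.  */
};

/* Classify a waitpid STATUS collected for LWP, which belongs to the
   process PID.  */
enum toy_exit_kind
toy_classify_exit (pid_t pid, pid_t lwp, int status)
{
  assert (pid > 0 && lwp > 0);

  if (!WIFEXITED (status) && !WIFSIGNALED (status))
    return TOY_NOT_EXITED;

  /* On Linux the thread group leader has LWP id == process id.  */
  return lwp == pid ? TOY_PROCESS_EXITED : TOY_THREAD_EXITED;
}
```

With the ids from the log above, `toy_classify_exit (9508, 9508, 0)` yields `TOY_PROCESS_EXITED`, which is the case the old code treated as an ordinary thread exit.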
Comment 4 cvs-commit@gcc.gnu.org 2015-07-30 18:05:23 UTC
The gdb-7.10-branch branch has been updated by Pedro Alves <palves@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=b31f5e79aac67f61a684c3af281caebd2ac1eece

commit b31f5e79aac67f61a684c3af281caebd2ac1eece
Author: Pedro Alves <palves@redhat.com>
Date:   Thu Jul 30 18:55:36 2015 +0100

    PR threads/18600: Threads left stopped after fork+thread spawn
    
Comment 5 cvs-commit@gcc.gnu.org 2015-07-30 18:05:28 UTC
The gdb-7.10-branch branch has been updated by Pedro Alves <palves@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=7476be08b73fdfba4eb91d891b235d4cf2e70f3b

commit 7476be08b73fdfba4eb91d891b235d4cf2e70f3b
Author: Pedro Alves <palves@redhat.com>
Date:   Thu Jul 30 18:55:36 2015 +0100

    PR threads/18600: Inferiors left around after fork+thread spawn
    
Comment 6 Pedro Alves 2015-07-30 18:26:22 UTC
Fixed.
Comment 7 cvs-commit@gcc.gnu.org 2015-08-06 14:02:19 UTC
The master branch has been updated by Pedro Alves <palves@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=f0ce0d3a331129309a46a6a9ac85fce35acae72b

commit f0ce0d3a331129309a46a6a9ac85fce35acae72b
Author: Pedro Alves <palves@redhat.com>
Date:   Thu Jul 23 16:01:01 2015 +0100

    gdbserver: move_out_of_jump_pad_callback misses switching current thread
    
    While hacking on the fix for PR threads/18600 (Threads left stopped
    after fork+thread spawn), I once saw its test (fork-plus-threads.exp)
    FAIL against gdbserver because move_out_of_jump_pad_callback has a
    gdb_breakpoint_here call, and the caller isn't making sure the current
    thread points to the right thread.  In the case I saw, the current
    thread pointed to the wrong process, so gdb_breakpoint_here returned
    the wrong answer.  Unfortunately I didn't save logs.  Still, seems
    obvious enough and it should fix a potential occasional racy FAIL.
    
    Tested on x86_64 Fedora 20.
    
    gdb/gdbserver/ChangeLog:
    2015-08-06  Pedro Alves  <palves@redhat.com>
    
    	* linux-low.c (move_out_of_jump_pad_callback): Temporarily switch
    	the current thread.
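
The "temporarily switch the current thread" pattern mentioned in the ChangeLog entry above can be sketched as a save/switch/restore around the call that depends on the implicit current thread. This is a toy model, not gdbserver's code: `toy_current_thread`, `toy_breakpoint_here` and `toy_callback` are hypothetical stand-ins, and the breakpoint lookup is reduced to one address per process.

```c
#include <assert.h>
#include <stddef.h>

struct toy_process
{
  int pid;
  unsigned long breakpoint_addr;   /* The single breakpoint of this process.  */
};

struct toy_thread
{
  struct toy_process *proc;
};

/* Implicit global context, consulted by toy_breakpoint_here.  */
struct toy_thread *toy_current_thread;

/* Returns 1 if the process of the *current* thread has a breakpoint
   at ADDR.  The answer is wrong if the current thread belongs to a
   different process than the caller expects.  */
int
toy_breakpoint_here (unsigned long addr)
{
  assert (toy_current_thread != NULL);
  return toy_current_thread->proc->breakpoint_addr == addr;
}

/* Per-thread callback: make THREAD current for the duration of the
   lookup, then restore whatever was current before.  */
int
toy_callback (struct toy_thread *thread, unsigned long pc)
{
  struct toy_thread *saved = toy_current_thread;
  int result;

  toy_current_thread = thread;
  result = toy_breakpoint_here (pc);
  toy_current_thread = saved;
  return result;
}
```

Saving and restoring the previous value keeps the callback free of side effects on its caller, whichever thread happened to be current beforehand.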
Comment 8 cvs-commit@gcc.gnu.org 2015-08-06 14:03:18 UTC
The gdb-7.10-branch branch has been updated by Pedro Alves <palves@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=a01c46276aa3a16c2ac1dae249df14e39d1c281f

commit a01c46276aa3a16c2ac1dae249df14e39d1c281f
Author: Pedro Alves <palves@redhat.com>
Date:   Thu Aug 6 15:02:26 2015 +0100

    gdbserver: move_out_of_jump_pad_callback misses switching current thread
    