Bug 18127 - threads spawned by infcall end up stuck in "running" state
Status: RESOLVED FIXED
Alias: None
Product: gdb
Classification: Unclassified
Component: threads
Version: HEAD
Importance: P2 normal
Target Milestone: 7.10
Assignee: Not yet assigned to anyone
 
Reported: 2015-03-14 15:58 UTC by Pedro Alves
Modified: 2015-06-29 18:57 UTC


Attachments
attachment-98964-0.html (1.06 KB, text/html)
2015-05-05 18:05 UTC, richard.sharman

Description Pedro Alves 2015-03-14 15:58:12 UTC
Ref: https://sourceware.org/ml/gdb/2015-03/msg00033.html

Calling a function that spawns new threads results in the new threads getting stuck in "running" state.

On GNU/Linux, with a trivial program that contains:

~~~
void
start_thread (void)
{
  pthread_t thread;

  pthread_create (&thread, NULL, thread_function, NULL);
}
~~~

calling that from GDB results in:

(gdb) p start_thread ()
[New Thread 0x7ffff7fc1700 (LWP 9903)]
$1 = void
(gdb) info threads
  Id   Target Id         Frame
  2    Thread 0x7ffff7fc1700 (LWP 9903) "start-thread-in" (running)
* 1    Thread 0x7ffff7fc2740 (LWP 9899) "start-thread-in" main () at start-thread-infcall.c:35
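
For anyone who wants to reproduce this, a minimal sketch of the rest of the program might look like the following. The report shows only start_thread; the body of thread_function and the threads_started counter are assumptions added here for illustration, not taken from the original test file.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

/* Hypothetical counter, not in the report: lets a caller observe that
   the spawned thread really started.  */
static atomic_int threads_started;

/* Assumed body: the report does not show thread_function.  It loops
   forever so the thread stays alive and remains visible in
   "info threads".  */
static void *
thread_function (void *arg)
{
  (void) arg;
  atomic_fetch_add (&threads_started, 1);
  for (;;)
    usleep (1000);
}

/* Same shape as the function in the report, plus a check of the
   pthread_create result.  */
void
start_thread (void)
{
  pthread_t thread;
  int res;

  res = pthread_create (&thread, NULL, thread_function, NULL);
  assert (res == 0);
}
```

Build with debug info and thread support (for example, gcc -g -pthread), add a main to stop in, and run "p start_thread ()" from a breakpoint in that main.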
Comment 1 Richard Sharman 2015-05-05 15:31:57 UTC
I have seen something similar: when stopped at a breakpoint and printing a variable that has a [python] pretty printer (with scheduler-locking off), quite often other threads run before the printing is finished. If any of these threads create other threads, then the newly created threads (and, for some reason, some existing threads) appear as running.


The following shows a normal "info thread" with all threads stopped, but
after "print msg" (msg is a structure that contains a pretty-printed field),
"info thread" shows many threads running, and not just the new ones. At this point we cannot continue.


(gdb) c
Continuing.
[New Thread 0xf70e3930 (LWP 7456)]
[Switching to Thread 0xf71a67b0 (LWP 7427)]

Breakpoint 1, ccsend (msg=...) at cgen/message0.cc:498
(gdb) info thread
  Id   Target Id         Frame 
  19   Thread 0xf70e3930 (LWP 7456) "McastAudit" 0xf7ffd430 in __kernel_vsyscall ()
  18   Thread 0xf7105930 (LWP 7452) "McastAudit" 0xf7ffd430 in __kernel_vsyscall ()
  17   Thread 0xf70ed7b0 (LWP 7453) "auditmgr       " 0xf7ffd430 in __kernel_vsyscall ()
  16   Thread 0xf7134930 (LWP 7451) "McastAudit" 0xf7ffd430 in __kernel_vsyscall ()
  15   Thread 0xf078d7b0 (LWP 7447) "debug_term     " 0xf7ffd430 in __kernel_vsyscall ()
  14   Thread 0xf71527b0 (LWP 7448) "auditwork      " 0xf7ffd430 in __kernel_vsyscall ()
  13   Thread 0xf7148930 (LWP 7449) "tSsuSrvHS" 0x0072b505 in dl_open_worker () from /lib/ld-linux.so.2
  12   Thread 0xf713e930 (LWP 7450) "McastAudit" 0xf7ffd430 in __kernel_vsyscall ()
  11   Thread 0xf07977b0 (LWP 7446) "maint_term     " 0xf7ffd430 in __kernel_vsyscall ()
  10   Thread 0xf7ffa7b0 (LWP 7441) "dbgmtterm      " 0xf7ffd430 in __kernel_vsyscall ()
  9    Thread 0xf719c7b0 (LWP 7442) "cmsgsysin      " 0xf7ffd430 in __kernel_vsyscall ()
  8    Thread 0xf71667b0 (LWP 7443) "auditmgr       " 0xf7ffd430 in __kernel_vsyscall ()
  7    Thread 0xf715c930 (LWP 7444) "tMainApplQ" 0xf7ffd430 in __kernel_vsyscall ()
  6    Thread 0xf13bc7b0 (LWP 7423) "dispatch       " 0xf7ffd430 in __kernel_vsyscall ()
  5    Thread 0xf7ff07b0 (LWP 7424) "msgtimer       " DynArray<Array<unsigned short, 0, 3> >::operator[] (
    this=0x99f2c46 <msgtimer_allocation>, index=4262) at /home/gx5000/sharman/mpascaldir/d08/x86-linux/Array.h:180
  4    Thread 0xf7fe67b0 (LWP 7425) "guardian       " 0xf7ffd430 in __kernel_vsyscall ()
  3    Thread 0xf71b07b0 (LWP 7426) "cleanup        " 0xf7ffd430 in __kernel_vsyscall ()
* 2    Thread 0xf71a67b0 (LWP 7427) "sysinit        " ccsend (msg=...) at cgen/message0.cc:498
  1    Thread 0xf7741a30 (LWP 7412) "cc" 0xf7ffd430 in __kernel_vsyscall ()
(gdb) p msg
$4 = (message &) @0xf71a5a3a: {control_byte = {msg_type = reg_msg, redun = {sdrd = plane_0, rxrd = plane_0}}, tx_node = {
    group = 0 '\000', level = maincpu, {{subsystem_id = message_switch, upper_id_byte = 0 '\000', lower_id_byte = 0 '\000'}, {
        cntrlr_no = 0 '\000', card_no = 0 '\000', circuit_no = 0 '\000'}}}, rx_node = {group = 0 '\000', level = maincpu, {{
        subsystem_id = message_switch, upper_id_byte = 0 '\000', lower_id_byte = 0 '\000'}, {cntrlr_no = 0 '\000', 
        card_no = 0 '\000', circuit_no = 0 '\000'}}}, enter_lld_when_received = 0 '\000', [New Thread 0xf70d9930 (LWP 7457)]

  tx_sw = 0x70006  (7,6)	"sysinit   ", [New Thread 0xf70c5930 (LWP 7459)]
[New Thread 0xf70cf930 (LWP 7458)]
rx_sw = 0x7000d  (7,13)	"auditmgr  ", tx_applic_id = nil_applic, 
  function_code = 13 '\r', data = {storage = "\002", '\000' <repeats 18 times>}, checksum = 0 '\000', icb = 0 '\000'}
(gdb) info thread
  Id   Target Id         Frame 
  22   Thread 0xf70cf930 (LWP 7458) "McastAudit" 0xf7ffd430 in __kernel_vsyscall ()
  21   Thread 0xf70c5930 (LWP 7459) "SsuPeerSrvI" 0xf7ffd430 in __kernel_vsyscall ()
  20   Thread 0xf70d9930 (LWP 7457) "McastAudit" (running)
  19   Thread 0xf70e3930 (LWP 7456) "McastAudit" (running)
  18   Thread 0xf7105930 (LWP 7452) "McastAudit" (running)
  17   Thread 0xf70ed7b0 (LWP 7453) "auditmgr       " (running)
  16   Thread 0xf7134930 (LWP 7451) "McastAudit" (running)
  15   Thread 0xf078d7b0 (LWP 7447) "debug_term     " (running)
  14   Thread 0xf71527b0 (LWP 7448) "auditwork      " (running)
  13   Thread 0xf7148930 (LWP 7449) "tSsuSrvHS" (running)
  12   Thread 0xf713e930 (LWP 7450) "McastAudit" (running)
  11   Thread 0xf07977b0 (LWP 7446) "maint_term     " (running)
  10   Thread 0xf7ffa7b0 (LWP 7441) "dbgmtterm      " (running)
  9    Thread 0xf719c7b0 (LWP 7442) "cmsgsysin      " (running)
  8    Thread 0xf71667b0 (LWP 7443) "auditmgr       " (running)
  7    Thread 0xf715c930 (LWP 7444) "tMainApplQ" (running)
  6    Thread 0xf13bc7b0 (LWP 7423) "dispatch       " (running)
  5    Thread 0xf7ff07b0 (LWP 7424) "msgtimer       " (running)
  4    Thread 0xf7fe67b0 (LWP 7425) "guardian       " (running)
  3    Thread 0xf71b07b0 (LWP 7426) "cleanup        " (running)
* 2    Thread 0xf71a67b0 (LWP 7427) "sysinit        " (running)
  1    Thread 0xf7741a30 (LWP 7412) "cc" (running)
(gdb) c
Continuing.
Cannot execute this command while the selected thread is running.
(gdb)
Comment 2 Pedro Alves 2015-05-05 17:07:40 UTC
Thanks.   

> quite often other threads run before the printing is finished

That means the pretty printer is calling functions in the inferior, which then ends up being the same problem.

The workaround is to switch to one of the threads that GDB knows is stopped (e.g., thread 22 in your case) and continue that one, or do "stepi" -- when that step finishes, the threads' states sync up.
Comment 3 richard.sharman 2015-05-05 18:05:27 UTC
Created attachment 8301
attachment-98964-0.html

Thanks - I thought of that workaround after I'd sent mail.
On a subsequent run all threads were marked as stopped!

I forgot to mention the version of gdb I was running;  it was 7.9.

Richard


Comment 4 Eli Zaretskii 2015-06-10 15:33:15 UTC
On MS-Windows during native MinGW debugging, this issue, when it happens, makes the debugging session unusable.  MinGW native debugging doesn't support async execution, so there's no command to stop the threads that GDB considers "running", nor one to help GDB re-synchronize its notion of thread states with the actual situation (which of course is that the threads are all suspended by the OS).

Unlike in the examples brought here from Unix and GNU systems, I see this on Windows when I call functions from the inferior.  Those functions don't start any threads; the threads that trigger the problem are started by Windows for reasons unknown to me.  And because in Windows native debugging the set_running function is called with minus_one_ptid, it marks all the threads as running.

This is an acute problem that needs to be solved at least for the above configuration.
Comment 5 cvs-commit@gcc.gnu.org 2015-06-29 15:55:46 UTC
The master branch has been updated by Pedro Alves <palves@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=28bf096c62d7da6b349605f3940f4c586a850f78

commit 28bf096c62d7da6b349605f3940f4c586a850f78
Author: Pedro Alves <palves@redhat.com>
Date:   Mon Jun 29 16:07:57 2015 +0100

    PR threads/18127 - threads spawned by infcall end up stuck in "running" state
    
    Refs:
     https://sourceware.org/ml/gdb/2015-03/msg00024.html
     https://sourceware.org/ml/gdb/2015-06/msg00005.html
    
    On GNU/Linux, if an infcall spawns a thread, that thread ends up with
    stuck running state.  This happens because:
    
     - when linux-nat.c detects a new thread, it marks it as running,
       and does not report anything to the core.
    
     - we skip finish_thread_state when the thread that is running the
       infcall stops.
    
    As a result, that new thread ends up with stuck "running" state, even
    though it really is stopped.
    
    On Windows, _all_ threads end up stuck in running state, not just the
    one that was spawned.  That happens because when a new thread is
    detected, unlike linux-nat.c, windows-nat.c reports
    TARGET_WAITKIND_SPURIOUS to infrun.  It's the fact that that event
    does not cause a user-visible stop that triggers the problem.  When
    the target is re-resumed, we call set_running with a wildcard ptid,
    which marks all threads as running.  That set_running is not suppressed
    because the (leader) thread being resumed does not have in_infcall
    set.  Later, when the infcall finally finishes successfully, nothing
    marks all threads back to stopped.
    
    We can trigger the same problem on all targets by having a thread
    other than the one that is running the infcall report a breakpoint hit
    to infrun, and then have that breakpoint not cause a stop.  That's
    what the included test does.
    
    The fix is to stop GDB from suppressing the set_running calls while
    doing an infcall, and then set the threads back to stopped when the
    call finishes, iff they were originally stopped before the infcall
    started.  (Note the MI *running/*stopped event suppression isn't
    affected.)
    
    Tested on x86_64 GNU/Linux.
    
    gdb/ChangeLog:
    2015-06-29  Pedro Alves  <palves@redhat.com>
    
    	PR threads/18127
    	* infcall.c (run_inferior_call): On infcall success, if the thread
    	was marked stopped before, reset it back to stopped.
    	* infrun.c (resume): Don't suppress the set_running calls when
    	doing an infcall.
    	(normal_stop): Only discard the finish_thread_state cleanup if the
    	infcall succeeded.
    
    gdb/testsuite/ChangeLog:
    2015-06-29  Pedro Alves  <palves@redhat.com>
    
    	PR threads/18127
    	* gdb.threads/hand-call-new-thread.c: New file.
    	* gdb.threads/hand-call-new-thread.exp: New file.
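
To make the sequence in the commit message easier to follow, here is a toy model in C. It is not GDB code: toy_debugger, toy_target, toy_set_all and toy_infcall_spawning_thread are names invented for illustration. The "fixed" path is a simplification that marks every thread stopped once the call returns, provided the calling thread was stopped before the call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_THREADS 8

/* Toy stand-in for the debugger's per-thread "running" flag, i.e. what
   "info threads" would print.  Not GDB's real data structures.  */
struct toy_debugger
{
  bool running[TOY_MAX_THREADS];
  size_t count;
};

/* Whether the target stays silent about a new thread (the linux-nat.c
   case) or reports a spurious event to the core (the windows-nat.c
   case).  */
enum toy_target { TOY_SILENT_NEW_THREAD, TOY_SPURIOUS_NEW_THREAD };

/* Wildcard state change: mark every known thread running or stopped.  */
static void
toy_set_all (struct toy_debugger *d, bool running)
{
  for (size_t i = 0; i < d->count; i++)
    d->running[i] = running;
}

/* Simulate a successful hand-called function that spawns one thread.
   FIXED selects the behaviour described in the commit message.
   Returns the index of the spawned thread.  */
static size_t
toy_infcall_spawning_thread (struct toy_debugger *d, size_t caller,
                             enum toy_target target, bool fixed)
{
  bool caller_was_stopped = !d->running[caller];

  /* Old behaviour: set_running is suppressed for the infcall thread.
     Fixed behaviour: the resume is recorded like any other.  */
  if (fixed)
    toy_set_all (d, true);

  /* The called function creates a thread; the target marks it running.  */
  assert (d->count < TOY_MAX_THREADS);
  size_t spawned = d->count++;
  d->running[spawned] = true;

  /* A spurious event makes the core re-resume with a wildcard ptid,
     which marks every thread as running.  */
  if (target == TOY_SPURIOUS_NEW_THREAD)
    toy_set_all (d, true);

  /* The call returns and every thread is really stopped again.  The old
     behaviour skips the state reset; the fixed behaviour restores
     "stopped" only if the caller was stopped before the call.  */
  if (fixed && caller_was_stopped)
    toy_set_all (d, false);

  return spawned;
}
```

With fixed set to false and a stopped caller, the silent case leaves only the new thread flagged as running (the GNU/Linux symptom), while the spurious case leaves every thread flagged as running (the Windows symptom). With fixed set to true, all threads end up stopped in both cases.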
Comment 6 Pedro Alves 2015-06-29 18:57:59 UTC
Fixed.