17028 – GDB+GDBserver hangs on Windows waiting for stop event since target-async on by default

Bug 17028 - GDB+GDBserver hangs on Windows waiting for stop event since target-async on by default

Summary: GDB+GDBserver hangs on Windows waiting for stop event since target-async on b...

Status:	RESOLVED FIXED

Alias:	None

Product:	gdb
Classification:	Unclassified
Component:	remote (show other bugs)
Version:	HEAD

Importance:	P2 normal
Target Milestone:	7.8
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2014-06-05 20:57 UTC by Joel Brobecker
Modified:	2014-06-11 10:18 UTC (History)
CC List:	1 user (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:

Attachments
source file. (79 bytes, text/x-csrc) 2014-06-05 20:57 UTC, Joel Brobecker	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Joel Brobecker 2014-06-05 20:57:43 UTC

Created attachment 7630 [details]
source file.

Since target-async was turned on by default, debugging on x86-windows using GDB+GDBserver sometimes hangs while waiting for a command that resumed the program's execution to report that the inferior stopped. The problem seems to be very racy in nature and does not allways appear, or not always at the same point in the debugging scenario (probability is about 70-80% of the time when debug traces are not enabled, it becomes much lower when GDBserver's --debug option is used, and nearly zero when GDBserver's --remote-debug option is used; enabling debug traces on the GDB side seem to have little effect on the probability).

I could reprorduce the problem using the attached file (simple.c), compiled with GCC on x86-windows:

   % gcc -o simple -g simple.c

In one shell, I started GDBserver:

   % /tardes.a/tmp/brobecke/bld/gdb-public/gdb/gdbserver/gdbserver.exe :4444 simple

In another, I started GDB, set a breakpoint on the first call to "increment", connected to the target, ran to that breakpoint, and then I do as many "next" as possible until the debugger hangs. For instance:

    (gdb) b simple.c:12
    Breakpoint 1 at 0x401610: file simple.c, line 12.
    (gdb) tar rem :4444
    Remote debugging using :4444
    0x77d7103c in ntdll!RtlUpcaseUnicodeToOemN ()
       from C:\Windows\SysWOW64\ntdll.dll
    (gdb) c
    Continuing.

    Breakpoint 1, main () at simple.c:12
    12        increment (&i);
    (gdb) n
    13        increment (&i);
    (gdb) n
    14        increment (&i);
    (gdb) n
    [hangs]

After having added ad hoc traces in GDBserver, and turning infrun+remote traces on the GDB side, it looks like GDB is not receiving a T05 packet that GDBserver is sending to signal that the step is done.

This is what we see for a faulty next on the GDB side:

    (gdb) n
    infrun: clear_proceed_status_thread (Thread 4424)
    infrun: proceed (addr=0xffffffff, signal=GDB_SIGNAL_DEFAULT, step=1)
    Sending packet: $m401610,1#f6...Packet received: 8d
    Sending packet: $X401610,1:�#e7...Packet received: OK
    Sending packet: $m763c4b72,1#cc...Packet received: 55
    Sending packet: $X763c4b72,1:�#bd...Packet received: OK
    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 4424] at 0x401634
    Sending packet: $vCont;s:1148;c#5e...

On the GDBserver side, this is what we see:

    DEBUG: write_prim ($OK#9a)
           -> 6
    handling possible serial event
    DEBUG: write_prim ($8d#9c)
           -> 6
    handling possible serial event
    DEBUG: write_prim ($OK#9a)
           -> 6
    handling possible serial event
    DEBUG: write_prim ($55#6a)
           -> 6
    handling possible serial event
    DEBUG: write_prim ($OK#9a)
           -> 6
    handling possible serial event
    gdbserver: kernel event EXCEPTION_DEBUG_EVENT for pid=4968 tid=1148
    EXCEPTION_SINGLE_STEP
    Child Stopped with signal = 5 
    Writing resume reply for LWP 4968.4424:1
    DEBUG: write_prim ($T0505:c8fe2800;04:a0fe2800;08:38164000;thread:1148;#f0)
           -> 55

The "DEBUG: write_prim" traces show that write_prim was called and what value it returned. In particular for our last vCont, it eventually sent a T05 packet, which GDB is waiting for.

At this point, attaching a debugger to the hanging GDB confirms that it is blocked, waiting for a socket event:

    #6  0x757841d8 in WaitForMultipleObjects ()
       from C:\Windows\syswow64\kernel32.dll
    #7  0x004708e7 in gdb_select (n=469, readfds=0x88ca50 <gdb_notifier+784>, 
        writefds=0x88cb54 <gdb_notifier+1044>, 
        exceptfds=0x88cc58 <gdb_notifier+1304>, timeout=0x0)
        at /[...]/gdb/mingw-hdep.c:172
    #8  0x00527926 in gdb_wait_for_event (block=1)
        at /[...]/gdb/event-loop.c:831
    #9  0x00526ff1 in gdb_do_one_event ()
        at /[...]/gdb/event-loop.c:403

Not sure why the event does not show up on the GDB side, though, and why target-async makes any difference.

Comment 1 Pedro Alves 2014-06-06 14:45:34 UTC

> Not sure why the event does not show up on the GDB side.
> though, and why target-async makes any difference.

I haven't tried debugging this myself, but I'm clueless atm too.

I'd suspect some gdb_select issue, but seems we block
in gdb_select too on !target-async, within
do_ser_base_readchar.

I'd add some debug output to mingw-hdep.c:gdb_select,
right after the WaitForMultipleObjects, to check whether
something odd is doing on, like actually waking up from
WaitForMultipleObjects once, but then not doing anything
with the result, and then blocking again (forever).

I was also wondering whether it'd be easy to make gdb_select
call the real select, to try to isolate/identify the
problem.  Note: I think ctrl-c would stop working properly.

Comment 2 Pedro Alves 2014-06-06 14:47:27 UTC

I have to say I'm quite surprised by this, given remote non-stop (+async) has been known to work on Windows hosts for a long while.

Comment 3 brobecker 2014-06-06 16:18:20 UTC

> I have to say I'm quite surprised by this, given remote non-stop (+async) has
> been known to work on Windows hosts for a long while.

I don't know, never tried it before.

One of the differences that "set debug serial 1" seems to highlight
is the fact that, in target-async mode, GDB seems to be making
adjustments to the serial descriptor. I see things like this
in the output (the "[fd468->...]" notifications).

| Sending packet: $vCont;s:2930#c0...[
| w $][v][C][o][n][t][;][s][:][2][9][3][0][#][c][0][fd468->asynchronous]
| [fd468->fd-scheduled]
| [fd468->timer-scheduled]
| [
| r $][T][0][5][0][5][:][c][8][f][e][2][8][0][0][;][0][4][:][a][0][f][e][2][8][0][0][;][0][8][:][1][4][1][6][4][0][0][0][;][t][h][r][e][a][d][:][2][9][3][0][;][#][e][fd468->fd-scheduled]
| [a]Packet received (fd-468): T0505:c8fe2800;04:a0fe2800;08:14164000;thread:2930;

On a vCont that does not work, I see:

| Sending packet: $vCont;s:114c;c#89...[
| w $][v][C][o][n][t][;][s][:][1][1][4][c][;][c][#][8][9][fd468->asynchronous]
| [fd468->fd-scheduled]

Note the absence of changing the state to timer-scheduled.

I don't know this code much. You? I am not sure whether this code
is in fact relevant to this issue or not; I just started looking
at it because I wanted to find a way to get to the file descriptor
number for the remote connection.

Comment 4 Pedro Alves 2014-06-06 16:25:38 UTC

I'm familiar with it, but a lot of the details have flushed out of cache.  :-)

I'll try setting up an environment to debug this.

Comment 5 brobecker 2014-06-06 23:29:39 UTC

I am discovering quite a bit with this PR, and one of the things
I discovered is that we use a thread to perform select calls.
That might explains the racy nature of this bug, and why adding
traces ahead of the call to WaitForMultipleObjects seems to hide
the issue.  Putting some small traces in net_windows_wait_handle
seems to indicate that sometimes the answer comes back fast enough
that we don't need to start the select thread, while other times
we do, and it's completely random - we must be right around
the edge. It doesn't explain yet why turning target-async mode on
would make any difference.

Comment 6 Pedro Alves 2014-06-09 16:04:13 UTC

I changed the test to loop forever, like:

void
increment (int *i)
{
  *i++;
}

int
main (void)
{
  int i = 0;

 again:
  increment (&i);
  increment (&i);
  increment (&i);
  increment (&i);
  increment (&i);
  increment (&i);
  increment (&i);
  increment (&i);
  increment (&i);
  increment (&i);
  increment (&i);
  increment (&i);
  increment (&i);
  increment (&i);
  increment (&i);
  increment (&i);
  increment (&i);
  increment (&i);
  increment (&i);
  increment (&i);
  increment (&i);
  goto again;
}

And I managed to reproduce it too:

 Sending packet: $vCont;s:9c0;c#5c...
 *hang*

This is on Windows 7, 64-bit, in case we're hitting a Windows bug.

Comment 7 Pedro Alves 2014-06-09 16:05:40 UTC

> I am discovering quite a bit with this PR, and one of the things
> I discovered is that we use a thread to perform select calls.

Yeah, that's actually why I was thinking of trying to make gdb_select call select directly, since we're using sockets here (gdb_select is needed for things _other_ than sockets).  Not as a fix, but as a way to check that that's where the bug lies.

Comment 8 Pedro Alves 2014-06-09 16:20:09 UTC

I tried now reading commands from a script, to check whether stdin getting signalled as ready for gdb_select too early might have something to do with it.
But I can still trigger the issue that way:

$ cat test.cmd
tar rem :9999
tb increment
c

set debug remote 1
while 1
  next
  end

$ gdb ... -x test.cmd

...
19        increment (&i);
Sending packet: $vCont;s:610;c#27...Packet received: T0505:c8fe2800;04:a0fe2800;08:d0154000;thread:610;
Sending packet: $vCont;s:610;c#27...Packet received: T0505:c8fe2800;04:a0fe2800;08:d3154000;thread:610;
Sending packet: $vCont;s:610;c#27...
*hang*

Well, at least it's a little easier to reproduce that way.  :-)

Comment 9 Pedro Alves 2014-06-09 16:25:26 UTC

I confirm that I can't reproduce this with "maint set target-async off".

Comment 10 Pedro Alves 2014-06-09 17:07:22 UTC

> This is on Windows 7, 64-bit, in case we're hitting a Windows bug.

Actually, I was using 64-bit Windows, but my GDB and test program were both compiled as 32-bit.  Tried now real 64-bit, and I can still reproduce it, so WoW64 it not the culprit.

Comment 11 Pedro Alves 2014-06-09 21:07:55 UTC

Oh, I think I found something.  I noticed that the N argument to gdb_select was quite high.  So I added an assertion to check that we're not doing undefined things with select and the FD_SET, etc. macros.  See "man 3 select":

"The behavior of these macros is undefined if the fd argument is less than 0 or greater than or equal to FD_SETSIZE, or if fd is not a valid file descriptor,  or if any of the arguments are expressions with side effects."

diff --git c/gdb/event-loop.c w/gdb/event-loop.c
index 5999c97..662deb2 100644
--- c/gdb/event-loop.c
+++ w/gdb/event-loop.c
@@ -566,6 +566,8 @@ create_file_handler (int fd, int mask, handler_func * proc,
        }
       else
        {
+         gdb_assert (fd >= 0 && fd < FD_SETSIZE);
+
          if (mask & GDB_READABLE)
            FD_SET (fd, &gdb_notifier.check_masks[0]);
          else
diff --git c/gdb/gdbserver/event-loop.c w/gdb/gdbserver/event-loop.c
index dc1eb41..1512e32 100644
--- c/gdb/gdbserver/event-loop.c
+++ w/gdb/gdbserver/event-loop.c
@@ -300,6 +300,10 @@ create_file_handler (gdb_fildes_t fd, int mask, handler_func *proc,
       file_ptr->next_file = gdb_notifier.first_file_handler;
       gdb_notifier.first_file_handler = file_ptr;
 
+      fprintf (stderr, "fd=%d, FD_SETSIZE=%d\n", fd, FD_SETSIZE);
+
+      gdb_assert (fd >= 0 && fd < FD_SETSIZE);
+
       if (mask & GDB_READABLE)
        FD_SET (fd, &gdb_notifier.check_masks[0]);
       else


And lo:

E:\gdb\mygit\build-win32\gdb\gdbserver>gdbserver :9999 ../simple32.exe
Process ../simple32.exe created; pid = 1244
Listening on port 9999
fd=116, FD_SETSIZE=64
../../../src/gdb/gdbserver/event-loop.c:305: A problem internal to GDBserver has been detected.
create_file_handler: Assertion `fd >= 0 && fd < FD_SETSIZE' failed.

E:\gdb\mygit\build-win32\gdb\gdbserver>

Comment 12 Pedro Alves 2014-06-09 22:32:58 UTC

Nope, red herring.  That assertion isn't valid for Windows.  Socket descriptors are naturally high, and the FD_XXX macros handle it.  Please the first argument to 'select' is actually ignored...

Comment 13 Pedro Alves 2014-06-10 00:30:04 UTC

I think I found the race.  I'd added some debug logs to ser-mingw.c, and this is what I see:

w $][v][C][o][n][t][;][s][:][1][0][5][c][;][c][#][8][9][fd364->asynchronous]
[fd364->fd-scheduled]
net_windows_select_thread: WaitForMultipleObjects blocking     
net_windows_select_thread: WaitForMultipleObjects returned 1   (***)
gdb_select: WaitForMultipleObjects blocking (t=0)
gdb_select: WaitForMultipleObjects returned 102
gdb_select: WAIT_TIMEOUT
net_windows_select_thread: WaitForMultipleObjects blocking
gdb_select: WaitForMultipleObjects blocking (INFINITE)

Notice the line at (***), just above.  That indicates that the select thread noticed data was available.  But, gdb_select was called with timeout == 0, IOW, it was just a non-blocking poll ("is there data ready now?").  Recall that is running on a separate thread.  That WaitForMultipleObjects manages to run before net_windows_select_thread sets state->base.read_event, and returns WAIT_TIMEOUT.  GDB then goes back to the event loop, and calls a blocking/infinite gdb_select.  This loops through all fds, and calls 
net_windows_wait_handle.  Recall that state->base.read_event is set, with data waiting to be consumed.  But look, net_windows_wait_handle clears 
state->base.read_event!

This #if 0 seems to fix it for me:

static void
net_windows_wait_handle (struct serial *scb, HANDLE *read, HANDLE *except)
{
  struct net_windows_state *state = scb->state;
  /* Start from a clean slate.  */
#if 0
  ResetEvent (state->base.read_event);
  ResetEvent (state->base.except_event);
#endif
  ResetEvent (state->base.stop_select);

I haven't thought through consequences, but seems quite promising.

Comment 14 brobecker 2014-06-10 06:27:24 UTC

> I think I found the race.
[...]
> This #if 0 seems to fix it for me:
> 
> static void
> net_windows_wait_handle (struct serial *scb, HANDLE *read, HANDLE *except)
> {
>   struct net_windows_state *state = scb->state;
>   /* Start from a clean slate.  */
> #if 0
>   ResetEvent (state->base.read_event);
>   ResetEvent (state->base.except_event);
> #endif
>   ResetEvent (state->base.stop_select);
> 
> I haven't thought through consequences, but seems quite promising.

Indeed! I had seen that piece of code and questionned it, but somehow
convinced myself that the reset was a necessary initialization and
that it was always run before the select...

Comment 15 Pedro Alves 2014-06-10 18:20:34 UTC

Patch posted: https://sourceware.org/ml/gdb-patches/2014-06/msg00424.html

Comment 16 Sourceware Commits 2014-06-11 10:06:38 UTC

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "gdb and binutils".

The branch, master has been updated
       via  364fe1f72265eb54bce08511233d06ed48e9c41a (commit)
      from  7ed689ad61de0cbfe4e5a6f18f097776128202e4 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=364fe1f72265eb54bce08511233d06ed48e9c41a

commit 364fe1f72265eb54bce08511233d06ed48e9c41a
Author: Pedro Alves <palves@redhat.com>
Date:   Wed Jun 11 11:04:31 2014 +0100

    PR remote/17028: GDB+GDBserver hangs on Windows
    
    Since target-async was turned on by default, debugging on Windows
    using GDB+GDBserver sometimes hangs while waiting for a RSP reply.
    
    The problem is a race in the gdb_select machinery.
    
    This is what we see for a faulty next on the GDB side:
    
        (gdb) n
        infrun: clear_proceed_status_thread (Thread 4424)
        infrun: proceed (addr=0xffffffff, signal=GDB_SIGNAL_DEFAULT, step=1)
        (...)
        infrun: resume (step=1, signal=GDB_SIGNAL_0), ...
        Sending packet: $vCont;s:1148;c#5e...
        *hang*
    
    At this point, attaching a debugger to the hanging GDB confirms that
    it is blocked, waiting for a socket event:
    
        #6  0x757841d8 in WaitForMultipleObjects ()
           from C:\Windows\syswow64\kernel32.dll
        #7  0x004708e7 in gdb_select (n=469, readfds=0x88ca50 <gdb_notifier+784>,
            writefds=0x88cb54 <gdb_notifier+1044>,
            exceptfds=0x88cc58 <gdb_notifier+1304>, timeout=0x0)
            at /[...]/gdb/mingw-hdep.c:172
        #8  0x00527926 in gdb_wait_for_event (block=1)
            at /[...]/gdb/event-loop.c:831
        #9  0x00526ff1 in gdb_do_one_event ()
            at /[...]/gdb/event-loop.c:403
    
    However, on the GDBserver side, we see that GDBserver already sent a
    T05 packet reply:
    
        gdbserver: kernel event EXCEPTION_DEBUG_EVENT for pid=4968 tid=1148
        EXCEPTION_SINGLE_STEP
        Child Stopped with signal = 5
        Writing resume reply for LWP 4968.4424:1
        DEBUG: write_prim ($T0505:c8fe2800;04:a0fe2800;08:38164000;thread:1148;#f0)
               -> 55
    
    To recap, on Windows, 'select' only works with sockets, so we have a
    wrapper, gdb_select, that uses the GDB serial abstraction to handle
    sockets, consoles, pipes, and serial ports.  Each serial descriptor
    has a thread associated (we call those the select threads), and those
    threads communicate with the main thread by means of standard Windows
    events.
    
    It basically goes like this: gdb_select first loops through all fds of
    interest, calling their wait_handle hooks, which returns an event that
    WaitForMultipleObjects can wait on.  gdb_select then blocks in
    WaitForMultipleObjects with all those event handles.  The wait_handle
    hook is responsible for arranging for the returned event to become set
    once data is available.  This is done by setting the descriptor's
    helper thread running, which itself knows how to wait for data from
    the type of handle it manages (sockets, pipes, consoles, files, etc.).
    Once data arrives, the select thread sets the corresponding event
    which unblocks WaitForMultipleObjects within gdb_select.  However, the
    wait_handle hook can also apply an optimization: if data is already
    pending, then there's no need to set the thread running, and the
    descriptors event can be set immediately.  It's around this latter
    aspect that lies the bug/race.
    
    Adding some ad hoc debug logs to ser-mingw.c and mingw-hdep.c, we see
    the following sequence of events, right after sending
    "$vCont;s:1148;c#5e".  Thread 1 is the main thread, and thread 2 is
    the socket's helper/select thread.  gdb_select was only passed one
    descriptor to wait on, the remote target's socket.
    net_windows_select_thread is the entry point of the select threads for
    sockets.
    
     #1 - thread 1: gdb_select: enter
     #2 - thread 2: net_windows_select_thread: WaitForMultipleObjects blocking
    
    gdb_select walked over the wait_handle hooks, and woke up the socket's
    helper thread.  The helper thread is now blocked waiting for socket
    events.
    
     #3 - thread 1: gdb_select: WaitForMultipleObjects polling (timeout=0ms)
     #4 - thread 1: gdb_select: WaitForMultipleObjects returned 102 (WAIT_TIMEOUT)
    
    There was no pending data available yet, and gdb_select was passed
    timeout==0ms, and so WaitForMultipleObjects times out immediately.
    
     #5 - thread 2: net_windows_select_thread: WaitForMultipleObjects returned 1
    
    Just afterwards, socket data arrives, and thread 2 wakes up.  Thread 2
    calls WSAEnumNetworkEvents, which clears state->sock_event, and marks
    the serial's read_event event, telling the main thread that data is
    available.
    
     #6 - thread 1: gdb_select: call serial_done_wait_handle on each serial
    
    gdb_select stops all the helper/select threads.
    
     #7 - thread 1: gdb_select: return 0 (WAIT_TIMEOUT)
    
    gdb_select in the main thread returns to the caller.
    
    Note that at this point, data is pending on the socket, the serial's
    read_event is set, but the socket's sock_event event is not set, until
    _further_ data arrives.
    
    Now GDB does its thing and goes back to the event loop.  That calls
    gdb_select, but with timeout==INFINITE.
    
    Again, gdb_select calls the socket serial's wait_handle hook.  It
    first clears its events, starting from a clean slate:
    
      ResetEvent (state->base.read_event);
      ResetEvent (state->base.except_event);
      ResetEvent (state->base.stop_select);
    
    That cleared read_event, which was previously set in #5 above.  And
    then it checks for pending events, in the sock_event event:
    
      /* Check any pending events.  This both avoids starting the thread
         unnecessarily, and handles stray FD_READ events (see below).  */
      if (WaitForSingleObject (state->sock_event, 0) == WAIT_OBJECT_0)
        {
    
    That also fails because state->sock_event was cleared in #5 too...
    
    So the wait_handle hook erroneously decides that it needs to start the
    helper thread to wait for input:
    
     #8 - thread 2: net_windows_select_thread: WaitForMultipleObjects blocking
     #9 - thread 1: gdb_select: WaitForMultipleObjects blocking (INFINITE)
    
    But, GDBserver already sent all it had to send, so both threads waits
    forever...
    
    At first I thought that net_windows_wait_handle shouldn't be resetting
    state->base.read_event or state->base.except_event, but looking
    deeper, the pipe and console wait_handle hooks reset all events too.
    It actually makes sense that way -- consuming an event from different
    threads is bad practice, and, we should always be able to query
    pending state without looking at the state->sock_event from within
    net_windows_wait_handle.  The end result is much simpler, and makes
    net_windows_select_thread look a lot like console_select_thread,
    actually.
    
    gdb/
    2014-06-11  Pedro Alves  <palves@redhat.com>
    
    	PR remote/17028
    	* ser-mingw.c (net_windows_socket_check_pending): New function.
    	(net_windows_select_thread): Ignore spurious wakeups.  Use
    	net_windows_socket_check_pending.
    	(net_windows_wait_handle): Check for pending events with
    	ioctlsocket, through net_windows_socket_check_pending, instead of
    	checking the socket's event.

-----------------------------------------------------------------------

Summary of changes:
 gdb/ChangeLog   |   10 ++++
 gdb/ser-mingw.c |  141 +++++++++++++++++++++++++++----------------------------
 2 files changed, 79 insertions(+), 72 deletions(-)

Comment 17 Pedro Alves 2014-06-11 10:18:31 UTC

Fixed.