26199 – GDB goes in busy loop when interrupting non-stop program

Summary: GDB goes in busy loop when interrupting non-stop program

Status:	RESOLVED FIXED

Alias:	None

Product:	gdb
Classification:	Unclassified
Component:	gdb
Version:	HEAD

Importance:	P2 normal
Target Milestone:	10.1
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2020-07-02 21:47 UTC by Simon Marchi
Modified:	2020-07-10 23:05 UTC
CC List:	1 user

See Also:
Host:
Target:
Build:
Last reconfirmed:

Description Simon Marchi 2020-07-02 21:47:42 UTC

When interrupting a program in non-stop, the program gets interrupted correctly, but GDB busy loops (the event loop is always woken up).

This is what I did:

1. Start GDB: ./gdb -nx --data-directory=data-directory -ex "set non-stop 1" --args  /bin/sleep 60
2. Run the program with "run"
3. Interrupt with ^C.
4. Look into htop, see GDB taking 100% CPU

Debugging `handle_file_event`, we see that the event source that wakes up the event loop is the linux-nat one:

(top-gdb) p file_ptr.proc 
$5 = (handler_func *) 0xb9cccd <handle_target_event(int, gdb_client_data)>
                                ^^^^^^^^^^^^^^^^^^^- the linux-nat callback

Debugging fetch_inferior_event and do_target_wait, we see that we don't actually call `wait` on the linux-nat target, because inferior_matches returns false:

      auto inferior_matches = [&wait_ptid] (inferior *inf)
        {
          return (inf->process_target () != NULL
                  && (threads_are_executing (inf->process_target ())
                      || threads_are_resumed_pending_p (inf))
                  && ptid_t (inf->pid).matches (wait_ptid));
        };

because `threads_are_executing` is false.

So what I'm guessing happens is:

1. User types ctrl-c, which writes to the linux-nat pipe, waking up the event source
2. linux-nat's wait gets called, the SIGINT event is returned, but before returning, it marks the pipe again, in order for wait to get called again:

   /* If we requested any event, and something came out, assume there
      may be more.  If we requested a specific lwp or process, also
      assume there may be more.  */
   if (target_is_async_p ()
       && ((ourstatus->kind != TARGET_WAITKIND_IGNORE
            && ourstatus->kind != TARGET_WAITKIND_NO_RESUMED)
           || ptid != minus_one_ptid))
     async_file_mark ();


3. The SIGINT event is handled, the program is stopped, the stop notification is printed
4. The event loop is woken up again because of the `async_file_mark` of step 2.
5. Because `inferior_matches` returns false, we never call linux-nat's wait, so the pipe stays readable.  Rinse and repeat.
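
The mechanism in steps 4 and 5 can be reproduced outside GDB. The sketch below is a toy model, not GDB code: it assumes a plain POSIX pipe stands in for the linux-nat event pipe, and the function name count_wakeups is hypothetical. One byte is written to the pipe (the "mark"), and the pipe is then polled repeatedly. If the handler never reads the byte, every poll reports the descriptor as readable.

```cpp
// Toy model only: not GDB code.  count_wakeups is a hypothetical name.
#include <cassert>
#include <poll.h>
#include <unistd.h>

// Count how many of `iterations` zero-timeout polls report the read end
// of a pipe as readable.  If `drain` is true, the handler reads the
// pending byte on each wakeup (roughly what calling the target's wait
// would do); otherwise it returns without touching the pipe.
// Returns -1 if the pipe could not be set up.
int count_wakeups (int iterations, bool drain)
{
  int fds[2];
  if (pipe (fds) != 0)
    return -1;

  // Equivalent of "marking" the event pipe once.
  char mark = '+';
  if (write (fds[1], &mark, 1) != 1)
    {
      close (fds[0]);
      close (fds[1]);
      return -1;
    }

  int wakeups = 0;
  for (int i = 0; i < iterations; i++)
    {
      pollfd pfd = { fds[0], POLLIN, 0 };
      if (poll (&pfd, 1, 0) > 0 && (pfd.revents & POLLIN) != 0)
        {
          wakeups++;
          if (drain)
            {
              char c;
              ssize_t n = read (fds[0], &c, 1);
              assert (n == 1);
              (void) n;
            }
        }
    }

  close (fds[0]);
  close (fds[1]);
  return wakeups;
}
```

With drain set to false, each of the polls wakes up, which is the busy loop; with drain set to true, only the first poll does.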

The first commit that does this is the multi-target one (5b6d1e4fa4fc6 "Multi-target support").

Comment 1 Sourceware Commits 2020-07-10 22:57:31 UTC

The master branch has been updated by Pedro Alves <palves@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=96118d114e3c53aadaf3fe5b5cf94979dbf56d87

commit 96118d114e3c53aadaf3fe5b5cf94979dbf56d87
Author: Pedro Alves <pedro@palves.net>
Date:   Fri Jul 10 23:39:34 2020 +0100

    Fix spurious unhandled remote %Stop notifications
    
    In non-stop mode, remote targets mark an async event source whose
    callback is supposed to result in calling remote_target::wait_ns to
    either process the event queue, or acknowledge an incoming %Stop
    notification.
    
    The callback in question is remote_async_inferior_event_handler, where
    we call inferior_event_handler, to end up in fetch_inferior_event ->
    target_wait -> remote_target::wait -> remote_target::wait_ns.
    
    A problem here however is that when debugging multiple targets,
    fetch_inferior_event can pull events out of any target picked at
    random, for event fairness.  This means that when
    remote_async_inferior_event_handler returns, remote_target::wait may
    have not been called at all, and thus pending notifications may have
    not been acked.  Because async event sources auto-clear, when
    remote_async_inferior_event_handler returns the async event handler is
    no longer marked, so the event loop won't automatically call
    remote_async_inferior_event_handler again to try to process the
    pending remote notifications/queue.  The result is that stop events
    may end up not processed, e.g., "interrupt -a" seemingly not managing
    to stop all threads.
    
    Fix this by making remote_async_inferior_event_handler mark the event
    handler again before returning, if necessary.
    
    Maybe a better fix would be to make async event handlers not
    auto-clear themselves, make that the responsibility of the callback,
    so that the event loop would keep calling the callback automatically.
    Or, we could try making it so that fetch_inferior_event would optionally
    handle events only for the target that it got passed down via
    parameter.  However, I don't think now just before branching is the
    time to try to do any such change.
    
    gdb/ChangeLog:
    
            PR gdb/26199
            * remote.c (remote_target::open_1): Pass remote target pointer as
            data to create_async_event_handler.
            (remote_async_inferior_event_handler): Mark async event handler
            before returning if the remote target still has either pending
            events or unacknowledged notifications.
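
The auto-clear problem described in this commit message can be illustrated with a small model. This is a sketch under stated assumptions, not the code in remote.c: toy_remote, toy_handler and run_toy_event_loop are hypothetical names, and the handler is assumed to consume at most one queued event per call, standing in for the case where one pass through fetch_inferior_event does not drain everything this target has pending.

```cpp
// Toy model only: not GDB's event loop.  All names are hypothetical.
#include <cassert>
#include <deque>

struct toy_remote
{
  std::deque<int> pending_events;  // Stop events still to be reported.
  bool handler_marked = false;     // The async event source's ready flag.
  int processed = 0;               // Events consumed so far.
};

// Callback for the event source.  It consumes at most one event per call.
void toy_handler (toy_remote &r, bool remark_if_pending)
{
  if (!r.pending_events.empty ())
    {
      r.pending_events.pop_front ();
      r.processed++;
    }

  // The fix described above: mark again before returning if work remains.
  if (remark_if_pending && !r.pending_events.empty ())
    r.handler_marked = true;
}

// Run the loop until the source is no longer marked.  The flag is
// cleared before the callback runs, which models the auto-clear
// behaviour.  Returns the number of events that were processed.
int run_toy_event_loop (int n_events, bool remark_if_pending)
{
  assert (n_events >= 0);

  toy_remote r;
  for (int i = 0; i < n_events; i++)
    r.pending_events.push_back (i);
  r.handler_marked = true;

  while (r.handler_marked)
    {
      r.handler_marked = false;
      toy_handler (r, remark_if_pending);
    }
  return r.processed;
}
```

Without the re-mark, three queued events yield one processed event and two stranded ones; with it, all three are processed.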

Comment 2 Sourceware Commits 2020-07-10 22:57:36 UTC

The master branch has been updated by Pedro Alves <palves@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=43667cc6f65e60e2c15f3bb84e45730b537db5fa

commit 43667cc6f65e60e2c15f3bb84e45730b537db5fa
Author: Pedro Alves <pedro@palves.net>
Date:   Sat Jul 4 19:12:30 2020 +0100

    Fix latent bug in target_pass_ctrlc
    
    We were checking the thr->executing of an exited thread.
    
    gdb/ChangeLog:
    
            PR gdb/26199
            * target.c (target_pass_ctrlc): Look at the inferior's non-exited
            threads, not all threads.

Comment 3 Sourceware Commits 2020-07-10 22:57:41 UTC

The master branch has been updated by Pedro Alves <palves@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=42bd97a6b1e5fa845af116ce52af1a8a3a58be7c

commit 42bd97a6b1e5fa845af116ce52af1a8a3a58be7c
Author: Pedro Alves <pedro@palves.net>
Date:   Sat Jul 4 19:31:21 2020 +0100

    Avoid constant stream of TARGET_WAITKIND_NO_RESUMED
    
    If we hit the synchronous execution command case described by
    handle_no_resumed, and handle_no_resumed determines that the event
    should be ignored, because it found a thread that is executing, we end
    up in prepare_to_wait.
    
    There, if the current target is not registered in the event loop right
    now, we call mark_infrun_async_event_handler.  With that event handler
    marked, the event loop calls again into fetch_inferior_event, which
    calls target_wait, which returns TARGET_WAITKIND_NO_RESUMED, and we
    end up in handle_no_resumed, again ignoring the event and marking
    infrun_async_event_handler.  The result is that GDB is now always
    keeping the CPU 100% busy in this loop, even though it continues to be
    able to react to input and to real target events, because we still go
    through the event-loop.
    
    The problem is the marking of the infrun_async_event_handler in
    prepare_to_wait.  That is there to handle targets that don't support
    asynchronous execution.  So the correct predicate is whether async
    execution is supported, not whether the target is async right now.
    
    gdb/ChangeLog:
    
            PR gdb/26199
            * infrun.c (prepare_to_wait): Check target_can_async_p instead of
            target_is_async_p.
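
The difference between the two predicates can be sketched as follows. This is a simplified model, not the code in infrun.c: toy_target, should_mark_old, should_mark_new and count_reentries are hypothetical, and the loop assumes that every re-entry ends with the TARGET_WAITKIND_NO_RESUMED event being ignored, as in the scenario above.

```cpp
// Toy model only.  The struct and functions are hypothetical; only the
// idea of "can async" versus "is async" comes from the commit message.
#include <cassert>

struct toy_target
{
  bool can_async;  // Target supports asynchronous execution at all.
  bool is_async;   // Target is registered in the event loop right now.
};

// Old predicate: mark the infrun handler whenever the target is not
// async at this moment.
bool should_mark_old (const toy_target &t) { return !t.is_async; }

// New predicate: mark only for targets that cannot do async at all.
bool should_mark_new (const toy_target &t) { return !t.can_async; }

// Count how many times the event loop re-enters the "ignore NO_RESUMED,
// then prepare to wait" path, capped at `limit` iterations.
int count_reentries (const toy_target &t, bool use_new_predicate, int limit)
{
  assert (limit >= 0);

  int reentries = 0;
  bool marked = true;  // The first wakeup that delivers NO_RESUMED.
  while (marked && reentries < limit)
    {
      reentries++;
      marked = use_new_predicate ? should_mark_new (t) : should_mark_old (t);
    }
  return reentries;
}
```

For a target that supports async but is not registered right now, the old predicate keeps re-marking until the cap is reached, while the new one stops after a single pass.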

Comment 4 Sourceware Commits 2020-07-10 22:57:47 UTC

The master branch has been updated by Pedro Alves <palves@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=7d3badc6a88b510f96c9aa0bab8f3375292d23bf

commit 7d3badc6a88b510f96c9aa0bab8f3375292d23bf
Author: Pedro Alves <pedro@palves.net>
Date:   Sat Jul 4 19:26:59 2020 +0100

    Fix handle_no_resumed w/ multiple targets
    
    handle_no_resumed is currently not considering multiple targets.
    
    Say you have two inferiors 1 and 2, each connected to a different
    target, A and B.
    
    Now say you set inferior 2 running, with "continue &".
    
    Now you select a thread of inferior 1, say thread 1.2, and continue in
    the foreground.  All other threads of inferior 1 are left stopped.
    Thread 1.2 exits, and thus target A has no other resumed thread, so it
    reports TARGET_WAITKIND_NO_RESUMED.
    
    At this point, if both inferiors were running in the same target,
    handle_no_resumed would realize that threads of inferior 2 are still
    executing, so the TARGET_WAITKIND_NO_RESUMED event should be ignored.
    But because handle_no_resumed only walks the threads of the current
    target, it misses noticing that threads of inferior 2 are still
    executing.  The fix is just to walk over all threads of all targets.
    
    A testcase covering the use case above will be added in a following
    patch.  It can't be added yet because it depends on yet another fix to
    handle_no_resumed not included here.
    
    gdb/ChangeLog:
    
            PR gdb/26199
            * infrun.c (handle_no_resumed): Handle multiple targets.
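
The two-target scenario above can be modelled in a few lines. This is a sketch, not the handle_no_resumed implementation: toy_thread and both functions are hypothetical, and a target is reduced to an integer id.

```cpp
// Toy model only: not GDB code.  All names are hypothetical.
#include <cassert>  // For assert-based checks by callers of this sketch.
#include <vector>

struct toy_thread
{
  int target_id;   // Which target the thread belongs to.
  bool executing;  // Whether the thread is currently running.
};

// Old behaviour: only threads of the target that reported the event are
// considered when deciding whether to ignore NO_RESUMED.
bool ignore_no_resumed_old (const std::vector<toy_thread> &threads,
                            int event_target)
{
  for (const toy_thread &t : threads)
    if (t.target_id == event_target && t.executing)
      return true;
  return false;
}

// New behaviour: threads of all targets are considered.
bool ignore_no_resumed_new (const std::vector<toy_thread> &threads)
{
  for (const toy_thread &t : threads)
    if (t.executing)
      return true;
  return false;
}
```

With two stopped threads on target 1 and one executing thread on target 2, the old check for an event from target 1 does not ignore it, while the new check does.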

Comment 5 Sourceware Commits 2020-07-10 22:57:52 UTC

The master branch has been updated by Pedro Alves <palves@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=d6cc5d980a902d273d424c49fc55e77757c3a05d

commit d6cc5d980a902d273d424c49fc55e77757c3a05d
Author: Pedro Alves <pedro@palves.net>
Date:   Sat Jul 4 20:51:36 2020 +0100

    Make handle_no_resumed transfer terminal
    
    Let's consider the same use case as in the previous commit:
    
    Say you have two inferiors 1 and 2, each connected to a different
    target, A and B.
    
    Now say you set inferior 2 running, with "continue &".
    
    Now you select a thread of inferior 1, say thread 1.2, and continue in
    the foreground.  All other threads of inferior 1 are left stopped.
    Thread 1.2 exits, and thus target A has no other resumed thread, so it
    reports TARGET_WAITKIND_NO_RESUMED.
    
    At this point, because the threads of inferior 2 are still executing
    the TARGET_WAITKIND_NO_RESUMED event is ignored.
    
    Now, the user types Ctrl-C.  Because GDB had previously put inferior 1
    in the foreground, the kernel sends the SIGINT to that inferior.
    However, no thread in that inferior is executing right now, so ptrace
    never intercepts the SIGINT -- it is never dequeued by any thread.
    The result is that GDB's CLI is stuck.  There's no way to get back the
    prompt (unless inferior 2 happens to report some event).
    
    The fix in this commit is to make handle_no_resumed give the terminal
    to some other inferior that still has threads executing so that a
    subsequent Ctrl-C reaches that target first (and then GDB intercepts
    the SIGINT).  This is a bit hacky, but seems like the best we can do
    with the current design.
    
    I think that putting all native inferiors in their own session would
    help fixing this in a clean way, since with that a Ctrl-C on GDB's
    terminal will _always_ reach GDB first, and then GDB can decide how to
    pause the inferior.  But that's a much larger change.
    
    The testcase added by the following patch needs this fix.
    
    gdb/ChangeLog:
    
            PR gdb/26199
            * infrun.c (handle_no_resumed): Transfer terminal to inferior with
            executing threads.

Comment 6 Sourceware Commits 2020-07-10 22:57:57 UTC

The master branch has been updated by Pedro Alves <palves@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=4fdba23df6d202b9d428818fc209e527797b576f

commit 4fdba23df6d202b9d428818fc209e527797b576f
Author: Pedro Alves <pedro@palves.net>
Date:   Sat Jul 4 19:26:59 2020 +0100

    Testcase for previous handle_no_resumed fixes
    
    This adds a testcase that covers the scenarios described in the
    previous two commits.
    
    gdb/testsuite/ChangeLog:
    
            PR gdb/26199
            * gdb.multi/multi-target.c (exit_thread): New.
            (thread_start): Break loop if EXIT_THREAD.
            * gdb.multi/multi-target.exp (test_no_unwaited_for): New proc.
            (top level) Call test_no_resumed.

Comment 7 Sourceware Commits 2020-07-10 22:58:02 UTC

The master branch has been updated by Pedro Alves <palves@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=b3e3a4c11496dca710c62e32db80e27dd7301223

commit b3e3a4c11496dca710c62e32db80e27dd7301223
Author: Simon Marchi <simon.marchi@polymtl.ca>
Date:   Sat Jul 4 13:33:19 2020 +0100

    Fix GDB busy loop when interrupting non-stop program (PR 26199)
    
    When interrupting a program in non-stop, the program gets interrupted
    correctly, but GDB busy loops (the event loop is always woken up).
    
    Here is how to reproduce it:
    
     1. Start GDB: ./gdb -nx --data-directory=data-directory -ex "set non-stop 1" --args  /bin/sleep 60
     2. Run the program with "run"
     3. Interrupt with ^C.
     4. Look into htop, see GDB taking 100% CPU
    
    Debugging `handle_file_event`, we see that the event source that wakes
    up the event loop is the linux-nat one:
    
     (top-gdb) p file_ptr.proc
     $5 = (handler_func *) 0xb9cccd <handle_target_event(int, gdb_client_data)>
                                     ^^^^^^^^^^^^^^^^^^^
                                             |
                                             \-- the linux-nat callback
    
    Debugging fetch_inferior_event and do_target_wait, we see that we
    don't actually call `wait` on the linux-nat target, because
    inferior_matches returns false:
    
     auto inferior_matches = [&wait_ptid] (inferior *inf)
       {
         return (inf->process_target () != NULL
                 && (threads_are_executing (inf->process_target ())
                     || threads_are_resumed_pending_p (inf))
                 && ptid_t (inf->pid).matches (wait_ptid));
       };
    
    because `threads_are_executing` is false.
    
    What happens is:
    
     1. User types ctrl-c, that writes in the linux-nat pipe, waking up
        the event source.
    
     2. linux-nat's wait gets called, the SIGINT event is returned, but
        before returning, it marks the pipe again, in order for wait to
        get called again:
    
        /* If we requested any event, and something came out, assume there
           may be more.  If we requested a specific lwp or process, also
           assume there may be more.  */
        if (target_is_async_p ()
            && ((ourstatus->kind != TARGET_WAITKIND_IGNORE
                 && ourstatus->kind != TARGET_WAITKIND_NO_RESUMED)
                || ptid != minus_one_ptid))
          async_file_mark ();
    
     3. The SIGINT event is handled, the program is stopped, the stop
        notification is printed.
    
     4. The event loop is woken up again because of the `async_file_mark`
        of step 2.
    
     5. Because `inferior_matches` returns false, we never call
        linux-nat's wait, so the pipe stays readable.
    
     6. Goto 4.
    
    Pedro says:
    
    This commit fixes it by letting do_target_wait call target_wait even
    if threads_are_executing is false.  This will normally result in the
    target returning TARGET_WAITKIND_NO_RESUMED, and _not_ marking its
    event source again.  This results in infrun calling into the
    target only once (i.e., breaking the busy loop).
    
    Note that the busy loop bug didn't trigger in all-stop mode because
    all-stop handles this by unregistering the target from the event loop
    as soon as it was all stopped -- see
    inf-loop.c:inferior_event_handler's INF_EXEC_COMPLETE handling.  If we
    remove that non-stop check from inferior_event_handler, and replace
    the target_has_execution check with threads_are_executing, it
    also fixes the issue for non-stop.  I considered that as the final
    solution, but decided that the solution proposed here is simply a
    simpler and more future-proof design.  With the
    TARGET_WAITKIND_NO_RESUMED handling fixes done in the previous
    patches, I think it should be possible to always keep the target
    registered in the event loop, meaning we could eliminate the
    target_async(0) call from inferior_event_handler as well as most of
    the target_async(1) calls in the target backends.  That would allow in
    the future e.g., the remote target reporting asynchronous
    notifications even if all threads are stopped.  I haven't attempted
    that, though.
    
    gdb/ChangeLog:
    yyyy-mm-dd  Simon Marchi  <simon.marchi@polymtl.ca>
                Pedro Alves  <pedro@palves.net>
    
            PR gdb/26199
            * infrun.c (threads_are_resumed_pending_p): Delete.
            (do_target_wait): Remove threads_are_executing and
            threads_are_resumed_pending_p checks from the inferior_matches
            lambda.  Update comments.
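
The effect of dropping the checks from the inferior_matches lambda can be sketched with a small model. It is not the do_target_wait code: toy_native_target, toy_waitkind and count_wakeups_after_stop are hypothetical, and the target's wait is assumed to clear its mark and to re-mark only when it has a real event to report.

```cpp
// Toy model only: not GDB code.  All names are hypothetical.
#include <cassert>

enum class toy_waitkind { stopped, no_resumed };

struct toy_native_target
{
  bool pipe_marked = true;         // Left marked by the earlier wait call.
  bool threads_executing = false;  // Everything is stopped after SIGINT.

  // Model of the target's wait: drain the pipe, report the event, and
  // re-mark only when a real event came out.
  toy_waitkind wait ()
  {
    pipe_marked = false;
    if (!threads_executing)
      return toy_waitkind::no_resumed;
    pipe_marked = true;
    return toy_waitkind::stopped;
  }
};

// Count event-loop wakeups after the stop, capped at `limit`.  When
// `require_executing` is true, the filter skips the target unless it
// has executing threads (the old behaviour).
int count_wakeups_after_stop (bool require_executing, int limit)
{
  assert (limit >= 0);

  toy_native_target t;
  int wakeups = 0;
  while (t.pipe_marked && wakeups < limit)
    {
      wakeups++;
      bool matches = !require_executing || t.threads_executing;
      if (matches)
        t.wait ();
    }
  return wakeups;
}
```

With the executing check in place, the mark is never cleared and the wakeup count reaches the cap; without it, wait is called once, returns no_resumed, and the loop goes quiet.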

Comment 8 Pedro Alves 2020-07-10 23:05:46 UTC

Fixed.