Bug 28942

Summary: Problem with breakpoint condition calling a function in multi-threaded program
Product: gdb
Reporter: Simon Marchi <simon.marchi>
Component: gdb
Assignee: Not yet assigned to anyone <unassigned>
Status: NEW
Severity: normal
CC: aburgess, mingwei.zhang, ppluzhnikov, simark, tankut.baris.aktemur, tromey
Priority: P2
Version: HEAD
Target Milestone: ---
Host:
Target:
Build:
Last reconfirmed: 2022-03-04 00:00:00
Attachments: A WIP patch

Description Simon Marchi 2022-03-03 19:40:58 UTC
This program:

---8<---
#include <pthread.h>
#include <unistd.h>

static void
function_that_segfaults (void)
{
  int *p = 0;
  *p = 1;
}

static void
break_here (void)
{}

static void *
thread_func (void *p)
{
  for (;;)
    sleep (1);
  return NULL;
}

static void *
thread_func2 (void *p)
{
  sleep (1);
  break_here ();
  return NULL;
}

int
main (void)
{
  pthread_t threads[10];
  pthread_create (&threads[0], NULL, thread_func, NULL);
  pthread_create (&threads[1], NULL, thread_func, NULL);
  pthread_create (&threads[2], NULL, thread_func, NULL);
  pthread_create (&threads[3], NULL, thread_func, NULL);
  pthread_create (&threads[5], NULL, thread_func, NULL);
  pthread_create (&threads[6], NULL, thread_func, NULL);
  pthread_create (&threads[4], NULL, thread_func2, NULL);
  sleep (60);
  return function_that_segfaults != 0;
}

--->8---


$ gcc test.c  -g3 -O0 -pthread
$ ./gdb -q -nx --data-directory=data-directory a.out -ex "b break_here if function_that_segfaults()"
Reading symbols from a.out...
Breakpoint 1 at 0x11ae: file test.c, line 13.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb/gdb/a.out 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff7d99700 (LWP 3567019)]
[New Thread 0x7ffff7598700 (LWP 3567020)]
[New Thread 0x7ffff6d97700 (LWP 3567021)]
[New Thread 0x7ffff6596700 (LWP 3567022)]
[New Thread 0x7ffff5d95700 (LWP 3567023)]
[New Thread 0x7ffff5594700 (LWP 3567024)]
[New Thread 0x7ffff4d93700 (LWP 3567025)]
Error in testing breakpoint condition:
Couldn't get registers: No such process.
An error occurred while in a function called from GDB.
Evaluation of the expression containing the function
(function_that_segfaults) will be abandoned.
When the function is done executing, GDB will silently stop.
Selected thread is running.
(gdb) 

The "Couldn't get registers: No such process." error is very strange.  We would expect GDB to report that the thread received a signal (SIGSEGV) while running the hand-called function.

And then if you continue with:

(gdb) kill
Kill the program being debugged? (y or n) y
[Inferior 1 (process 3567034) killed]
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb/gdb/a.out
/home/smarchi/src/binutils-gdb/gdb/target.c:2607: internal-error: target_wait: Assertion `!proc_target->commit_resumed_state' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.

Looking at the proceed call here:

(top-gdb) bt
#0  proceed (addr=0x555555555189, siggnal=GDB_SIGNAL_0) at /home/smarchi/src/binutils-gdb/gdb/infrun.c:3046
#1  0x0000558e5d95a128 in run_inferior_call (sm=std::unique_ptr<call_thread_fsm> = {...}, call_thread=0x61700009e680, real_pc=0x555555555189) at /home/smarchi/src/binutils-gdb/gdb/infcall.c:610
#2  0x0000558e5d95ff6e in call_function_by_hand_dummy (function=0x611000489d00, default_return_type=0x0, args=..., dummy_dtor=0x0, dummy_dtor_data=0x0) at /home/smarchi/src/binutils-gdb/gdb/infcall.c:1279
#3  0x0000558e5d95b4be in call_function_by_hand (function=0x611000489d00, default_return_type=0x0, args=...) at /home/smarchi/src/binutils-gdb/gdb/infcall.c:741
#4  0x0000558e5d609a2e in evaluate_subexp_do_call (exp=0x6030001579f0, noside=EVAL_NORMAL, callee=0x611000489d00, argvec=..., function_name=0x0, default_return_type=0x0) at /home/smarchi/src/binutils-gdb/gdb/eval.c:674
#5  0x0000558e5d60a7c5 in expr::operation::evaluate_funcall (this=0x603000157ab0, expect_type=0x0, exp=0x6030001579f0, noside=EVAL_NORMAL, function_name=0x0, args=std::__debug::vector of length 0, capacity 0) at /home/smarchi/src/binutils-gdb/gdb/eval.c:702
#6  0x0000558e5c4090aa in expr::operation::evaluate_funcall (this=0x603000157ab0, expect_type=0x0, exp=0x6030001579f0, noside=EVAL_NORMAL, args=std::__debug::vector of length 0, capacity 0) at /home/smarchi/src/binutils-gdb/gdb/expression.h:136
#7  0x0000558e5d60ad63 in expr::var_value_operation::evaluate_funcall (this=0x603000157ab0, expect_type=0x0, exp=0x6030001579f0, noside=EVAL_NORMAL, args=std::__debug::vector of length 0, capacity 0) at /home/smarchi/src/binutils-gdb/gdb/eval.c:714
#8  0x0000558e5cb8d2be in expr::funcall_operation::evaluate (this=0x607000083f80, expect_type=0x0, exp=0x6030001579f0, noside=EVAL_NORMAL) at /home/smarchi/src/binutils-gdb/gdb/expop.h:2178
During symbol reading: Child DIE 0x8d876c and its abstract origin 0x8f9b2b have different parents
#9  0x0000558e5d604e00 in expression::evaluate (this=0x6030001579f0, expect_type=0x0, noside=EVAL_NORMAL) at /home/smarchi/src/binutils-gdb/gdb/eval.c:101
#10 0x0000558e5d604f71 in evaluate_expression (exp=0x6030001579f0, expect_type=0x0) at /home/smarchi/src/binutils-gdb/gdb/eval.c:115
#11 0x0000558e5c8c99b9 in breakpoint_cond_eval (exp=0x6030001579f0) at /home/smarchi/src/binutils-gdb/gdb/breakpoint.c:4739
#12 0x0000558e5c8d1f11 in bpstat_check_breakpoint_conditions (bs=0x6060001b29c0, thread=0x61700009e680) at /home/smarchi/src/binutils-gdb/gdb/breakpoint.c:5303
#13 0x0000558e5c8d4b45 in bpstat_stop_status (aspace=0x603000045a00, bp_addr=0x5555555551ae, thread=0x61700009e680, ws=..., stop_chain=0x6060001b29c0) at /home/smarchi/src/binutils-gdb/gdb/breakpoint.c:5475
#14 0x0000558e5da1f939 in handle_signal_stop (ecs=0x7fff97a4bd50) at /home/smarchi/src/binutils-gdb/gdb/infrun.c:6200
#15 0x0000558e5da19441 in handle_inferior_event (ecs=0x7fff97a4bd50) at /home/smarchi/src/binutils-gdb/gdb/infrun.c:5690
#16 0x0000558e5da05206 in fetch_inferior_event () at /home/smarchi/src/binutils-gdb/gdb/infrun.c:4091
#17 0x0000558e5d94fad4 in inferior_event_handler (event_type=INF_REG_EVENT) at /home/smarchi/src/binutils-gdb/gdb/inf-loop.c:41
#18 0x0000558e5dc29bdd in handle_target_event (error=0, client_data=0x0) at /home/smarchi/src/binutils-gdb/gdb/linux-nat.c:4096
#19 0x0000558e5f4e4dd1 in handle_file_event (file_ptr=0x607000016050, ready_mask=1) at /home/smarchi/src/binutils-gdb/gdbsupport/event-loop.cc:574
#20 0x0000558e5f4e562c in gdb_wait_for_event (block=0) at /home/smarchi/src/binutils-gdb/gdbsupport/event-loop.cc:700
#21 0x0000558e5f4e343c in gdb_do_one_event () at /home/smarchi/src/binutils-gdb/gdbsupport/event-loop.cc:212
#22 0x0000558e5dd29d99 in start_event_loop () at /home/smarchi/src/binutils-gdb/gdb/main.c:421
#23 0x0000558e5dd2a1df in captured_command_loop () at /home/smarchi/src/binutils-gdb/gdb/main.c:481
#24 0x0000558e5dd2fad9 in captured_main (data=0x7fff97a4c200) at /home/smarchi/src/binutils-gdb/gdb/main.c:1348
#25 0x0000558e5dd2fbc2 in gdb_main (args=0x7fff97a4c200) at /home/smarchi/src/binutils-gdb/gdb/main.c:1363
#26 0x0000558e5c3e1ddd in main (argc=7, argv=0x7fff97a4c378) at /home/smarchi/src/binutils-gdb/gdb/gdb.c:32


We find that GDB tries to resume threads other than the event thread (the one for which we evaluate the breakpoint condition), because it thinks they are not resumed.  This is probably because the linux-nat target added them in the non-resumed state and they stayed that way.
Comment 1 Andrew Burgess 2022-03-04 11:15:27 UTC
Wow, it's a small world.  I literally just started looking at this same issue this week.

The whole "thread not marked resumed" issue is fixed by this excellent patch:

  https://sourceware.org/pipermail/gdb-patches/2022-January/185109.html

Which you know, as you already posted a link to this bug in that thread.

However, there are many other problems related to this issue.

The first thing I noticed is that run_inferior_call calls clear_proceed_status, which in all-stop mode calls clear_proceed_status_thread for each thread.

Once the above patch is merged I plan to add an assert to clear_proceed_status_thread that the thread we are clearing is not resumed and not executing.

Currently the not-executing assert will fail, but (due to the above patch being missing) the not-resumed assert will only fail sometimes.

If we ignore the clear_proceed_status issue, then with the above patch the resumed flag will be correct, and GDB will not try to start the already resumed threads as part of the inferior call.

However, after the call, as we're in all-stop mode, GDB will stop all threads.

If the breakpoint condition doesn't segfault, but instead just returns false, then GDB will resume only the single thread that stopped for the breakpoint, leaving all the other threads stopped.

I'm currently working on the idea that, while we evaluate the breakpoint condition, we temporarily place GDB into non-stop mode.  This would mean that we only restart the one thread to evaluate the b/p condition and, afterwards, only expect that one thread to stop.  I still need to do lots more testing, though; maybe this is a really bad idea.

The only other option I can think of is to somehow have the infcall code figure out that we are in all-stop mode, but some threads are already running.  Then, after making the inferior call, we only stop the set of threads that we started.  However, this has a massive problem: how to handle new threads?

I'll clean up my current patch and post it to this bug later today in case anyone wants to try it.  I'll also add your crashing function test to my working branch to make sure that is handled too.
Comment 2 Andrew Burgess 2022-03-04 14:01:08 UTC
Created attachment 14005 [details]
A WIP patch

Here's the patch I'm currently working on.  This should apply to current master and resolves the issue in this bug, as well as the original issue I was working on.  I've run the complete testsuite on GNU/Linux x86-64 with no regressions.

I still need to do lots more testing, especially around things like handling targets that don't support non-stop mode, and what happens if some other thread stops while we are evaluating the breakpoint condition.

But any initial thoughts are welcome.
Comment 3 Simon Marchi 2022-03-04 14:44:08 UTC
(In reply to Andrew Burgess from comment #1)
> Wow, it's a small world.  I literally  just started looking at this same
> issue this week.
> 
> The whole thread not marked resumed issue is fixed by this excellent patch:
> 
>   https://sourceware.org/pipermail/gdb-patches/2022-January/185109.html
> 
> Which you know as you already posted a link to this bug to that thread.
> 
> However, there are so many other problem related to this issue.
> 
> The first thing I noticed is that run_inferior_call calls
> clear_proceed_status, which in all-stop mode calls
> clear_proceed_status_thread for each thread.
> 
> Once the above patch is merged I plan to add an assert to
> clear_proceed_status_thread that the thread we are clearing is not resumed
> and not executing.
> 
> Currently the not-executing assert will fail, but (due to the above patch
> being missing) the not-resumed assert will only fail sometimes.
> 
> If we ignore the clear_proceed_status issue, then with the above patch the
> resumed flag will be correct, and GDB will not try to start the already
> resumed threads as part of the inferior call.
> 
> However, after the call, as we're in all-stop mode, GDB will stop all
> threads.
> 
> However, if the breakpoint condition doesn't segfault, but instead just
> returns false, then GDB will resume the single thread that stopped for the
> breakpoint - leaving all the other threads stopped.

Yeah, the fact that the breakpoint condition function caused a segfault is just another difficulty on top.  You can ignore that part.

> I'm currently working on the idea that when we evaluate the breakpoint
> condition we temporarily place GDB into non-stop mode, this would mean that,
> when we evaluate the b/p condition we only restart the one thread, and
> afterwards, we only expect the one thread to stop, but I need to do lots
> more testing yet - maybe this is a really bad idea.
> 
> The only other option I can think of is to somehow have the infcall code
> figure out that we are in all-stop mode, but some threads are already
> running.  Then, after making the inferior call we only stop the set of
> threads that we started.  However, this has a massive problem; how to handle
> new threads?

When thinking about this, my intuition was more like the latter.

In all-stop over a non-stop target:

1. A thread hits a breakpoint, only that thread is stopped while we process the breakpoint hit
2. When doing the infcall in the breakpoint condition, only that thread is resumed (the other threads already are)
3. When the infcall is done, only that thread is stopped
4a. If the condition is true, then GDB stops all threads
4b. If the condition is false, that thread is resumed

In all-stop over an all-stop target:

1. A thread hits a breakpoint, all threads are stopped while we process the breakpoint hit
2. When doing the infcall in the breakpoint condition, all threads are resumed (is this what would happen if the user were to do a manual infcall?)
3. When the infcall is done, all threads are stopped
4a. If the condition is true, all threads remain stopped
4b. If the condition is false, all threads are resumed

In non-stop over a non-stop target, it looks like the "all-stop over a non-stop target" case, except that not all threads are stopped in step 4a.

I didn't really think through what would happen to new threads, I suppose they would just keep running.
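
For illustration, the three cases above can be written down as a small lookup.  This is only a sketch of the behaviour proposed in this comment, not GDB code: every name in it (`config`, `scope`, `cond_eval_plan`, `plan_for`) is made up for the example, and new threads are not modelled at all.

```cpp
// Illustrative model only: none of these names exist in GDB.
enum class config
{
  all_stop_on_all_stop,   // all-stop mode over an all-stop target
  all_stop_on_non_stop,   // all-stop mode over a non-stop target
  non_stop_on_non_stop    // non-stop mode over a non-stop target
};

enum class scope { event_thread, all_threads };

// Which threads are affected at each step of the lists above.
struct cond_eval_plan
{
  scope stopped_on_hit;         // step 1
  scope resumed_for_infcall;    // step 2
  scope stopped_after_infcall;  // step 3
  scope stopped_if_true;        // step 4a
  scope resumed_if_false;       // step 4b
};

constexpr cond_eval_plan
plan_for (config c)
{
  // All-stop target: every step applies to all threads.
  if (c == config::all_stop_on_all_stop)
    return { scope::all_threads, scope::all_threads, scope::all_threads,
             scope::all_threads, scope::all_threads };

  // Non-stop target: steps 1-3 and 4b touch only the event thread.
  // Step 4a stops everything only when the user-visible mode is all-stop.
  return { scope::event_thread, scope::event_thread, scope::event_thread,
           c == config::all_stop_on_non_stop ? scope::all_threads
                                             : scope::event_thread,
           scope::event_thread };
}

// Compile-time usage example.
static_assert (plan_for (config::all_stop_on_non_stop).resumed_for_infcall
               == scope::event_thread,
               "only the event thread is resumed for the infcall");
```

Writing it this way makes it easy to see that the first three steps depend only on the target, while step 4a is the only place where the user-visible mode matters.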

> 
> I'll clean up my correct patch and post it to this bug later today in case
> anyone wants to try it.  I'll also add your crashing function test to my
> working branch to make sure that is handled too.

Thanks, that's some really quick customer service.
Comment 4 Baris Aktemur 2022-03-07 07:34:57 UTC
A highly-related patch series was this:

  https://sourceware.org/pipermail/gdb-patches/2021-March/176654.html

Perhaps there are a few useful things that still apply to the current master.

> In all-stop over an all-stop target:
>
> 1. A thread hits a breakpoint, all threads are stopped while we process
> the breakpoint hit
> 2. When doing the infcall in the breakpoint condition, all threads are
> resumed (is this what would happen if the user were to do a manual infcall?)

I think GDB should act like the "scheduler-locking on" mode in this case,
because if another thread has a pending event, the condition evaluation
could be dismissed.  This is what distinguishes an infcall in condition
evaluation from a manual infcall.  The series linked above introduced an
`in_cond_eval` flag to make this distinction.
Comment 6 Tom Tromey 2022-10-21 17:57:30 UTC
*** Bug 23191 has been marked as a duplicate of this bug. ***
Comment 7 Tom Tromey 2022-10-21 17:58:28 UTC
*** Bug 28911 has been marked as a duplicate of this bug. ***
Comment 8 Sourceware Commits 2024-03-25 17:40:59 UTC
The master branch has been updated by Andrew Burgess <aburgess@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=3df7843699ff3610f89ac880685396b531d8ec1b

commit 3df7843699ff3610f89ac880685396b531d8ec1b
Author: Andrew Burgess <aburgess@redhat.com>
Date:   Fri Oct 9 13:27:13 2020 +0200

    gdb: fix b/p conditions with infcalls in multi-threaded inferiors
    
    This commit fixes bug PR 28942, that is, creating a conditional
    breakpoint in a multi-threaded inferior, where the breakpoint
    condition includes an inferior function call.
    
    Currently, when a user tries to create such a breakpoint, then GDB
    will fail with:
    
      (gdb) break infcall-from-bp-cond-single.c:61 if (return_true ())
      Breakpoint 2 at 0x4011fa: file /tmp/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/infcall-from-bp-cond-single.c, line 61.
      (gdb) continue
      Continuing.
      [New Thread 0x7ffff7c5d700 (LWP 2460150)]
      [New Thread 0x7ffff745c700 (LWP 2460151)]
      [New Thread 0x7ffff6c5b700 (LWP 2460152)]
      [New Thread 0x7ffff645a700 (LWP 2460153)]
      [New Thread 0x7ffff5c59700 (LWP 2460154)]
      Error in testing breakpoint condition:
      Couldn't get registers: No such process.
      An error occurred while in a function called from GDB.
      Evaluation of the expression containing the function
      (return_true) will be abandoned.
      When the function is done executing, GDB will silently stop.
      Selected thread is running.
      (gdb)
    
    Or, in some cases, like this:
    
      (gdb) break infcall-from-bp-cond-simple.c:56 if (is_matching_tid (arg, 1))
      Breakpoint 2 at 0x401194: file /tmp/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/infcall-from-bp-cond-simple.c, line 56.
      (gdb) continue
      Continuing.
      [New Thread 0x7ffff7c5d700 (LWP 2461106)]
      [New Thread 0x7ffff745c700 (LWP 2461107)]
      ../../src.release/gdb/nat/x86-linux-dregs.c:146: internal-error: x86_linux_update_debug_registers: Assertion `lwp_is_stopped (lwp)' failed.
      A problem internal to GDB has been detected,
      further debugging may prove unreliable.
    
    The precise error depends on the exact thread state, so there are race
    conditions depending on which threads have fully started, and which
    have not.  But the underlying problem is always the same; when GDB
    tries to execute the inferior function call from within the breakpoint
    condition, GDB will, incorrectly, try to resume threads that are
    already running - GDB doesn't realise that some threads might already
    be running.
    
    The solution proposed in this patch requires an additional member
    variable thread_info::in_cond_eval.  This flag is set to true (in
    breakpoint.c) when GDB is evaluating a breakpoint condition.
    
    In user_visible_resume_ptid (infrun.c), when the in_cond_eval flag is
    true, then GDB will only try to resume the current thread, that is,
    the thread for which the breakpoint condition is being evaluated.
    This solves the problem of GDB trying to resume threads that are
    already running.
    
    The next problem is that inferior function calls are assumed to be
    synchronous, that is, GDB doesn't expect to start an inferior function
    call in thread #1, then receive a stop from thread #2 for some other,
    unrelated reason.  To prevent GDB responding to an event from another
    thread, we update fetch_inferior_event and do_target_wait in infrun.c,
    so that, when an inferior function call (on behalf of a breakpoint
    condition) is in progress, we only wait for events from the current
    thread (the one evaluating the condition).
    
    In do_target_wait I had to change the inferior_matches lambda
    function, which is used to select which inferior to wait on.
    Previously the logic was this:
    
       auto inferior_matches = [&wait_ptid] (inferior *inf)
         {
           return (inf->process_target () != nullptr
                   && ptid_t (inf->pid).matches (wait_ptid));
         };
    
    This compares the pid of the inferior against the complete ptid we
    want to wait on.  Before this commit wait_ptid was only ever
    minus_one_ptid (which is special, and means any process), and so every
    inferior would match.
    
    After this commit though wait_ptid might represent a specific thread
    in a specific inferior.  If we compare the pid of the inferior to a
    specific ptid then these will not match.  The fix is to compare
    against the pid extracted from the wait_ptid, not against the complete
    wait_ptid itself.
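
The difference between the two comparisons can be shown with a small stand-alone model.  This is a sketch under stated assumptions: `ptid_model` is a hypothetical, two-field simplification of GDB's `ptid_t`, its `matches` is written to follow the behaviour described above (a "match anything" filter, a whole-process filter, or an exact match), and `old_inferior_matches` / `new_inferior_matches` are illustrative names rather than the code in the commit.

```cpp
// Simplified stand-in for a ptid: a process id plus a thread (lwp) id.
// pid == -1 plays the role of minus_one_ptid ("any process").
struct ptid_model
{
  int pid;
  long lwp;

  constexpr bool is_any () const { return pid == -1; }
  constexpr bool is_whole_process () const { return pid > 0 && lwp == 0; }

  constexpr bool operator== (const ptid_model &other) const
  { return pid == other.pid && lwp == other.lwp; }

  // True if *this is selected by FILTER.
  constexpr bool matches (const ptid_model &filter) const
  {
    if (filter.is_any ())
      return true;
    if (filter.is_whole_process () && pid == filter.pid)
      return true;
    return *this == filter;
  }
};

// Old logic: compare the inferior's pid against the complete wait ptid.
constexpr bool
old_inferior_matches (int inf_pid, ptid_model wait_ptid)
{
  return ptid_model { inf_pid, 0 }.matches (wait_ptid);
}

// Described fix: compare against only the pid taken from the wait ptid.
constexpr bool
new_inferior_matches (int inf_pid, ptid_model wait_ptid)
{
  return ptid_model { inf_pid, 0 }.matches (ptid_model { wait_ptid.pid, 0 });
}

// Waiting on thread 101 of process 100: the old check rejects inferior 100.
static_assert (!old_inferior_matches (100, ptid_model { 100, 101 }),
               "old check misses the inferior");
static_assert (new_inferior_matches (100, ptid_model { 100, 101 }),
               "new check selects the inferior");
```

In this model, both versions still accept every inferior when the wait ptid is the "any process" value, which is the only value the old code ever saw.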
    
    In fetch_inferior_event, after receiving the event, we only want to
    stop all the other threads, and call inferior_event_handler with
    INF_EXEC_COMPLETE, if we are not evaluating a conditional breakpoint.
    If we are, then all the other threads should be left doing whatever
    they were before.  The inferior_event_handler call will be performed
    once the breakpoint condition has finished being evaluated, and GDB
    decides to stop or not.
    
    The final problem that needs solving relates to GDB's commit-resume
    mechanism, which allows GDB to collect resume requests into a single
    packet in order to reduce traffic to a remote target.
    
    The problem is that the commit-resume mechanism will not send any
    resume requests for an inferior if there are already events pending on
    the GDB side.
    
    Imagine an inferior with two threads.  Both threads hit a breakpoint,
    maybe the same conditional breakpoint.  At this point there are two
    pending events, one for each thread.
    
    GDB selects one of the events and spots that this is a conditional
    breakpoint, GDB evaluates the condition.
    
    The condition includes an inferior function call, so GDB sets up for
    the call and resumes the one thread, the resume request is added to
    the commit-resume queue.
    
    When the commit-resume queue is committed GDB sees that there is a
    pending event from another thread, and so doesn't send any resume
    requests to the actual target, GDB is assuming that when we wait we
    will select the event from the other thread.
    
    However, as this is an inferior function call for a condition
    evaluation, we will not select the event from the other thread, we
    only care about events from the thread that is evaluating the
    condition - and the resume for this thread was never sent to the
    target.
    
    And so, GDB hangs, waiting for an event from a thread that was never
    fully resumed.
    
    To fix this issue I have added the concept of "forcing" the
    commit-resume queue.  When enabling commit resume, if the force flag
    is true, then any resumes will be committed to the target, even if
    there are other threads with pending events.
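
The effect of the force flag can be summarised with a tiny decision function.  This is only a sketch of the rule as the commit message describes it, not GDB's commit-resume code; `resume_queue_state` and `should_send_resumes` are hypothetical names.

```cpp
// Illustrative model only; these names are not GDB's.
struct resume_queue_state
{
  bool has_queued_resumes;              // something is waiting to be sent
  bool other_thread_has_pending_event;  // an event is already pending in GDB
};

// Decide whether queued resume requests are sent to the target.
constexpr bool
should_send_resumes (resume_queue_state state, bool force)
{
  if (!state.has_queued_resumes)
    return false;

  // Without FORCE, a pending event from another thread suppresses the
  // resume.  With FORCE, the resume is sent regardless.
  return force || !state.other_thread_has_pending_event;
}

// The hang scenario: two threads hit the breakpoint, one event is still
// pending, and the infcall resume for the other thread is queued.
static_assert (!should_send_resumes (resume_queue_state { true, true }, false),
               "unforced: resume is held back, so the infcall never runs");
static_assert (should_send_resumes (resume_queue_state { true, true }, true),
               "forced: resume reaches the target");
```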
    
    A note on authorship: this patch was based on some work done by
    Natalia Saiapova and Tankut Baris Aktemur from Intel[1].  I have made
    some changes to their work in this version.
    
    Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=28942
    
    [1] https://sourceware.org/pipermail/gdb-patches/2020-October/172454.html
    
    Co-authored-by: Natalia Saiapova <natalia.saiapova@intel.com>
    Co-authored-by: Tankut Baris Aktemur <tankut.baris.aktemur@intel.com>
    Reviewed-By: Tankut Baris Aktemur <tankut.baris.aktemur@intel.com>
    Tested-By: Luis Machado <luis.machado@arm.com>
    Tested-By: Keith Seitz <keiths@redhat.com>