[PATCH v2 5/6] gdb, infrun: fix silent inferior switch in do_target_wait()

Wed Oct 23 11:52:04 GMT 2024

"Metzger, Markus T" <markus.t.metzger@intel.com> writes:

> Hello Andrew,
>
>>> In do_target_wait(), we iterate over inferiors and call
>>> do_target_wait_1(), which eventually calls target_wait() per inferior.
>>> Each time, we wait for minus_one_ptid.
>>
>>Could you expand on this a little more please as I don't understand what
>>you are telling me, and I can't align this explanation with the code.
>
> This used to be minus_one_ptid until you changed it (back) to wait_ptid in
> 07505b613a0605ef42a004364ab8ccb70fd3c8eb.
>
> There is a single call of that function that passes minus_one_ptid outside
> of condition evaluation:
>
>     /* If the thread is in the middle of the condition evaluation, wait for
>        an event from the current thread.  Otherwise, wait for an event from
>        any thread.  */
>     ptid_t waiton_ptid = in_cond_eval ? inferior_ptid : minus_one_ptid;
>
>     if (!do_target_wait (waiton_ptid, &ecs, TARGET_WNOHANG))
>       {
> 	infrun_debug_printf ("do_target_wait returned no event");
> 	disable_commit_resumed.reset_and_commit ();
> 	return;
>       }
>
>>So it's not clear to me what you mean when you say 'Each time, we wait
>>for minus_one_ptid'.  Maybe you mean elsewhere in GDB?
>
> I should say 'outside of condition evaluation, we wait for minus_one_ptid'.
>
>
>>> In some cases, e.g. gdb.threads/detach-step-over.exp, we ask to wait for
>>> one inferior, and get an event from a different inferior back without
>>> noticing the inferior switch.
>>
>>The way you mention detach-step-over.exp it seems like you think GDB is
>>doing something wrong.  But the test fully passes for me without this
>>patch, and you don't adjust the expected output.  So I'm guessing that
>>nothing changes for that test.  But that seems at odds with the say you
>>mention the test; my reading of the above is that GDB is doing something
>>wrong.  Could you clarify what's going on please.
>
> It is definitely wrong to go through all the effort of randomly selecting an inferior
> and then iterating over inferiors one-by-one if we then allow switching to a different
> inferior in the actual wait call.
>
>
>>I see that without this patch you do get some failures in the next patch
>>of this series, and my next job is to debug those failures in order to
>>understand this patch more...
>
> The next patch organizes the replay/record state per inferior.  A single target
> may have inferiors that are currently recording and inferiors that are currently
> replaying.  We need to wait for individual inferiors since they are handled by
> different target methods: the replaying ones are handled by record targets and
> the recording ones are handled by the target beneath the record target.
>
> We could handle this in the record target by splitting forwarding the request
> for some threads and handling it itself for others.
>
> Since the infrun do_target_wait() code obviously intends to do the wait per
> inferior and just forgets to adjust the wait focus to one inferior, fixing that
> seemed to be the better approach to me.

I agree with you.  But I do have one concern.  As part of the next
commit I think you should add:

  gdb_assert (ptid != minus_one_ptid);

to the start of `record_btrace_target::wait`.  For the reasons you
explain, this function can't currently wait on all inferiors, if one
inferior requires handling in record_btrace_target while another
inferior requires handling in the target beneath.

Which leads to my concern.  With that assert in place we are, I claim,
making the promise that target_ops::wait() will _never_ be called with
minus_one_ptid, but this code:

    /* Make sure we're not widening WAIT_PTID.  */
    if (!ptid.matches (wait_ptid)
	/* Targets that cannot async will be asked for a blocking wait.

	   Blocking wait does not work inferior-by-inferior if the target
	   provides more than one inferior.  Fall back to waiting for
	   WAIT_PTID in that case.  */
	|| !target_can_async_p () || ((options & TARGET_WNOHANG) == 0)
	/* FIXME: I don't see why we should have inferiors with zero pid,
	   which indicates that the respective ptid is not a process.
	   They do exist, though, and we cannot wait for them.  */
	|| !ptid.is_pid ())
      ptid = wait_ptid;

>From your branch, in infrun.c, would, if I understand it correctly, mean
that ptid could be set to minus_one_ptid in some cases.

So my concern here is: are there situations where
record_btrace_target::wait can end up being called with minus_one_ptid?

Lets be clear, I totally agree with you that for the case where
TARGET_WNOHANG is set, it makes more sense that we wait on each inferior
in turn.  But I wonder if we should still be changing
record_btrace_target::wait so that it can handle minus_one_ptid?

I applied this HACK:

  diff --git a/gdb/record-btrace.c b/gdb/record-btrace.c
  index 7937bb3f8db..367225c3320 100644
  --- a/gdb/record-btrace.c
  +++ b/gdb/record-btrace.c
  @@ -2637,6 +2637,13 @@ record_btrace_target::wait (ptid_t ptid, struct target_waitstatus *status,

     if (moving.empty ())
       {
  +      if ((::execution_direction != EXEC_REVERSE)
  +	  && ptid == minus_one_ptid)
  +	{
  +	  infrun_debug_printf ("forwarding the wait to target beneath");
  +	  return this->beneath ()->wait (ptid, status, options);
  +	}
  +
         *status = btrace_step_no_moving_threads ();

         DEBUG ("wait ended by %s: %s", null_ptid.to_string ().c_str (),

and revert patch 5/6 in your series, and all of your new tests still
pass.  I don't think the above is the correct way to change ::wait, but
it does hint that it might be possible to support minus_one_ptid.

As I said, I do agree with you that do_target_wait could be improved,
but I still need convincing that record_btrace_target::wait is OK as is.

Thanks,
Andrew

>
>
>>... but is there any chance we could write a test that exposes an issue
>>that is fixed by this patch, that could be added just with this patch?
>>
>>> ---
>>>  gdb/infrun.c    | 46 +++++++++++++++++++++++++++++++++++++++++++---
>>>  gdb/linux-nat.c | 17 ++++++++++++-----
>>>  gdb/remote.c    | 22 +++++++++++++++++-----
>>>  3 files changed, 72 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/gdb/infrun.c b/gdb/infrun.c
>>> index a5030b16376..9ca0571065c 100644
>>> --- a/gdb/infrun.c
>>> +++ b/gdb/infrun.c
>>> @@ -4198,7 +4198,23 @@ do_target_wait (ptid_t wait_ptid,
>>execution_control_state *ecs,
>>>
>>>    auto do_wait = [&] (inferior *inf)
>>>    {
>>> -    ecs->ptid = do_target_wait_1 (inf, wait_ptid, &ecs->ws, options);
>>> +    ptid_t ptid { inf->pid };
>>> +
>>> +    /* Make sure we're not widening WAIT_PTID.  */
>>> +    if (!ptid.matches (wait_ptid)
>>
>>Sorry, I just don't understand what's going on here.
>>
>>If wait_ptid is minus_one_ptid then ptid will be for pid -1, and the
>>ptid.matches() call will return true.  But a ptid_t initialised with
>>only a pid of -1 is equal to minus_one_ptid, right?
>
> Right, but this ptid is initialized to inf->pid, not to wait_ptid.pid ().
>
>
>>If wait_ptid is for a process only, with no thread, then ptid ==
>>wait_ptid, and ptid.matches() returns true.
>>
>>If wait_ptid is for a thread then ptid.matches() returns false, and we
>>revert to using wait_ptid.
>>
>>So, if my understanding is correct, ptid always ends up equal to
>>wait_ptid.
>>
>>
>>> +	/* Targets that cannot async will be asked for a blocking wait.
>>> +
>>> +	   Blocking wait does not work inferior-by-inferior if the target
>>> +	   provides more than one inferior.  Fall back to waiting for
>>> +	   WAIT_PTID in that case.  */
>>> +	|| !target_can_async_p () || ((options & TARGET_WNOHANG) == 0)
>>> +	/* FIXME: I don't see why we should have inferiors with zero pid,
>>> +	   which indicates that the respective ptid is not a process.
>>> +	   They do exist, though, and we cannot wait for them.  */
>>
>>I'm not a fan of FIXME comments that don't lay out a clear path for how
>>to fix the issue.  e.g. "FIXME: This can't be changed until GCC 33.1 is
>>released with support for DWARF8" or some such.
>>
>>I cannot offer much insight here, but you do say "They do exist, though,
>>..." so you've clearly tracked down where they come from.  So why not
>>share those details.  Better yet, rewrite the comment as an explanation:
>>
>>  "Process with a zero pid are created by ...... Unfortunately we cannot
>>   wait for a zero pid so revert to using the full wait_ptid."
>
> One such inferior is created by initialize_inferiors ():
>   set_current_inferior (add_inferior_silent (0));
>
> It is strange that the ptid for such an inferior says that this isn't a process.
>
> I changed the comment to:
> 	/* We cannot wait for inferiors without a pid.
>
> 	   One such inferior is created by initialize_inferiors () to
> 	   ensure that there always is an inferior.  */
>
>
>>> +	|| !ptid.is_pid ())
>>> +      ptid = wait_ptid;
>>> +
>>> +    ecs->ptid = do_target_wait_1 (inf, ptid, &ecs->ws, options);
>>>      ecs->target = inf->process_target ();
>>>      return (ecs->ws.kind () != TARGET_WAITKIND_IGNORE);
>>>    };
>>> @@ -4208,6 +4224,12 @@ do_target_wait (ptid_t wait_ptid,
>>execution_control_state *ecs,
>>>       reported the stop to the user, polling for events.  */
>>>    scoped_restore_current_thread restore_thread;
>>>
>>> +  /* The first TARGET_WAITKIND_NO_RESUMED execution state.
>>> +
>>> +     If we do not find a more interesting event, we will report that.  */
>>
>>I think this comment needs rewording.  I had to study the code before
>>this made sense to me.  By default the no_resumed object's `ws` member
>>will be TARGET_WAITKIND_IGNORE, so initially I didn't understand what
>>the TARGET_WAITKIND_NO_RESUMED comment was going on about.
>>
>>I think the comment should explain how the object will be copied into
>>later on when we find a TARGET_WAITKIND_NO_RESUMED result.
>
> Yeah, this is another side-effect of not adjusting the wait scope.  If you're
> waiting for minus_one_ptid, the target will check all its inferiors and only
> return TARGET_WAITKIND_NO_RESUMED if no other thread has anything
> to report.
>
> It's still wrong in the case of multi-target, where we could have an event on
> some other target, yet we still return the TARGET_WAITKIND_NO_RESUMED
> from the first target we checked.
>
> I changed the comment to:
>   /* The first TARGET_WAITKIND_NO_RESUMED execution state.
>
>      We do not want to return TARGET_WAITKIND_NO_RESUMED right away since
>      another inferior may have a more interesting event to report.  If
>      there is no other event to report, after all, we want to report the
>      first such event.
>
>      This variable holds that first event, which will be copied on the
>      first TARGET_WAITKIND_NO_RESUMED below.  */
>
>
>>I'm continuing to review this patch, but answers to some of the above
>>(and a test?) would help a lot.
>
> Here come answers to the above.  I don't have a test that exploits this
> inferior switching.  I noticed it when I debugged gdb.threads/detach-step-over.exp.
> I don't remember why exactly I debugged it.  There were fails, of course, but
> they were not related to the inferior switching.
>
> Thanks,
> Markus.
>
> Intel Deutschland GmbH
> Registered Address: Am Campeon 10, 85579 Neubiberg, Germany
> Tel: +49 89 99 8853-0, www.intel.de
> Managing Directors: Sean Fennelly, Jeffrey Schneiderman, Tiffany Doon Silva
> Chairperson of the Supervisory Board: Nicole Lau
> Registered Office: Munich
> Commercial Register: Amtsgericht Muenchen HRB 186928