[PATCH 09/12] gdb: move displaced stepping logic to gdbarch, allow starting concurrent displaced steps

Thu Nov 26 14:24:33 GMT 2020

On 11/25/20 7:29 PM, Simon Marchi wrote:
> On 2020-11-24 8:40 p.m., Pedro Alves wrote:
> 
>>> +
>>> +  m_original_pc = regcache_read_pc (regcache);
>>> +  displaced_pc = m_addr;
>>> +
>>> +  /* Save the original contents of the displaced stepping buffer.  */
>>> +  m_saved_copy.resize (len);
>>> +
>>> +  int status = target_read_memory (m_addr, m_saved_copy.data (), len);
>>> +  if (status != 0)
>>> +    throw_error (MEMORY_ERROR,
>>> +		 _("Error accessing memory address %s (%s) for "
>>> +		   "displaced-stepping scratch space."),
>>> +		 paddress (arch, m_addr), safe_strerror (status));
>>> +
>>> +  displaced_debug_printf ("saved %s: %s",
>>> +			  paddress (arch, m_addr),
>>> +			  displaced_step_dump_bytes
>>> +			    (m_saved_copy.data (), len).c_str ());
>>> +
>>> +  m_copy_insn_closure = gdbarch_displaced_step_copy_insn (arch,
>>> +							  m_original_pc,
>>> +							  m_addr,
>>> +							  regcache);
>>> +  if (m_copy_insn_closure == nullptr)
>>> +    {
>>> +      /* The architecture doesn't know how or want to displaced step
>>> +        this instruction or instruction sequence.  Fallback to
>>> +        stepping over the breakpoint in-line.  */
>>> +      return DISPLACED_STEP_PREPARE_STATUS_ERROR;
>>> +    }
>>> +
>>> +  try
>>> +    {
>>> +      /* Resume execution at the copy.  */
>>> +      regcache_write_pc (regcache, m_addr);
>>> +    }
>>> +  catch (...)
>>
>> I'd rather we don't use 'catch (...)' throughout gdb.  It swallows
>> too much.  We should catch gdb_exception_error in most cases or
>> gdb_exception when we also want to catch Ctrl-C/QUIT.  If something
>> throws something else, I'd rather not swallow it, since it's most
>> probably a bug.
>>
>> I'm a bit confused on error handling here.  Before the patch, we used
>> displaced_step_reset_cleanup, but didn't swallow the error.  After the patch,
>> the exception is swallowed and DISPLACED_STEP_PREPARE_STATUS_ERROR is
>> returned.  Was that on purpose?
> 
> Hmm, I probably did this to ensure that any error would be reported by
> returning a status code, rather than some errors returning _ERROR and
> some errors throwing an exception.  So just for consistency.
> 

The function still throws errors anyway, e.g. a bit above:

  int status = target_read_memory (m_addr, m_saved_copy.data (), len);
  if (status != 0)
    throw_error (MEMORY_ERROR,
		 _("Error accessing memory address %s (%s) for "
		   "displaced-stepping scratch space."),
		 paddress (arch, m_addr), safe_strerror (status));

That's then caught by displaced_step_prepare, which results in
the "disabling displaced stepping" warning, and GDB disabling
displaced stepping for the inferior.  

Also, NOT_SUPPORTED_ERROR is handled the same way.  See:

 commit 16b4184277c4ad5b4a20278060fd3f6259d1ed49
 Author:     Pedro Alves <palves@redhat.com>
 AuthorDate: Tue Mar 15 16:33:04 2016 +0000

    Fix PR gdb/19676: Disable displaced stepping if /proc not mounted

Any other error is just propagated out, likely all the way to
the prompt:

 static displaced_step_prepare_status
 displaced_step_prepare (thread_info *thread)
 {
 ...
   catch (const gdb_exception_error &ex)
     {
      struct displaced_step_inferior_state *displaced_state;

      if (ex.error != MEMORY_ERROR
	  && ex.error != NOT_SUPPORTED_ERROR)
	throw;

> Would it make sense to remove the _ERROR status code and report all
> errors using exceptions then?

I don't think so, because there are 3 kinds of "errors":

 #1 - MEMORY and NOT_SUPPORTED errors result in disabling displaced stepping
      for the inferior permanently.

 #2 - other errors/exceptions propagate outwards.

 #3 - "prepare" returning "Can't do displace step for this instruction" for
      some reason.  Like the backend not supporting displaced stepping
      that particular instruction, or the breakpoint address falling
      within the scratch pad.  I.e., these cases:

...
   if (breakpoint_in_range_p (aspace, m_addr, len))
    {
      /* There's a breakpoint set in the scratch pad location range
	 (which is usually around the entry point).  We'd either
	 install it before resuming, which would overwrite/corrupt the
	 scratch pad, or if it was already inserted, this displaced
	 step would overwrite it.  The latter is OK in the sense that
	 we already assume that no thread is going to execute the code
	 in the scratch pad range (after initial startup) anyway, but
	 the former is unacceptable.  Simply punt and fallback to
	 stepping over this breakpoint in-line.  */
      displaced_debug_printf ("breakpoint set in scratch pad.  "
			      "Stepping over breakpoint in-line instead.");

      return DISPLACED_STEP_PREPARE_STATUS_ERROR;
    }
...
   m_copy_insn_closure = gdbarch_displaced_step_copy_insn (arch,
 							  m_original_pc,
 							  m_addr,
 							  regcache);
   if (m_copy_insn_closure == nullptr)
     {
       /* The architecture doesn't know how or want to displaced step
         this instruction or instruction sequence.  Fallback to
         stepping over the breakpoint in-line.  */
       return DISPLACED_STEP_PREPARE_STATUS_ERROR;
     }

The "failed to write pc" try/catch case seems debatable, but I'd
lean on it being case #2.

I think that case #3 calls for renaming DISPLACED_STEP_PREPARE_STATUS_ERROR
to something else to avoid confusion.  Note that in master, -1 wasn't
documented in terms of error:

...
   Returns 1 if preparing was successful -- this thread is going to be
   stepped now; 0 if displaced stepping this thread got queued; or -1
   if this instruction can't be displaced stepped.  */

 static int
 displaced_step_prepare_throw (thread_info *tp)

>>> +  /* Steal the global thread step over chain.  */
>>
>> It would be good expand the command explaining _why_ steal.
> 
> How about:
> 
>   /* Steal the global thread step over chain.  As we try to initiate displaced
>      steps, threads will be enqueued in the global chain if no buffers are
>      available.  If we iterated on the global chain directly, we might iterate
>      indefinitely.  */

Sounds good.

> 
>>> @@ -2033,7 +1945,30 @@ start_step_over (void)
>>>  	 displaced step on a thread of other process. */
>>>      }
>>>
>>> -  return false;
>>> +    /* If there are threads left in the THREADS_TO_STEP list, but we have
>>> +       detected that we can't start anything more, put back these threads
>>> +       in the global list.  */
>>
>> Do we also need to do this if an exception happens to escape the function?
>> We might end up pretty bonkers anyhow if that happens, though...
> 
> Yes, I realized that when playing with this code again yesterday.  A
> scope_exit would help for this I think.
> 
> But I'm also wondering if we should enqueue back the thread TP that was
> dequeued, that caused an error.  This one has already been dequeued from
> THREADS_TO_STEP.  For example, if thread_still_needs_step_over throws
> (it reads the PC, so that could happen), should we put back the thread
> in the global chain?  If we do, and the same error happens over and
> over, the thread will never leave the step over chain and be the first
> in line, preventing any other displaced step to happen.
> 
> If, on the other hand, the error was just an intermittent one and we
> don't put it back in the queue, the next time we'll resume the thread
> we'll realize it needs a step over and enqueue it again.  So it sounds
> less risky to me to just not enqueue back the thread on error.  I'm
> really not sure, I find it difficult to reason about all the possible
> cases.

Yeah, not enqueuing sounds OK.  And indeed, error handling in all of
infrun is very hard to reason about, and likely to end up with things
messed up, e.g., with a thread marked as running when it is really
stopped.

> 
> Also, re-reading that code, I noticed that we only return "true" when an
> inline step over is started, or if a displaced step on an all-stop
> target was started.  That's the case even before the patch.  I suppose
> that's on purpose, because the caller wants to know whether it can
> resume more stuff or not?  

Yes.  If it returns true, then the caller shouldn't resume anything
and go straight to waiting.  ISTR having some trouble coming up with
the right conditions at the callers.  Maybe it could be simplified.
I no longer remember why start_step_over returns false here:

  /* Don't start a new step-over if we already have an in-line
     step-over operation ongoing.  */
  if (step_over_info_valid_p ())
    return false;

Off hand, it seems like that should return true.  But it may
be that the intention was to only return true if anything
that was stopped before was started.  I'd need to experiment
to know better now.  :-/

> If so, I think we should update
> start_step_over's documentation:
> 
>   /* Are there any pending step-over requests?  If so, run all we can
>      now and return true.  Otherwise, return false.  */
> 

>>> @@ -5293,25 +5226,23 @@ handle_inferior_event (struct execution_control_state *ecs)
>>>        {
>>>  	struct regcache *regcache = get_thread_regcache (ecs->event_thread);
>>>  	struct gdbarch *gdbarch = regcache->arch ();
>>> +	inferior *parent_inf = find_inferior_ptid (ecs->target, ecs->ptid);
>>> +
>>> +	if (ecs->ws.kind == TARGET_WAITKIND_FORKED)
>>> +	  {
>>> +	    /* Restore in the child process any displaced stepping buffers that
>>> +	       were in use at the time of the fork.  */
>>> +	    gdbarch_displaced_step_restore_all_in_ptid
>>> +	      (gdbarch, parent_inf, ecs->ws.value.related_pid);
>>> +	  }
>>
>> Why was this moved out of the displaced_step_in_progress_thread check
>> below?
> 
> If:
> 
> 1. thread 1 is doing a displaced step
> 2. thread 2 does a fork, not letting time for thread 1 to complete
> 
> event_thread is thread 2, and it is not doing a displaced step, so we
> don't enter the
> 
>   if (displaced_step_in_progress_thread (ecs->event_thread))
> 
> But we still want to restore the bytes in the child used as the
> displaced stepping buffer for thread 1.  Does that make sense?  Is it
> possible that the current code is not correct in that regard?

Indeed, looks like it.  Good catch.  How about fixing that
in a separate patch, then, with its own commit log?

>>> +int
>>> +thread_step_over_chain_length (thread_info *tp)
>>> +{
>>> +  if (tp == nullptr)
>>> +    return 0;
>>> +
>>> +  int num = 1;
>>
>> Should we add:
>>
>>   gdb_assert (thread_is_in_step_over_chain (tp));
>>
>> ?
>>
>> But then again, it's a bit odd to allow tp == nullptr,
>> but not allow tp->step_over_next == nullptr?
> 
> Well, a pointer to an empty thread step over chain is a nullptr: when
> the global chain is empty, global_thread_step_over_chain is nullptr.
> Same for threads_to_step that is local to start_step_over.  So
> it makes sense that we can do thread_step_over_chain_length(nullptr).
> 
> However, it would be a mistake to pass a non-nullptr pointer to a thread
> that is not in a step over chain, so I think it would be a good idea to
> add the assert you propose.
> 
> I will clarify the documentation of the function:
> 
>   /* Return the length of the the step over chain TP is in.
> 
>      If TP is non-nullptr, the thread must be in a step over chain.
>      TP may be nullptr, in which case it denotes an empty list, so a length of
>      0.  */
> 

OK.