[PATCH 12/12] gdb: use two displaced step buffers on amd64/Linux

Simon Marchi <simark@simark.ca>
Thu Nov 26 21:43:22 GMT 2020


On 2020-11-25 3:56 p.m., Simon Marchi wrote:
> On 2020-11-25 3:07 p.m., Simon Marchi wrote:
>> We still have to figure out why the performance regresses compared to
>> master.  One thing is that we still do an extra resume attempt each
>> time.  We do one resume where the disp step prepare succeeds and one
>> more where it fails.  That's twice as much work as before, where we did
>> one successful prepare and then skipped the next ones.  I'll make an
>> experimental change such that the arch can say "the prepare worked, but
>> now I'm out of buffer", just to see how that improves (or not)
>> performance.
>
> That made just a small difference.  Let's call this new patch E:
>
>  #9 + A + B + C + D + E: 105,727

Ok, I played a bit more with this and found a case where we do more
work, which may explain the difference.

handle_signal_stop calls finish_step_over, which calls start_step_over.
Let's say the displaced step buffer is in use by some thread when
another thread hits a breakpoint: we go into start_step_over to see if
we can start a new step over.  We can't, because the event thread isn't
the one using the displaced stepping buffer, so the buffer is still
occupied.

In the pre-patch code, that early check in start_step_over:

      /* If this inferior already has a displaced step in process,
	 don't start a new one.  */
      if (displaced_step_in_progress (tp->inf))
	continue;

means that we won't even try one resumption: we skip over all threads
in the chain (the only cost is the cost of iterating).
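
For context, the surrounding loop is roughly this (a simplified sketch,
not the verbatim infrun.c source; the iteration helper name is made up):

  /* Sketch of the pre-patch start_step_over loop.  */
  for (thread_info *tp : threads_waiting_for_step_over ()) /* hypothetical */
    {
      /* If this inferior already has a displaced step in process,
         don't start a new one.  */
      if (displaced_step_in_progress (tp->inf))
        continue;

      /* ... otherwise prepare a displaced step and resume TP ...  */
    }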

In the patch I previously called "C", start_step_over starts with the
assumption that the inferior has an available buffer, and needs to do at
least one resumption attempt that returns "unavailable" before it
assumes there are no more available.  That one unnecessary resumption
attempt happens over and over and ends up being costly.

I changed my patch C to make the "unavailable" property persistent,
instead of just using it locally in start_step_over.  If a displaced
step prepare returns "unavailable", we mark that this inferior has no
more buffers available, and we only reset this flag when a displaced
step for this inferior finishes.  This way, when a thread hits a
breakpoint and start_step_over is called, the unavailable flag is
already set.  The last time we tried to prepare a displaced step, there
were no buffers available, and if no displaced step has finished for
this inferior since, why bother asking again?  As a result, we don't
even try one resume if the buffers are already taken, just like with
the pre-patch code.
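
A minimal sketch of that idea (all names here are simplified or
hypothetical, not necessarily what the actual patch uses):

  /* Sketch only.  */

  enum displaced_step_prepare_status
  {
    DISPLACED_STEP_PREPARE_STATUS_OK,
    DISPLACED_STEP_PREPARE_STATUS_UNAVAILABLE,
  };

  struct inferior
  {
    /* Set when the last prepare attempt found no free buffer; cleared
       when a displaced step for this inferior finishes.  */
    bool displaced_step_unavailable = false;
  };

  /* Stand-in for the architecture's prepare hook, declared but not
     defined in this sketch.  */
  displaced_step_prepare_status arch_prepare (inferior *inf);

  displaced_step_prepare_status
  displaced_step_prepare (inferior *inf)
  {
    /* Last time we asked there was no buffer, and none has been freed
       since, so don't bother asking the architecture again.  */
    if (inf->displaced_step_unavailable)
      return DISPLACED_STEP_PREPARE_STATUS_UNAVAILABLE;

    displaced_step_prepare_status status = arch_prepare (inf);

    if (status == DISPLACED_STEP_PREPARE_STATUS_UNAVAILABLE)
      inf->displaced_step_unavailable = true;

    return status;
  }

  void
  displaced_step_finish (inferior *inf)
  {
    /* A buffer just freed up, so prepare attempts may succeed again.  */
    inf->displaced_step_unavailable = false;
  }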

I re-did some tests with the new patch C:

master:                          110,736
9:                                19,973
9 + A + B + C:                   107,711
9 + A + B + C + E:               111,135
everything up to using two
  displaced stepping buffers:    110,697

Note that I dropped patch D, which was to break out of the
start_step_over loop when getting "unavailable".  It didn't help much
and wasn't desired functionality anyway.

And as a reminder, patch E is that when returning "_OK", the arch can
also tell us "that was the last one, any further request would return
unavailable".  That allows doing just one resumption attempt (at most)
per start_step_over invocation, rather than two.
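
In other words (again a sketch; the "_OK_LAST" name is made up), the
prepare status could carry that extra bit of information:

  /* Illustrative only: a third status value the architecture could
     return to say "prepared OK, but that was the last free buffer".  */
  enum displaced_step_prepare_status
  {
    /* Prepared successfully; more buffers remain.  */
    DISPLACED_STEP_PREPARE_STATUS_OK,

    /* Prepared successfully, but a further request would return
       UNAVAILABLE, so the caller can set the "unavailable" flag right
       away instead of burning one more prepare/resume attempt.  */
    DISPLACED_STEP_PREPARE_STATUS_OK_LAST,

    /* No buffer available.  */
    DISPLACED_STEP_PREPARE_STATUS_UNAVAILABLE,
  };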

What this shows is that with everything up to patch E, we are pretty
much on par with master's performance, but with the more flexible
architecture.  That makes sense, as (I think) we do pretty much the
same work.

Re-thinking about patches C and E, I'm thinking that we could let the
architecture's displaced stepping implementation decide when to set the
"unavailable" flag, instead of the core of GDB setting it.  The
architecture would then choose the policy it wants to adopt (see the
sketch after this list):

- set the flag when it used its last buffer (equivalent to patch E)
- set the flag when it returns "_UNAVAILABLE" (equivalent to patch C')
- never set the unavailable flag (equivalent to patch 9)
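
Building on the earlier sketches, an arch-side prepare hook could
implement any of those policies (helper names are invented for the
example):

  /* Hypothetical helpers, declared but not defined in this sketch.  */
  struct thread_info;
  struct displaced_step_buffer;
  displaced_step_buffer *claim_free_buffer (inferior *inf);
  void copy_insn_to_buffer (thread_info *tp, displaced_step_buffer *buf);
  bool free_buffers_remain (inferior *inf);

  displaced_step_prepare_status
  amd64_displaced_step_prepare (inferior *inf, thread_info *tp)
  {
    displaced_step_buffer *buf = claim_free_buffer (inf);

    if (buf == nullptr)
      {
        /* Patch C' policy: remember that we ran out, so the core won't
           ask again until a step finishes.  A patch-9-style policy
           would simply skip this assignment.  */
        inf->displaced_step_unavailable = true;
        return DISPLACED_STEP_PREPARE_STATUS_UNAVAILABLE;
      }

    copy_insn_to_buffer (tp, buf);

    /* Patch E policy: we just claimed the last buffer, so announce it
       now and spare the core a doomed attempt.  */
    if (!free_buffers_remain (inf))
      inf->displaced_step_unavailable = true;

    return DISPLACED_STEP_PREPARE_STATUS_OK;
  }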

It also shows that using two buffers on x86-64 doesn't improve
performance (unless there's another unfortunate bottleneck I haven't
found).  But that's what I expected.  My hypothesis is that the actual
step portion of the displaced step is so fast that we never really use
the buffers in parallel.  If we start two displaced steps, by the time
we prepare the second one, the first one has already finished and is
waiting for GDB to clean it up.  If for some reason we could displaced
step an instruction that took, say, 1 second to execute, then having
many displaced step buffers would be useful, because GDB would be able
to start them all and have them really run in parallel.

I'll prepare a user branch with all those fixup patches so you can take
a look, and if you think it makes sense I'll be ready for a v2.

Simon

