[Bug gdb/22992] GDB and Microsoft Windows thread pool

palves at redhat dot com sourceware-bugzilla@sourceware.org
Fri Oct 4 08:50:00 GMT 2019


https://sourceware.org/bugzilla/show_bug.cgi?id=22992

--- Comment #15 from Pedro Alves <palves at redhat dot com> ---
(In reply to Tom Tromey from comment #14)
> (In reply to Pedro Alves from comment #13)
> 
> > And supposedly, the breakpoint is NOT installed in the target at this point,
> > right?  Or is it some other breakpoint?  Seems like it's the same from your
> > logs.  If it is indeed the breakpoint that was removed, then that again
> > suggests this is Windows returning the queued event.
> 
> As I understand it, gdb itself uninstalls the breakpoints before
> single-stepping.
> I did not actually verify this.

Not all breakpoints, just the breakpoint being stepped over.  Hence my
question, is this always an event for that breakpoint that is being stepped
over, thus indicating that it must be a queued event, or could it be an event
for some other breakpoint, indicating that the other threads actually ran even
when supposedly suspended?  This makes me realize that a better test would be
to look at the PCs of all the threads, make sure none of them changed, except
the thread that is being stepped.  If none moves, then this really must be just
a queued event.

> 
> The only user breakpoint in my setup is the "dprintf" one (see the original
> repro instructions in this bug).
> 
> > SuspendThread returns the thread's previous suspend count.  You could use
> > that to check whether the program is unsuspending threads on its own.
> 
> Yes, I did this at some point.  In fact what I did is hack gdb to suspend
> each thread 5 times, then also print out the suspension count after getting
> a debug event.  All the relevant threads always reported their suspension
> count as 5 -- which to my mind means it is unlikely that the inferior is
> doing
> anything tricky.
> 
> > > I am thinking it might be ok to simply ignore such events.
> > 
> > Does Windows decrement the PC after a break automatically?  I forget.
> 
> There's no mention of this in windows-nat.c.  So I guess it defers to the
> arch?

By "Windows", I mean the system/kernel, before reporting the event.  On x86, a
breakpoint triggers, the PC will be pointing to the instruction after the
breakpoint instruction, int3 / 0xcc, which is a one-byte-long instruction,
since a breakpoint trigger is just the int3 instruction executing.  So when a
breakpoint instruction runs / triggers, something must rewind the PC back one
byte, so that it points at the start of the original instruction again. 
Traditionally, this is done by gdb core, by infrun.c:adjust_pc_after_break.
If the target / system / kernel itself does the PC rewinding before reporting
the event, then we should skip adjust_pc_after_break and make the target return
true for target_supports_stopped_by_sw_breakpoint along with implementing
target_stopped_by_sw_breakpoint.  linux-nat.c and gdbserver/linux-low.c do that
nowadays.

> 
> 
> I changed my gdb to (1) suspend threads when stepping (still seems like a big
> oversight to me ... and despite what I said earlier, I'm not 100% sure that
> gdbserver does this either), and (2) ignore spurious breakpoint events.

Indeed seems like an oversight if true.  Tests like gdb.threads/schedlock.exp
should expose this.  ISTR that they pass when testing with Cygwin, but that was
a long time ago.  thread_rec suspends the thread, it may be that the code is
assuming that core gdb fetches registers from all threads all all debug events
or something.  Which is a really bad assumption.

You should be able to check this by confirming that "set scheduler-locking on"
really leaves other threads suspended, but, if you do "info threads", you'll
fetch registers from all threads, thus masking the problem, since fetching
registers ends up in thread_rec -> SuspendThread for all threads, I think.

> 
> Now I get even weirder behavior:
> 
> Thread 4 received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 7664.0x32c8]
> 0x004015dc in (anonymous namespace)::Callback (context=0xe0) at sample.cxx:22
> 22	  Sleep(100);  /* Do a "work". */
> 
> Look at that PC, then:
> 
> (gdb) disassemble
> Dump of assembler code for function (anonymous
> namespace)::Callback(TP_CALLBACK_INSTANCE*, void*, TP_WORK*):
>    0x004015d0 <+0>:	push   %ebp
>    0x004015d1 <+1>:	mov    %esp,%ebp
>    0x004015d3 <+3>:	sub    $0x18,%esp
>    0x004015d6 <+6>:	movl   $0x64,(%esp)
>    0x004015dd <+13>:	mov    0x4091fc,%eax
> 
> ... notice that it is in the middle of an instruction.

This is exactly the sort of thing to expect if Windows itself doesn't unwind
the PC for us.

The breakpoint was originally set at 0x4015d6.  So when the breakpoint
triggers, the PC will be pointing at 0x4015d7.  If nothing rewinds the PC back
to 0x4015d6, then when the thread is resumed, it'll execute the instruction at
0x4015d7.  Sometimes that crashes immediately for trying to run an invalid
instruction, but in this case looks like that slice of the original instruction
was a valid 2-byte instruction, which executes and manages to not crash the
program.  Then the next "instruction" as at 0x004015dc, and that one crashes.
I.e., the program's PC pointer is no longer pointing at the real instruction
boundaries.

Even if Windows did the PC rewinding, it would seem to me that it would be a
bad idea to ignore the events.  For breakpoints, ignoring is kind of fine since
as soon as you resume the thread, and the breakpoint is still installed, the
thread will just trigger the breakpoint again.  But for other events, like for
example watchpoints, that won't be true, if you ignore such events, you'll just
lose them.

So it's better to queue these events in gdb too.

> 
> Maybe the queueing approach is better and we should just live with the
> spurious stop.
> If the theory is that this is caused by internal event buffering in Windows,
> then I
> would definitely argue this direction.

The queueing approach is better, but that should _not_ result in spurious
stops.
We do queueing in linux-nat.c too, for example.  

You either 

- #1 queue all events except breakpoints, and for breakpoints unwind PC and
discard the event, or,

- #2 queue all events, for breakpoints unwind PC, and implement
target_stopped_by_sw_breakpoint (this requires unwound PCs)

#1 makes threads retrigger delayed breakpoint events, possibly over and over in
a contended scenario, worsening contention.  gdb/linux-nat.c and
gdbserver/linux-low.c did this for many years.

#2 lets core GDB know to ignore a spurious breakpoint event if the breakpoint
is removed between a thread hitting it and the queued event bubbling up to gdb
core.  Without target_stopped_by_sw_breakpoint, such an delayed event would
just look like a spurious SIGTRAP, and thus reported as such to the user.

I'd go for #2.

-- 
You are receiving this mail because:
You are on the CC list for the bug.


More information about the Gdb-prs mailing list