Bug 30041

Summary:	pthread_cancel() hangs under gdb on aarch64
Product:	gdb	Reporter:	Stas Sergeev <stsp>
Component:	shlibs	Assignee:	Not yet assigned to anyone <unassigned>
Status:	UNCONFIRMED ---
Severity:	normal	CC:	drepper.fsp
Priority:	P2
Version:	12.1
Target Milestone:	---
Host:		Target:
Build:		Last reconfirmed:
Attachments:	test case test case test case

Description Stas Sergeev 2023-01-24 14:21:47 UTC

Created attachment 14612
test case

Under qemu emulating aarch64, please do the following:

$ gcc -Wall -ggdb3 tcanc.c
$ ./a.out
1
2
3
Stopping
4
OK

So far so good.
Now:

$ gdb ./a.out
r
1
2
3
Stopping
4
5
6
[ counting continues indefinitely - main thread stuck in pthread_cancel() ]
[ let's disable SIGALRM just to make sure the hang is permanent ]
^C
Thread 1 "a.out" received signal SIGINT, Interrupt.
__GI__dl_debug_state () at ./elf/dl-debug.c:117
117	./elf/dl-debug.c: No such file or directory.
(gdb) handle SIGALRM nopass
Signal        Stop	Print	Pass to program	Description
SIGALRM       No	No	No		Alarm clock
(gdb) c
Continuing.
^C
Thread 1 "a.out" received signal SIGINT, Interrupt.
__GI__dl_debug_state () at ./elf/dl-debug.c:117
117	in ./elf/dl-debug.c
(gdb) 
[ yes, the hang is permanent, it won't advance even w/o SIGALRM ]


This hang doesn't depend on the SIGALRM rate, i.e. SIGALRM
doesn't hog the CPU; the rate in the test case is
actually rather low. But SIGALRM is a necessary "ingredient",
i.e. w/o SIGALRM the hang is not reproducible.

The stack trace points to some dlopen/unwind games, so I suspect
it's a glibc bug. But if not, maybe it's a gdb bug?

Comment 1 Andreas Schwab 2023-01-24 14:52:25 UTC

I cannot reproduce that, with or without gdb.

Comment 2 Stas Sergeev 2023-01-24 15:25:44 UTC

(In reply to Andreas Schwab from comment #1)
> I cannot reproduce that, with or without gdb.

Are you under qemu?
I use the Ubuntu kinetic-server-cloudimg-arm64.img image
with all updates, and "-cpu cortex-a57 -M virt".
Not sure what else is helpful; maybe you want
ssh access to my VM?

Comment 3 Stas Sergeev 2023-01-24 15:34:37 UTC

gdb is 12.1-3ubuntu2.
What's yours?
If it's a gdb problem, then we
first need to sync up the gdb
version.

Comment 4 Andreas Schwab 2023-01-24 15:41:27 UTC

I have tested it on real hardware.

Comment 5 Stas Sergeev 2023-01-24 15:55:22 UTC

(In reply to Andreas Schwab from comment #4)
> I have tested it on real hardware.

OK, then it seems you need to raise the
SIGALRM frequency. Please change line 35 and
replace the value 4000 with e.g. 500.
That way it even hangs w/o gdb,
but the behavior seems more random, i.e.
now it also hangs in pthread_join().

Comment 6 Stas Sergeev 2023-01-24 18:20:45 UTC

Created attachment 14613
test case

So I lowered the tick interval in
the hope of reproducing on real HW.
But I can't promise; maybe you need
to lower it even more. That frequency allows
the repro w/o gdb, which is already
better.

Comment 7 Andreas Schwab 2023-01-25 09:57:49 UTC

If anything this is a bug in the debugger.  There are two concurrent types of events, the signal and the shlib events, and the constant flow of signal events prevents the shlib event from making forward progress.

Comment 8 Stas Sergeev 2023-01-25 10:17:20 UTC

(In reply to Andreas Schwab from comment #7)
> If anything this is a bug in the debugger.  There are two concurrent types
> of events, the signal and the shlib events, and the constant flow of signal
> events prevents the shlib event from making forward progress.

Yes, this seems to be the case.
I modified the test so that the
second thread disables the timer
after some time. If pthread_cancel()
was hanging, it gets unstuck.
If pthread_join() was hanging, it
doesn't get unstuck, because the
second thread has already terminated,
so the timer shutdown never happens.
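
As an illustration of the timer handling described here (this is not the
attached test case), a periodic SIGALRM can be armed and later shut down
roughly as follows. The use of setitimer(), the helper names tick_start()
and tick_stop(), and the reading of values like 4000 or 500 as microsecond
intervals are all assumptions.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* Number of SIGALRMs seen so far (hypothetical bookkeeping). */
static volatile sig_atomic_t ticks;

static void on_alarm(int sig)
{
    (void)sig;
    ticks++;
}

/* Install a SIGALRM handler and arm ITIMER_REAL so that the signal fires
 * every interval_usec microseconds (must be below 1000000).
 * Returns 0 on success, -1 on error. */
int tick_start(long interval_usec)
{
    struct sigaction sa;
    struct itimerval it;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    memset(&it, 0, sizeof it);
    it.it_interval.tv_usec = interval_usec;
    it.it_value.tv_usec = interval_usec;
    return setitimer(ITIMER_REAL, &it, NULL);
}

/* Disarm the timer: an all-zero it_value stops ITIMER_REAL. */
int tick_stop(void)
{
    struct itimerval it;

    memset(&it, 0, sizeof it);
    return setitimer(ITIMER_REAL, &it, NULL);
}
```

With helpers like these, the second thread would call tick_stop() after its
delay; if that thread has already exited, nothing ever disarms the timer.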

But I thought I excluded such a possibility
by at least 2 things:
- attaching with gdb and doing "handle SIGALRM nopass"
- lowering the SIGALRM rate and making
sure both threads can execute code and
print things.

So I still don't understand what's
going on. If both threads could
sleep() and printf() relatively happily
under a much higher SIGALRM rate, then
why does a rather low SIGALRM rate still
cause pthread_cancel() to stall indefinitely?
It's not like anything else stalls.
In fact, I discovered that effect in a
real program of mine, which works perfectly
(and is used by people) under the exact
SIGALRM rate that causes the full stall of
pthread_cancel().
So how is that possible w/o a bug?

Comment 9 Andreas Schwab 2023-01-25 10:25:01 UTC

Telling the debugger not to forward the signal does not change the overhead of signal delivery through the debugger.  You are still stuck in the shlib event.  The only way to prevent the overhead of the shlib event is to make sure libgcc_s is already loaded by the time pthread_cancel is called.
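
A minimal sketch of one way to get libgcc_s loaded early, assuming a glibc
system where the unwinder's soname is libgcc_s.so.1: call a helper such as
the hypothetical preload_unwinder() below at the start of main(), before the
timer is armed and before any pthread_cancel(). The helper name and the
soname are assumptions, and older glibc versions may need -ldl at link time.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

/* Load the unwinder library up front, so that the first pthread_cancel()
 * finds it already mapped instead of loading it while signals are flowing.
 * "libgcc_s.so.1" is an assumed soname; adjust it for the target system.
 * Returns 0 on success, -1 if the library could not be loaded. */
int preload_unwinder(void)
{
    void *handle = dlopen("libgcc_s.so.1", RTLD_NOW | RTLD_GLOBAL);

    if (handle == NULL) {
        fprintf(stderr, "preload_unwinder: %s\n", dlerror());
        return -1;
    }
    /* The handle is intentionally never dlclose()d: the library has to
     * stay mapped for the lifetime of the process. */
    return 0;
}
```

This has not been verified against the hang reported here; it only
illustrates the "already loaded" condition described above.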

Comment 10 Stas Sergeev 2023-01-25 10:29:09 UTC

Created attachment 14624
test case

Here's the updated test case, which
shows that both threads are alive
and kicking before pthread_cancel().
After pthread_cancel(), either both
are stuck forever, or until the second
thread shuts down the timer.

> You are still stuck in the shlib event.

But could you please explain in a bit
more detail? If both threads could
progress, then why can't the "shlib event"?
How is it different from the prints
that I have now inserted into the test
to make sure SIGALRM doesn't hog the CPU?

> libgcc_s is already loaded by the time pthread_cancel is called.

Wow! Then nothing would stall?
How can I do that?

Comment 11 Andreas Schwab 2023-01-25 10:42:43 UTC

Only one thread progresses.  The other is stuck in the shlib event.

Comment 12 Stas Sergeev 2023-01-25 10:57:32 UTC

(In reply to Andreas Schwab from comment #11)
> Only one thread progresses.  The other is stuck in the shlib event.

So you mean gdb can't handle the shlib
event because of the SIGALRMs?
So is it a gdb bug that it doesn't
hold off signals while handling the
shlib event?
You mentioned gdb from the very
beginning, but only now am I starting
to understand what you mean by
"shlib event".

Comment 13 Stas Sergeev 2023-01-25 12:47:39 UTC

So if I understand you correctly,
this should be re-assigned to gdb.