Summary: | pthread_cancel() hangs under gdb on aarch64 | ||
---|---|---|---|
Product: | gdb | Reporter: | Stas Sergeev <stsp> |
Component: | shlibs | Assignee: | Not yet assigned to anyone <unassigned> |
Status: | UNCONFIRMED --- | ||
Severity: | normal | CC: | drepper.fsp |
Priority: | P2 | ||
Version: | 12.1 | ||
Target Milestone: | --- | ||
Host: | Target: | ||
Build: | Last reconfirmed: | ||
Attachments: | test case (14612), test case (14613), test case (14624) |
I cannot reproduce that, with or without gdb.

(In reply to Andreas Schwab from comment #1)
> I cannot reproduce that, with or without gdb.

Are you running under qemu? I use the kinetic-server-cloudimg-arm64.img Ubuntu image with all updates, and "-cpu cortex-a57 -M virt". I'm not sure what else would be helpful; maybe you want ssh access to my VM?

gdb is 12.1-3ubuntu2. What's yours? If it's a gdb problem, then we first need to sync up the gdb version.

I have tested it on real hardware.

(In reply to Andreas Schwab from comment #4)
> I have tested it on real hardware.

OK, it seems you then need to raise the SIGALRM frequency. Please change line 35 and replace the value 4000 with e.g. 500. That way it even hangs without gdb, but the behavior seems more random, i.e. now it also hangs in pthread_join().

Created attachment 14613 [details]
test case
So I lowered the tick interval in
the hope of reproducing this on real
hardware. I can't promise it is enough,
though; you may need to lower it even
more. That frequency allows the repro
without gdb, which is already better.
If anything this is a bug in the debugger. There are two concurrent types of events, the signal and the shlib events, and the constant flow of signal events prevents the shlib event from making forward progress.

(In reply to Andreas Schwab from comment #7)
> If anything this is a bug in the debugger. There are two concurrent types
> of events, the signal and the shlib events, and the constant flow of signal
> events prevents the shlib event from making forward progress.

Yes, this seems to be the case. I modified the test so that the second thread disables the timer after some time. If pthread_cancel() was hanging, it becomes unstuck. If pthread_join() was hanging, it stays stuck, because the second thread has already terminated and so the timer shutdown never happens.

But I thought I had excluded such a possibility in at least 2 ways:
- attaching with gdb and doing "handle SIGALRM nopass"
- lowering the SIGALRM rate and making sure both threads can execute code and print things.

So I still don't understand what's going on. If both threads could sleep() and printf() relatively happily under a much higher SIGALRM rate, then why does a rather small SIGALRM rate still cause pthread_cancel() to stall indefinitely? It's not as if anything else stalls. In fact, I discovered this effect in a real program of mine, which works perfectly (and is used by people) under the exact SIGALRM rate that causes the full stall of pthread_cancel(). So how is that possible without a bug?

Telling the debugger not to forward the signal does not change the overhead of signal delivery through the debugger. You are still stuck in the shlib event. The only way to prevent the overhead of the shlib event is to make sure libgcc_s is already loaded by the time pthread_cancel is called.

Created attachment 14624 [details]
test case

Here's the updated test case, which shows that both threads are alive and kicking before pthread_cancel(). After pthread_cancel(), either both are stuck forever, or they are stuck until the second thread shuts down the timer.

> You are still stuck in the shlib event.

But could you please explain this in a bit more detail? If both threads could progress, then why can't the "shlib event"? How is it different from the prints that I have now inserted into the test to make sure SIGALRM doesn't hog the CPU?

> libgcc_s is already loaded by the time pthread_cancel is called.

Wow! Then nothing would stall? How can I do that?

Only one thread progresses. The other is stuck in the shlib event.

(In reply to Andreas Schwab from comment #11)
> Only one thread progresses. The other is stuck in the shlib event.

So you mean gdb can't handle the shlib event because of the SIGALRMs? So is it a gdb bug that it doesn't stop signals while performing the shlib event? You mentioned gdb from the very beginning, but only now am I starting to understand which "shlib event" you mean. So if I understand you correctly, this should be reassigned to gdb.
Created attachment 14612 [details]
test case

Under qemu's aarch64, please do the following:

$ gcc -Wall -ggdb3 tcanc.c
$ ./a.out
1
2
3
Stopping
4
OK

So far so good. Now:

$ gdb ./a.out
r
1
2
3
Stopping
4
5
6
[ counting continues infinitely - main thread stuck in pthread_cancel() ]
[ let's disable SIGALRM just to make sure the hang is permanent ]
^C
Thread 1 "a.out" received signal SIGINT, Interrupt.
__GI__dl_debug_state () at ./elf/dl-debug.c:117
117 ./elf/dl-debug.c: No such file or directory.
(gdb) handle SIGALRM nopass
Signal Stop Print Pass to program Description
SIGALRM No No No Alarm clock
(gdb) c
Continuing.
^C
Thread 1 "a.out" received signal SIGINT, Interrupt.
__GI__dl_debug_state () at ./elf/dl-debug.c:117
117 in ./elf/dl-debug.c
(gdb)
[ yes, the hang is permanent, it won't advance even without SIGALRM ]

This hang doesn't depend on the SIGALRM rate, i.e. SIGALRM doesn't drain CPU power; the rate in the test case is actually rather low. But SIGALRM is a necessary "ingredient", i.e. without SIGALRM the hang is not reproducible. The stack trace points to some dlopen/unwind games, so I suspect it's a glibc bug. But if not, maybe it's a gdb bug?