Bug 23616 - Unknown signal error from gdb when stepping over musl libc function call using SIGSYNCCALL
Status: NEW
Alias: None
Product: gdb
Classification: Unclassified
Component: gdb
Version: 8.0.1
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-09-08 12:29 UTC by Shahar Valiano
Modified: 2020-12-07 20:45 UTC

See Also:
Host:
Target:
Build:
Last reconfirmed: 2020-12-07 00:00:00


Description Shahar Valiano 2018-09-08 12:29:04 UTC
In applications linked against musl libc, stepping over a call to the 'setrlimit' library function results in an error from gdb:

    Thread 1 "myprog" received signal ?, Unknown signal.
    __cp_end () at src/thread/x86_64/syscall_cp.s:29

Then, it's not possible to continue the debug session beyond that point.

This issue was seen on Alpine Linux, but may be applicable to apps linked against musl libc on other Linux distros as well. Specifically, it is seen when debugging certain versions of OpenJDK and CoreCLR (links 1, 2) (in which the issue is especially troubling since setrlimit is called during VM startup).

The issue stems from the musl libc implementation of setrlimit (link 3). It updates threads in a synchronized manner by calling __synccall (link 4), which signals the threads with a SIGSYNCCALL signal:

    r = -__syscall(SYS_tgkill, pid, tid, SIGSYNCCALL);

SIGSYNCCALL is internal to musl and doesn't seem to be recognized by gdb. When stepping over this code line, gdb intercepts the SIGSYNCCALL signal and reports the "Unknown signal" error.
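
A smaller reproducer than a full JVM might look like the following sketch. It assumes, per the description above, that musl's setrlimit goes through __synccall when the process has more than one thread, so it keeps a second thread alive during the call. The names repro_setrlimit and idle_worker are hypothetical and not taken from any of the linked projects; build with -pthread.

```c
#include <pthread.h>
#include <sys/resource.h>
#include <unistd.h>

/* Worker that blocks on a pipe so the process stays multithreaded
   while the main thread calls setrlimit. */
static void *idle_worker(void *arg)
{
    int *read_fd = arg;
    char byte;

    /* Returns once the main thread writes a byte or closes the pipe. */
    if (read(*read_fd, &byte, 1) < 0)
        return NULL;
    return NULL;
}

/* Hypothetical helper: re-applies the current RLIMIT_NOFILE limits.
   Returns 0 on success, -1 on failure. */
int repro_setrlimit(void)
{
    int fds[2];
    pthread_t tid;
    struct rlimit lim;
    int rc = -1;

    if (pipe(fds) != 0)
        return -1;
    if (pthread_create(&tid, NULL, idle_worker, &fds[0]) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* Setting the limits to their current values changes nothing,
       but still exercises the setrlimit code path. */
    if (getrlimit(RLIMIT_NOFILE, &lim) == 0)
        rc = setrlimit(RLIMIT_NOFILE, &lim);   /* step over this line in gdb */

    /* Release the worker and clean up. */
    if (write(fds[1], "x", 1) != 1)
        rc = -1;
    close(fds[1]);
    pthread_join(tid, NULL);
    close(fds[0]);
    return rc;
}
```

Calling repro_setrlimit() from main and stepping over the setrlimit line with "next" should, if the assumption holds, show the same "Unknown signal" stop on a musl-based system. On glibc the function simply returns 0.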

This signal is defined as follows in musl pthread_impl.h (link 5), along with two other signal types:

    #define SIGTIMER 32
    #define SIGCANCEL 33
    #define SIGSYNCCALL 34

Adding support for these signal types in gdb, or at least avoiding the error above, would enable better debugging of OpenJDK and other apps on Alpine Linux.

Tested with:
- Alpine Linux V3.8
- OpenJDK (openjdk7 / openjdk8 package, but any OpenJDK version should be applicable)
- gdb versions 8.0.1-r6, 8.0.1-r3, 7.12.1-r1.

Reproduction:
- gdb <path/to/java>
- r -version

Links:

[1] https://github.com/dotnet/coreclr/issues/7487
[2] https://stackoverflow.com/questions/52119176/gdb-debugging-openjdk-java-on-alpine-linux-fails-with-thread-recieved-signal
[3] http://git.musl-libc.org/cgit/musl/tree/src/misc/setrlimit.c
[4] http://git.musl-libc.org/cgit/musl/tree/src/thread/synccall.c
[5] http://git.musl-libc.org/cgit/musl/tree/src/internal/pthread_impl.h
Comment 1 Shahar Valiano 2019-04-11 13:03:03 UTC
I've implemented a quick and dirty patch for musl signals, for reference:

https://github.com/shaharv/binutils-gdb/commit/0ca9c66889bdc9558622a92f96a86552fa701924

The trouble is, the three mentioned musl signals are internal and not defined in the user-facing <signal.h>, so I had to define them locally. There's also no __MUSL__ define, so another difficulty is ifdef'ing these definitions so that they apply to musl builds only.
Comment 2 Rich Felker 2019-10-15 19:48:25 UTC
Rather than hard-coding implementation internals (which will change; SIGTIMER is slated to be removed at some point and the others moved to free up a slot), a clean patch should just handle "unknown" signals in some safe way (i.e. not get stuck on them). Do you understand the mechanism of how this problem is even happening? Presumably it's a weird mixup between host and target -- it should be possible to debug local glibc-linked inferiors with a musl-hosted gdb, or vice versa, so I don't understand how ideas of the semantics of implementation-internal signals are coming into this.
Comment 3 Rich Felker 2019-10-17 02:08:19 UTC
Reportedly this fixes the issue. I don't entirely understand the mechanism, but it seems plausible that this is closer to correct.

https://raw.githubusercontent.com/smaeul/portage/7d7bdaa41e30a2ac8c1ae38d21c8d55126c1b078/patches/sys-devel/gdb/gdb-7.12-signals.patch
Comment 4 Simon Marchi 2020-12-07 20:45:55 UTC
From what I understand, GDB (when debugging natively; substitute GDBserver for the remote case, but I'll just say GDB for simplicity) assumes that it sees the same signal numbers as the program it debugs.  For all I know this is always true, since it runs on the same kernel as the program it debugs.  If we couldn't rely on that, i.e. if signal numbers in the inferior could differ from the signal numbers as seen by GDB, then I suppose we would need to bake into GDB the knowledge of all signal numbers for all platforms.  Or there would need to be some sort of debug API for GDB to ask "what does this signal mean in the context of this process?"

This real-time signal issue looks like a bug in GDB.  Basically, gdbsupport/signals.cc wants to know the min and max real-time signal numbers provided by the OS, not the min/max exposed by the libc.  It doesn't care that 2 or 3 real-time signals are reserved by the libc.  This is why it tries to use __SIGRTMIN if possible: on glibc, that passes through the SIGRTMIN value exposed by the kernel.  Since musl doesn't have __SIGRTMIN, GDB falls back on SIGRTMIN (35 with musl), so it thinks that the lowest real-time signal number offered by the OS is 35, which is wrong.  GDB should see 32 here.  Falling back on SIGRTMIN may be OK on systems where the libc doesn't reserve any signal numbers for itself, but that's not right here.
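
To make the gap concrete, here is a rough sketch of how a signal number falls into one of three ranges depending on which minimum is used. The value 32 for the kernel-level minimum is taken from the paragraph above; the enum and function names are hypothetical and are not GDB's.

```c
#include <signal.h>

/* Assumption: lowest real-time signal number provided by the Linux
   kernel, per the comment above ("GDB should see 32 here"). */
enum { KERNEL_RTMIN = 32 };

enum signal_class {
    CLASS_STANDARD,       /* below the kernel's real-time range */
    CLASS_LIBC_RESERVED,  /* real-time for the kernel, hidden by the libc */
    CLASS_REALTIME        /* real-time and visible to applications */
};

/* Classify signo given the SIGRTMIN value that the libc exposes. */
enum signal_class classify_signal(int signo, int libc_rtmin)
{
    if (signo < KERNEL_RTMIN)
        return CLASS_STANDARD;
    if (signo < libc_rtmin)
        return CLASS_LIBC_RESERVED;
    return CLASS_REALTIME;
}
```

With musl's SIGRTMIN of 35, classify_signal(34, 35) yields CLASS_LIBC_RESERVED. That is the range a tool misses if it treats the libc's SIGRTMIN as the bottom of the real-time range.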

There is already the intention of providing __SIGRTMIN when compiling on Linux with a libc that doesn't provide __SIGRTMIN:

https://github.com/bminor/binutils-gdb/blob/13f11b0b61ca2620611b08eeaece0ce62c862f4b/gdb/nat/linux-nat.h#L29-L32

But this is currently just used to make the gdb/linux-nat.c code (which uses __SIGRTMIN) build on those platforms.  gdb/nat/linux-nat.h isn't included by gdbsupport/signals.cc.
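
For illustration only, a fallback of that general shape could be sketched as below. This is not the code at the link above: the macro HOST_KERNEL_SIGRTMIN and both helper functions are hypothetical, and the fallback value 32 is an assumption based on the number quoted earlier in this comment.

```c
#include <signal.h>

/* Prefer the libc's pass-through of the kernel value when it exists
   (glibc defines __SIGRTMIN); otherwise assume the Linux value of 32. */
#ifdef __SIGRTMIN
#define HOST_KERNEL_SIGRTMIN __SIGRTMIN
#else
#define HOST_KERNEL_SIGRTMIN 32   /* assumption, see lead-in */
#endif

/* Lowest real-time signal number as the kernel sees it. */
int host_kernel_sigrtmin(void)
{
    return HOST_KERNEL_SIGRTMIN;
}

/* Number of real-time signals the libc keeps for its own use. */
int libc_reserved_rt_signals(void)
{
    return SIGRTMIN - HOST_KERNEL_SIGRTMIN;
}
```

If the assumed value is right, libc_reserved_rt_signals() would report 3 on musl, matching the three internal signals listed in the description.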

So, clearly, there is something to fix in GDB so that gdbsupport/signals.cc sees the SIGRTMIN value as provided by the kernel.  But I can't think of a nice solution right now.