Seems like `print_ubacktrace_fileline` is broken on Ubuntu 20 x86_64 (kernel Linux ubuntu20-pkg 5.4.0-24-generic #28-Ubuntu SMP Thu Apr 9 22:16:42 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux).

It can be reproduced with the following minimal example using the latest systemtap git master (commit 9102da049a).

First prepare the target C program:

```C
int main(void) {
    return 0;
}
```

Compile it to generate executable file `./a.out`:

```
gcc -g a.c
```

Prepare the following simple .stp file:

```
function foo() {
    print_ubacktrace_fileline();
}

probe process.function("main") {
    foo();
    exit();
}
```

And run both:

```
$ stap a.stp -c ./a.out
WARNING: Kernel function symbol table missing [man warning::symbols]
 0x55cfc28cd131 : main+0x8/0x17 [/home/agentzh/git/systemtap-plus/a.out]
 0x7f59d5d7d0b3 [/usr/lib/x86_64-linux-gnu/libc-2.31.so+0x270b3/0x1f2000]
WARNING: Missing unwind data for a module, rerun with 'stap -d /usr/lib/x86_64-linux-gnu/libc-2.31.so'
```

No file name or line numbers in the output. (The warnings can be safely ignored since they are not relevant here.)

For comparison, on CentOS 6, for example, it works fine using the same version of stap:

```
$ stap a.stp -c ./a.out
 0x4004b6 : main+0x4/0xe at /home/agentzh/git/systemtap/a.c:2 [/home/agentzh/git/systemtap/a.out]
 0x7fe4d3140d20 [/lib64/libc-2.12.so+0x1ed20/0x394000]
WARNING: Missing unwind data for a module, rerun with 'stap -d /lib64/libc-2.12.so'
```
When trying this on x86_64 RHEL8, RHEL9, and Fedora 39 with a current git checkout of systemtap (023ec371b183c1), the output included the file and line number:

    $ sudo ../install/bin/stap a.stp -c /home/wcohen/systemtap_write/systemtap/a.out
     0x40110a : main+0x4/0xe at /home/wcohen/systemtap_write/systemtap/a.c:2 [/home/wcohen/systemtap_write/systemtap/a.out]
     0x7f26492ee14a [/usr/lib64/libc.so.6+0x2814a/0x1e2000]
    WARNING: Missing unwind data for a module, rerun with 'stap -d /usr/lib64/libc.so.6'

However, on x86_64 Ubuntu 20.04 I do see the reported issue with the same git checkout of systemtap:

    $ sudo ../install/bin/stap a.stp -c /home/william/systemtap_write/systemtap/a.out
     0x55facda6a131 : main+0x8/0x17 [/home/william/systemtap_write/systemtap/a.out]
     0x7f3804b72083 [/usr/lib/x86_64-linux-gnu/libc-2.31.so+0x24083/0x1f2000]
    WARNING: Missing unwind data for a module, rerun with 'stap -d /usr/lib/x86_64-linux-gnu/libc-2.31.so'

RHEL8 uses DWARF 4, while RHEL9 and Fedora 39 use DWARF 5. Ubuntu 20.04 uses DWARF 4.

I was able to run the Ubuntu-generated a.out on f39. The output is missing the expected filename and line number. Using "llvm-objdump --line-numbers <object_file>" on f39, it appears that llvm-objdump can find the line numbers.

f39 generated binary:

    0000000000401106 <main>:
    ; main():
    ; /home/wcohen/systemtap_write/systemtap/a.c:1
      401106: 55                        pushq %rbp
      401107: 48 89 e5                  movq %rsp, %rbp
    ; /home/wcohen/systemtap_write/systemtap/a.c:2
      40110a: b8 00 00 00 00            movl $0x0, %eax
    ; /home/wcohen/systemtap_write/systemtap/a.c:3
      40110f: 5d                        popq %rbp
      401110: c3                        retq

Ubuntu 20.04 generated binary:

    0000000000001129 <main>:
    ; main():
    ; /home/william/systemtap_write/systemtap/a.c:1
        1129: f3 0f 1e fa               endbr64
        112d: 55                        pushq %rbp
        112e: 48 89 e5                  movq %rsp, %rbp
    ; /home/william/systemtap_write/systemtap/a.c:2
        1131: b8 00 00 00 00            movl $0x0, %eax
    ; /home/william/systemtap_write/systemtap/a.c:3
        1136: 5d                        popq %rbp
        1137: c3                        retq
        1138: 0f 1f 84 00 00 00 00 00   nopl (%rax,%rax)
I think I can reproduce this, have a stap workaround, and have a theory why this happens.

I've been creating my test binaries on two systems. One is "Ubuntu 22.04.3 LTS (Jammy Jellyfish) / gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0", the other is "Fedora Linux 40 (Rawhide Prerelease) / gcc (GCC) 13.2.1 20231011 (Red Hat 13.2.1-4)". I've created the test binaries exactly as the reporter shows in Comment #0. Apparently print_ubacktrace_fileline() fails with the Ubuntu-built binary and works fine with the Fedora binary.

I've found that the following stap runtime update tricks systemtap into printing the expected output with the Ubuntu binary:

    $ git diff
    diff --git a/runtime/sym.c b/runtime/sym.c
    index 595871bc6..6db174304 100644
    --- a/runtime/sym.c
    +++ b/runtime/sym.c
    @@ -634,7 +634,7 @@ unsigned long _stp_linenumber_lookup(unsigned long addr, struct task_struct *tas
               if (commit_row) {
                 // compare the whole range from the prior committed row
                 // (except an end_sequence can't be the base)
    -            if (row_end_sequence == 0 && row_addr <= addr && addr < curr_addr)
    +            if (row_end_sequence == 0 && row_addr <= addr)
                   {
                     if (need_filename)
                       {
    $

This means that, in the case of the Ubuntu binary, either `curr_addr` is too low or `addr` is too high. Adding some debug prints to the stap runtime and comparing the `curr_addr` values against objdump of the said binaries suggests that the `curr_addr` values look rather sane. What looks suspicious to me is how high `addr` is with the Ubuntu binary.

A side-by-side comparison of `readelf -a` for the said binaries shows that they are of different types:

    [root@rawh build]# readelf -a a.out.fedora | fgrep Type | head -1
      Type:                              EXEC (Executable file)
    [root@rawh build]# readelf -a a.out.ubuntu | fgrep Type | head -1
      Type:                              DYN (Position-Independent Executable file)
    [root@rawh build]#

I've confirmed that adding -fPIE -pie to the GCC command line on Fedora creates a binary that print_ubacktrace_fileline() fails to work with.
Given that, I reasonably suspect that PIE is the root cause.
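To illustrate the suspicion, here is a minimal sketch (not the systemtap code and not the attached patch) of why the `row_addr <= addr && addr < curr_addr` test can never match for a PIE binary when `addr` is a runtime address. The row addresses are taken from the llvm-objdump listing of the Ubuntu binary; the end-of-sequence address 0x1138, the load base 0x55facda69000 (inferred as runtime 0x55facda6a131 minus link-time 0x1131), and the names `line_row`, `lookup_line` and `to_link_time` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One row of a simplified line table; addresses are link-time values. */
struct line_row {
    uint64_t addr;
    unsigned line;
    int end_sequence;
};

/* Return the source line whose row range contains addr, or 0 if none does.
 * Uses the same range test as the quoted runtime code:
 * row_addr <= addr && addr < curr_addr, with end_sequence rows excluded
 * as a base. */
unsigned lookup_line(const struct line_row *rows, size_t n, uint64_t addr)
{
    assert(rows != NULL);
    for (size_t i = 0; i + 1 < n; i++) {
        if (!rows[i].end_sequence &&
            rows[i].addr <= addr && addr < rows[i + 1].addr)
            return rows[i].line;
    }
    return 0;
}

/* Hypothetical helper: map a runtime address back to a link-time address
 * by subtracting the address at which the PIE executable was loaded. */
uint64_t to_link_time(uint64_t runtime_addr, uint64_t load_base)
{
    return runtime_addr - load_base;
}

/* Rows from the llvm-objdump listing of the Ubuntu binary. */
const struct line_row ubuntu_rows[] = {
    { 0x1129, 1, 0 },
    { 0x1131, 2, 0 },
    { 0x1136, 3, 0 },
    { 0x1138, 0, 1 },  /* assumed end of sequence */
};
```

With these rows, looking up the raw runtime address 0x55facda6a131 finds nothing, because it is above every `curr_addr`. Looking up the same address after subtracting the assumed load base lands in the row for a.c:2, which matches the output seen for the non-PIE binary.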
Created attachment 15302
possible patch (unpolished)

I've been looking at commits related to PIE, and based on one of them I've put together an update that seems to help.
Interestingly, the Ubuntu-generated binary is PIE although the producer record in its DWARF doesn't show anything related to it:

    # eu-readelf --debug-dump=info a.out.fedora | fgrep -i produ
               producer             (strp) "GNU C17 13.2.1 20231011 (Red Hat 13.2.1-4) -mtune=generic -march=x86-64 -g"
    [root@rawh build]# eu-readelf --debug-dump=info a.out.ubuntu | fgrep -i produ
               producer             (strp) "GNU C17 11.4.0 -mtune=generic -march=x86-64 -g -fasynchronous-unwind-tables -fstack-protector-strong -fstack-clash-protection -fcf-protection"
    #

However, looking at the `gcc -E -v a.c` output, it looks like the Ubuntu GCC was configured with --enable-default-pie, while the Fedora GCC doesn't seem to have that. The fact that this doesn't propagate to the producer record seems a bit confusing.
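Since the producer string is not a reliable indicator, the ELF header is the more direct place to look. The following is a minimal sketch, under the assumption that only the first 18 bytes of the file are available in memory, of reading `e_type` the way `readelf` reports it. The names `elf_e_type` and `elf_may_need_load_bias` are hypothetical; the constants 2 (ET_EXEC) and 3 (ET_DYN) and the field offsets come from the ELF specification. Note that ET_DYN also covers ordinary shared libraries, so this check alone does not prove a file is a PIE executable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* e_type values from the ELF specification. */
enum { ELF_ET_EXEC = 2, ELF_ET_DYN = 3 };

/* Return e_type from an in-memory ELF header, or -1 if buf does not start
 * with a usable ELF header. e_type is the 16-bit field at offset 16,
 * directly after the 16-byte e_ident array. */
int elf_e_type(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    if (len < 18 ||
        buf[0] != 0x7f || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F')
        return -1;
    /* e_ident[EI_DATA] is byte 5: 1 = little-endian, 2 = big-endian. */
    if (buf[5] == 1)
        return buf[16] | (buf[17] << 8);
    if (buf[5] == 2)
        return (buf[16] << 8) | buf[17];
    return -1;
}

/* Hypothetical predicate: ET_DYN objects are loaded at an address chosen at
 * run time, so their runtime addresses differ from the link-time addresses
 * stored in the line table. */
int elf_may_need_load_bias(const unsigned char *buf, size_t len)
{
    return elf_e_type(buf, len) == ELF_ET_DYN;
}
```

For the two binaries discussed here, this would be expected to report ET_EXEC for a.out.fedora and ET_DYN for a.out.ubuntu, in line with the `readelf -a` output in the earlier comment.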
Created attachment 15303
possible patch
The patch looks sane. One thing that crossed my mind was that _stp_umodule_relocate is being called for both userspace and kernel code. Has there been some testing that this patch doesn't break things for kernel backtraces? Unfortunately, the likely test for checking whether things are working for kernels, testsuite/systemtap.context/context.exp, is broken on newer machines.
Right, call the kernel variant of the function if !user (task==0).
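The dispatch described above can be sketched as follows. This is only an illustration of the idea (a null task means a kernel address, so the userspace relocation must not be applied), not the code from the attachment. `struct task`, `KERNEL_BASE`, `relocate_user`, `relocate_kernel` and `relocate` are all hypothetical stand-ins; the real runtime works with `struct task_struct` and its own relocation helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct task_struct; only what the sketch needs. */
struct task {
    unsigned long load_base;   /* where the PIE executable was mapped */
};

/* Hypothetical kernel base address, for illustration only. */
enum { KERNEL_BASE = 0x1000000 };

/* Userspace variant: relocate relative to the task's mapping. */
unsigned long relocate_user(unsigned long offset, const struct task *tsk)
{
    assert(tsk != NULL);
    return tsk->load_base + offset;
}

/* Kernel variant: no task is involved. */
unsigned long relocate_kernel(unsigned long offset)
{
    return (unsigned long)KERNEL_BASE + offset;
}

/* Pick the variant from the task pointer: NULL (task == 0) means kernel. */
unsigned long relocate(unsigned long offset, const struct task *tsk)
{
    return tsk ? relocate_user(offset, tsk) : relocate_kernel(offset);
}
```

Keeping the decision in one place means a kernel backtrace can never pick up a userspace load address by accident, which is the concern raised in the review comment.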
Created attachment 15311
partial fix of the context.exp testcase

Fix build time problems in systemtap_test_module1.c. There are more problems in this context.exp testcase, looking into it...
Created attachment 15313
partial fix of the context.exp testcase

Get closer to a working context.exp. The subtests `backtrace` and `symfileline` that are still broken seem to show real problems. Looking into it...
Commits 3e003d693f1498e2ddb675b9b409373a8126045e and bf56ee72a9c78a9e084046c76912a383707ab298 improve context.exp to a state where args.tcl, backtrace.tcl, num_args.tcl and pid.tcl are now passing. The remaining symfileline.tcl shows a valid problem with reading a file name from .debug_line_str. I will address that in a separate bug. IOW this is where we are now:

    Running /root/systemtap/testsuite/systemtap.context/context.exp ...
    FAIL: symfileline ()
    FAIL: symfile ()
    FAIL: symfileline in pp ()

                    === systemtap Summary ===

    # of expected passes            32
    # of unexpected failures        3
Created attachment 15328
possible patch

(In reply to William Cohen from comment #6)
> The patch looks sane. One thing that crossed my mind was that
> _stp_umodule_relocate is being called for both userspace and kernel code.

Right, the relocation should only apply to userspace binaries! I've rearranged the patch slightly to achieve this, and verified that it still works with the Ubuntu PIE binary. I've also tested that it works with a -m32 PIE binary coming from Ubuntu. So far so good.

> Has there been some testing that this patch doesn't break things for kernel
> backtraces? Unfortunately, the likely test for checking whether things are
> working kernels is testsuite/systemtap.context/context.exp is broken on
> newer machines.

I'm almost done fixing the context.exp testcase (and the respective systemtap bits). The remaining symfileline.tcl subtest shows a suspected problem in the _stp_filename_lookup_5() function. It seems that in the case of a kernel module (KO) it's using (slightly) wrong offsets into the .debug_line_str section, and then outputs broken file and directory names. The sections themselves don't seem to need additional relocation: after hardcoding offsets from eu-readelf, things start working (!).

In summary, on Rawhide the KOs are using DWARF5, and the structure of the .debug_line section there seems slightly different compared to what _stp_filename_lookup_5() can handle now (there are two line number tables instead of the expected one, for instance). Looking into the KO DWARF byte-by-byte...
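For reference, a .debug_line section may hold several line number programs back to back, each starting with its own `unit_length` field, so a reader that assumes a single table will compute offsets from the wrong header. Below is a minimal sketch of walking those units. It is not the _stp_filename_lookup_5() code; it assumes 32-bit DWARF and little-endian data (as on x86_64), and the names `read_u32le` and `count_line_units` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian value from p. */
uint32_t read_u32le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Count the line number program units in a .debug_line section image.
 * Each unit begins with a 4-byte unit_length that gives the size of the
 * rest of that unit, so the next unit starts at offset + 4 + unit_length.
 * Returns -1 for truncated data or for 64-bit DWARF / reserved length
 * values (0xfffffff0 and above), which this sketch does not handle. */
int count_line_units(const unsigned char *sec, size_t len)
{
    size_t off = 0;
    int units = 0;

    assert(sec != NULL);
    while (off < len) {
        uint32_t unit_length;

        if (len - off < 4)
            return -1;                      /* no room for unit_length */
        unit_length = read_u32le(sec + off);
        if (unit_length >= 0xfffffff0u)
            return -1;                      /* 64-bit DWARF or reserved */
        if (unit_length > len - off - 4)
            return -1;                      /* unit runs past the section */
        off += 4 + (size_t)unit_length;
        units++;
    }
    return units;
}
```

A lookup that needs the header of the second table would apply the same stepping to find where that unit starts, rather than assuming the header sits at offset 0 of the section.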
Fixed the said KO processing "mystery" in commit 011af964e27889c (sigh...). With this, the context.exp is finally happy:

    kernel location: /usr/lib/debug/lib/modules/6.7.0-0.rc0.20231104git90b0c2b2edd1.7.fc40.x86_64/vmlinux
    kernel version: 6.7.0-0.rc0.20231104git90b0c2b2edd1.7.fc40.x86_64
    systemtap location: /usr/local/bin/stap
    systemtap version: version 5.1/0.189, release-5.0a-60-g2c7b106c-dirty
    gcc location: /usr/bin/gcc
    gcc version: gcc (GCC) 13.2.1 20231011 (Red Hat 13.2.1-4)

    Running /root/systemtap/testsuite/systemtap.context/context.exp ...

                    === systemtap Summary ===

    # of expected passes            35
Fixed in commit d1ea490253710dc4d59e86ce5ba8ac7d3e7c537c.