The following test case fails:
$ ./tests/backtrace-dwarf
0x3ffbd840622 raise
0x3ffbd823ce2 abort
./tests/backtrace-dwarf: dwfl_thread_getframes: no error
Fortunately I have access to an s390x machine, so I can help with debugging. The binary is built with GCC 8.1.1.
Note that we have an s390x Fedora buildbot worker that also uses GCC 8.1.1: https://builder.wildebeest.org/buildbot/#/workers/5 That one is green. So I suspect the cause is either a different binutils or glibc (the above buildbot worker has glibc 2.27 and binutils 2.29.1) or different build flags/CFLAGS/defaults.
$ ld --version
GNU ld (GNU Binutils; openSUSE:Factory:zSystems) 2.31
$ /lib64/libc.so.6
GNU C Library (GNU libc) stable release version 2.28 (git 3c03baca37fd).
It does seem to work correctly on Fedora 29 with gcc 8.2, binutils 2.31 and glibc 2.28: https://kojipkgs.fedoraproject.org//packages/elfutils/0.174/1.fc29/data/logs/s390x/build.log
PASS: run-backtrace-dwarf.sh
So it is probably some difference in default/build flags.
Created attachment 11257: openSUSE build log
I'm attaching my build log. In general, I guess the following flags are used: -std=gnu99 -Wall -Wshadow -Wformat=2 -Wold-style-definition -Wstrict-prototypes -Wlogical-op -Wduplicated-cond -Wnull-dereference -Wimplicit-fallthrough=5 -Werror -Wunused -Wextra -Wstack-usage=262144 -fPIC -O2 -g -m64 -fmessage-length=0 -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables -fasynchronous-unwind-tables -g
We reviewed this on irc and came to the surprising conclusion that this was caused by ptrace TRACEME failing with EPERM. That is really odd. But not a bug in elfutils IMHO.
I've just played with that and found that I made a mistake: one can't use PTRACE_TRACEME while running the executable under gdb, because the process is then already being traced. That is what caused the EPERM errno. So the issue is still valid in my opinion.
Note that it's not specific to 0.174. I can see it in 0.173 as well, so, as Mark mentioned, it depends on glibc, binutils, etc.
If a process is not being traced and PTRACE_TRACEME fails with EPERM, then it must be a kernel issue.
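For reference, EPERM is what PTRACE_TRACEME returns to a process that already has a tracer, which matches the gdb case described above. Below is a minimal sketch of that behaviour, not taken from the elfutils tests. The helper name second_traceme_errno is made up for illustration, and the sketch assumes Linux, where the first PTRACE_TRACEME in a freshly forked child normally succeeds unless a security policy forbids it.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of elfutils).  Forks a child that calls
   PTRACE_TRACEME twice.  Returns the errno of the second call, 0 if the
   second call unexpectedly succeeded, or -1 if the probe could not run
   (fork or waitpid failed, or even the first call was refused).  */
int
second_traceme_errno (void)
{
  pid_t pid = fork ();
  if (pid == -1)
    return -1;

  if (pid == 0)
    {
      /* First request: the parent becomes our tracer.  */
      if (ptrace (PTRACE_TRACEME, 0, NULL, NULL) == -1)
        _exit (255);
      /* Second request: we are already traced, so this should fail.  */
      errno = 0;
      if (ptrace (PTRACE_TRACEME, 0, NULL, NULL) == -1)
        _exit (errno & 0x7f);
      _exit (0);
    }

  /* The child only exits; it never stops, so one waitpid is enough.  */
  int status;
  if (waitpid (pid, &status, 0) != pid || !WIFEXITED (status))
    return -1;
  int code = WEXITSTATUS (status);
  return code == 255 ? -1 : code;
}
```

If the helper returns EPERM, the "already traced" explanation applies. A return of -1 means the first request was refused (or the probe itself failed), which would point at the environment rather than at the test.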
Hm, on x86_64 (on trunk) I see all tests OK, but:
$ ./backtrace-dwarf
backtrace-dwarf: backtrace-dwarf.c:146: main: Assertion `errno == 0' failed.
0x7ffff7a4f08b raise
0x7ffff7a384e9 abort
0x7ffff7a383c1 __assert_fail_base.cold.0
0x7ffff7a476f2 __assert_fail
0x40135a main
which should not happen. On my machine I see errno == 2. I would expect the test to fail with:
diff --git a/tests/backtrace-dwarf.c b/tests/backtrace-dwarf.c
index e1eb4928..273d2b5e 100644
--- a/tests/backtrace-dwarf.c
+++ b/tests/backtrace-dwarf.c
@@ -143,8 +143,8 @@ main (int argc __attribute__ ((unused)), char **argv)
       abort ();
     case 0:;
       long l = ptrace (PTRACE_TRACEME, 0, NULL, NULL);
-      assert (errno == 0);
-      assert (l == 0);
+      if (errno != 0 || l != 0)
+        return -1;
       cleanup_13_main ();
       abort ();
     default:
but it's still fine, while:
./backtrace-dwarf
backtrace-dwarf: backtrace-dwarf.c:159: main: Assertion `WIFSTOPPED (status)' failed.
Aborted (core dumped)
That said, the test looks very fragile to me.
I'd suggest the following change to enhance error diagnostics:
diff --git a/tests/backtrace-dwarf.c b/tests/backtrace-dwarf.c
index 35f25ed6..3a22db31 100644
--- a/tests/backtrace-dwarf.c
+++ b/tests/backtrace-dwarf.c
@@ -143,9 +143,8 @@ main (int argc __attribute__ ((unused)), char **argv)
     case -1:
       abort ();
     case 0:;
-      long l = ptrace (PTRACE_TRACEME, 0, NULL, NULL);
-      assert (errno == 0);
-      assert (l == 0);
+      if (ptrace (PTRACE_TRACEME, 0, NULL, NULL))
+        _exit(errno ?: -1);
       cleanup_13_main ();
       abort ();
     default:
@@ -155,10 +154,12 @@ main (int argc __attribute__ ((unused)), char **argv)
   errno = 0;
   int status;
   pid_t got = waitpid (pid, &status, 0);
-  assert (errno == 0);
-  assert (got == pid);
-  assert (WIFSTOPPED (status));
-  assert (WSTOPSIG (status) == SIGABRT);
+  if (got != pid)
+    error (1, errno, "waitpid returned %d", got);
+  if (!WIFSTOPPED (status))
+    error (1, 0, "unexpected wait status %u", status);
+  if (WSTOPSIG (status) != SIGABRT)
+    error (1, 0, "unexpected signal %u", WSTOPSIG (status));
 
   Dwfl *dwfl = pid_to_dwfl (pid);
   dwfl_getthreads (dwfl, thread_callback, NULL);
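The patch above prints the raw wait status as a number. As a possible refinement, here is a minimal sketch of decoding the status into something readable, so that a child that exited (rather than stopped on SIGABRT) is reported as such. The names classify_wait_status and report_wait_status are hypothetical and not part of the test or of elfutils; only the standard W* macros from <sys/wait.h> are used.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/wait.h>

/* Hypothetical names, for illustration only.  */
enum child_state { CHILD_STOPPED, CHILD_EXITED, CHILD_KILLED, CHILD_OTHER };

/* Classify a waitpid status.  *DETAIL receives the stop signal, the
   exit code or the terminating signal, depending on the result.  */
enum child_state
classify_wait_status (int status, int *detail)
{
  assert (detail != NULL);
  if (WIFSTOPPED (status))
    {
      *detail = WSTOPSIG (status);
      return CHILD_STOPPED;
    }
  if (WIFEXITED (status))
    {
      *detail = WEXITSTATUS (status);
      return CHILD_EXITED;
    }
  if (WIFSIGNALED (status))
    {
      *detail = WTERMSIG (status);
      return CHILD_KILLED;
    }
  *detail = status;
  return CHILD_OTHER;
}

/* Print a one-line description of STATUS to OUT.  */
void
report_wait_status (FILE *out, int status)
{
  static const char *const what[] = {
    "stopped by signal", "exited with status",
    "was killed by signal", "has unrecognized wait status"
  };
  int detail;
  enum child_state st = classify_wait_status (status, &detail);
  fprintf (out, "child %s %d\n", what[st], detail);
}
```

With this, a child that returns -1 before reaching abort() would be reported as having exited with status 255 on Linux, instead of tripping a bare WIFSTOPPED check.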
With the suggested patch I see the following in test-suite.log on s390x:
[ 86s] + cat tests/test-suite.log
[ 86s] ==========================================
[ 86s] elfutils 0.174: tests/test-suite.log
[ 86s] ==========================================
[ 86s]
[ 86s] # TOTAL: 202
[ 86s] # PASS: 194
[ 86s] # SKIP: 7
[ 86s] # XFAIL: 0
[ 86s] # FAIL: 1
[ 86s] # XPASS: 0
[ 86s] # ERROR: 0
[ 86s]
[ 86s] .. contents:: :depth: 2
[ 86s]
[ 86s] SKIP: run-addr2line-i-demangle-test.sh
[ 86s] ======================================
[ 86s]
[ 86s] demangler unsupported
[ 86s] SKIP run-addr2line-i-demangle-test.sh (exit status: 77)
[ 86s]
[ 86s] SKIP: run-backtrace-data.sh
[ 86s] ===========================
[ 86s]
[ 86s] /home/abuild/rpmbuild/BUILD/elfutils-0.174/tests/backtrace-data: Unwinding not supported for this architecture
[ 86s] data: arch not supported
[ 86s] SKIP run-backtrace-data.sh (exit status: 77)
[ 86s]
[ 86s] FAIL: run-backtrace-dwarf.sh
[ 86s] ============================
[ 86s]
[ 86s] 0x3ffbda40622 raise
[ 86s] 0x3ffbda23ce2 abort
[ 86s] /home/abuild/rpmbuild/BUILD/elfutils-0.174/tests/backtrace-dwarf: dwfl_thread_getframes: no error
[ 86s] dwarf: no main
[ 86s] FAIL run-backtrace-dwarf.sh (exit status: 1)
[ 86s]
[ 86s] SKIP: run-backtrace-native-core.sh
[ 86s] ==================================
[ 86s]
[ 86s] No core.12202 file generated
[ 86s] SKIP run-backtrace-native-core.sh (exit status: 77)
[ 86s]
[ 86s] SKIP: run-backtrace-native-core-biarch.sh
[ 86s] =========================================
[ 86s]
[ 86s] No core.12218 file generated
[ 86s] SKIP run-backtrace-native-core-biarch.sh (exit status: 77)
[ 86s]
[ 86s] SKIP: run-backtrace-demangle.sh
[ 86s] ===============================
[ 86s]
[ 86s] demangler unsupported
[ 86s] SKIP run-backtrace-demangle.sh (exit status: 77)
[ 86s]
[ 86s] SKIP: run-stack-demangled-test.sh
[ 86s] =================================
[ 86s]
[ 86s] demangler unsupported
[ 86s] SKIP run-stack-demangled-test.sh (exit status: 77)
[ 86s]
[ 86s] SKIP: run-lfs-symbols.sh
[ 86s] ========================
[ 86s]
[ 86s] LFS testing is irrelevent on this system
[ 86s] SKIP run-lfs-symbols.sh (exit status: 77)
[ 86s]
(In reply to Martin Liska from comment #11)
> With the suggested patch I see following in test-suite.log on s390x:
[...]
> [ 86s] FAIL: run-backtrace-dwarf.sh
> [ 86s] ============================
> [ 86s]
> [ 86s] 0x3ffbda40622 raise
> [ 86s] 0x3ffbda23ce2 abort
> [ 86s] /home/abuild/rpmbuild/BUILD/elfutils-0.174/tests/backtrace-dwarf:
> dwfl_thread_getframes: no error
> [ 86s] dwarf: no main
> [ 86s] FAIL run-backtrace-dwarf.sh (exit status: 1)
This doesn't look like PTRACE_TRACEME failing with EPERM; abort() has actually been invoked by the tracee.
(In reply to Dmitry V. Levin from comment #12)
> (In reply to Martin Liska from comment #11)
> > With the suggested patch I see following in test-suite.log on s390x:
> [...]
> > [ 86s] FAIL: run-backtrace-dwarf.sh
> > [ 86s] ============================
> > [ 86s]
> > [ 86s] 0x3ffbda40622 raise
> > [ 86s] 0x3ffbda23ce2 abort
> > [ 86s] /home/abuild/rpmbuild/BUILD/elfutils-0.174/tests/backtrace-dwarf:
> > dwfl_thread_getframes: no error
> > [ 86s] dwarf: no main
> > [ 86s] FAIL run-backtrace-dwarf.sh (exit status: 1)
>
> This doesn't look like a PTRACE_TRACEME failing with EPERM, abort() has
> actually been invoked by the tracee.
I agree with that; the question is how to debug it. Any ideas?
The test case does use assert and abort too much. How about we extend Dmitry's patch to get rid of them all (the only abort that should be there is the one in cleanup-13.c)?
diff --git a/tests/backtrace-dwarf.c b/tests/backtrace-dwarf.c
index 35f25ed..498416f 100644
--- a/tests/backtrace-dwarf.c
+++ b/tests/backtrace-dwarf.c
@@ -16,7 +16,6 @@
    along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
 
 #include <config.h>
-#include <assert.h>
 #include <inttypes.h>
 #include <stdio_ext.h>
 #include <locale.h>
@@ -141,13 +140,18 @@ main (int argc __attribute__ ((unused)), char **argv)
   switch (pid)
   {
     case -1:
-      abort ();
+      perror ("fork failed");
+      exit (-1);
     case 0:;
       long l = ptrace (PTRACE_TRACEME, 0, NULL, NULL);
-      assert (errno == 0);
-      assert (l == 0);
+      if (l != 0)
+        {
+          perror ("PTRACE_TRACEME failed");
+          exit (-1);
+        }
       cleanup_13_main ();
-      abort ();
+      printf ("cleanup_13_main returned, impossible...\n");
+      exit (-1);
     default:
       break;
   }
@@ -155,16 +159,20 @@ main (int argc __attribute__ ((unused)), char **argv)
   errno = 0;
   int status;
   pid_t got = waitpid (pid, &status, 0);
-  assert (errno == 0);
-  assert (got == pid);
-  assert (WIFSTOPPED (status));
-  assert (WSTOPSIG (status) == SIGABRT);
+  if (got != pid)
+    error (1, errno, "waitpid returned %d", got);
+  if (!WIFSTOPPED (status))
+    error (1, 0, "unexpected wait status %u", status);
+  if (WSTOPSIG (status) != SIGABRT)
+    error (1, 0, "unexpected signal %u", WSTOPSIG (status));
 
   Dwfl *dwfl = pid_to_dwfl (pid);
-  dwfl_getthreads (dwfl, thread_callback, NULL);
+  if (dwfl_getthreads (dwfl, thread_callback, NULL) == -1)
+    error (1, 0, "dwfl_getthreads: %s", dwfl_errmsg (-1));
 
   /* There is an exit (0) call if we find the "main" frame,  */
-  error (1, 0, "dwfl_getthreads: %s", dwfl_errmsg (-1));
+  printf ("dwfl_getthreads returned, main not found\n");
+  exit (-1);
 }
 
 #endif /* ! __linux__ */
Thanks Mark, I applied the patch but I still see the same result. For now, I'm leaving it at that; I'm not that interested in s390x ;)
(In reply to Martin Liska from comment #15)
> Thanks Mark, I installed the patch but I see still the same.
The output was exactly the same? That is surprising. So there is no additional output that explains which failure path was taken? I would have expected at least a message about the dwfl_getthreads call.
> For now, I'm
> leaving that, I'm not so much interested in s390x ;)
Understood if it is too much work to track down. We have other s390x setups that seem fine. But I still don't fully understand the issue.
(In reply to Mark Wielaard from comment #16)
> (In reply to Martin Liska from comment #15)
> > Thanks Mark, I installed the patch but I see still the same.
>
> The output was exactly the same? That is surprising. So there is no
> additional output that explains which failure path was taken? I would have
> expected at least a message about the dwfl_getthreads call.
Yes:
$ ./backtrace-dwarf
0x3ff8a9c0622 raise
0x3ff8a9a3ce2 abort
./backtrace-dwarf: dwfl_thread_getframes: no error
It looks like the child correctly triggers the assert.
>
> > For now, I'm
> > leaving that, I'm not so much interested in s390x ;)
>
> Understood if it is too much work to track down. We have other s390x setups
> that seems fine. But I still don't fully understand the issue.
(In reply to Martin Liska from comment #17)
> (In reply to Mark Wielaard from comment #16)
> > (In reply to Martin Liska from comment #15)
> > > Thanks Mark, I installed the patch but I see still the same.
> >
> > The output was exactly the same? That is surprising. So there is no
> > additional output that explains which failure path was taken? I would have
> > expected at least a message about the dwfl_getthreads call.
>
> Yes:
>
> $ ./backtrace-dwarf
> 0x3ff8a9c0622 raise
> 0x3ff8a9a3ce2 abort
> ./backtrace-dwarf: dwfl_thread_getframes: no error
>
> Looks that child correctly triggers assert.
Aha, ok, yes, I missed that dwfl_getthreads just calls dwfl_thread_getframes (there is only one thread) and that this does indeed not find the main frame. I'll tweak the testcase a bit more to make it show that. But we now know for sure that it isn't the test infrastructure failing; the unwinder really does seem to fail to unwind through abort, and so it doesn't find main. I still don't know what is happening, though.
I see a similar looking failure on arm64 on Ubuntu 18.10: https://launchpadlibrarian.net/391377304/buildlog_ubuntu-cosmic-arm64.elfutils_0.170-0.5_BUILDING.txt.gz I've gdb-ed this to the point that the key difference between a working system (Ubuntu 18.04) and the failing one is that libc.so.6 has a lot more entries in .eh_frame_hdr in the failing system. On 18.04 it fails to find a fde for abort() (or raise, I think) and unwinds using .debug_frame and that succeeds. On 18.10 it finds a fde for both raise and abort but fails to successfully unwind past abort using it. I don't know either why the newer libc.so.6 has a bigger eh_frame_hdr (it is glibc 2.28 vs 2.27 but also built with newer gcc and binutils) or why unwinding using eh_frame info fails.
(In reply to Michael Hudson-Doyle from comment #19)
> I see a similar looking failure on arm64 on Ubuntu 18.10:
>
> https://launchpadlibrarian.net/391377304/buildlog_ubuntu-cosmic-arm64.
> elfutils_0.170-0.5_BUILDING.txt.gz
So, if possible, could you build with current git or 0.174 + the patch from comment #14 or commit 69d6e67eee30c483ba53a8e1da1b3568033e3dde?
> I've gdb-ed this to the point that the key difference between a working
> system (Ubuntu 18.04) and the failing one is that libc.so.6 has a lot more
> entries in .eh_frame_hdr in the failing system. On 18.04 it fails to find a
> fde for abort() (or raise, I think) and unwinds using .debug_frame and that
> succeeds. On 18.10 it finds a fde for both raise and abort but fails to
> successfully unwind past abort using it. I don't know either why the newer
> libc.so.6 has a bigger eh_frame_hdr (it is glibc 2.28 vs 2.27 but also built
> with newer gcc and binutils) or why unwinding using eh_frame info fails.
In principle the .eh_frame and .debug_frame should provide the same CFI, although encoded slightly differently. Maybe there is a difference? You should be able to find both with eu-readelf --debug-dump=frame
(In reply to Mark Wielaard from comment #20)
> (In reply to Michael Hudson-Doyle from comment #19)
> > I see a similar looking failure on arm64 on Ubuntu 18.10:
> >
> > https://launchpadlibrarian.net/391377304/buildlog_ubuntu-cosmic-arm64.
> > elfutils_0.170-0.5_BUILDING.txt.gz
>
> So, if possible could you build with current git or 0.174 + the patch from
> comment #14 or commit 69d6e67eee30c483ba53a8e1da1b3568033e3dde
Oh hmm current git passes! Sorry for the noise.
Oh and obviously f881459ffc95b6fad51aa055a158ee14814073aa fixes this (somehow I failed to read the git log correctly and had to bisect to find it but there's no real excuse for that).
> > I've gdb-ed this to the point that the key difference between a working
> > system (Ubuntu 18.04) and the failing one is that libc.so.6 has a lot more
> > entries in .eh_frame_hdr in the failing system. On 18.04 it fails to find a
> > fde for abort() (or raise, I think) and unwinds using .debug_frame and that
> > succeeds. On 18.10 it finds a fde for both raise and abort but fails to
> > successfully unwind past abort using it. I don't know either why the newer
> > libc.so.6 has a bigger eh_frame_hdr (it is glibc 2.28 vs 2.27 but also built
> > with newer gcc and binutils) or why unwinding using eh_frame info fails.
>
> In principle the .eh_frame and .debug_frame should provide the same CFI,
> although encoded slightly differently. Maybe there is a difference? You
> should be able to find both with eu-readelf --debug-dump=frame
I wrote most of what follows while waiting for the test run above to complete but for the record...
So something I forgot to mention is that the newer glibc has no .debug_frame (not even in the /usr/lib/debug file that has the other debug data). So in a sense the fact that elfutils is trying to unwind using eh_frame and not trying the debug_frame data at all is actually not relevant here.
That said, here is the debug_frame CFI from libc in the working environment:
 [ 3d28] FDE length=36 cie=[ 3d18]
   CIE_pointer: 15640
   initial_location: +0x0000000000033760 <abort>
   address_range: 0x228
   Program:
     advance_loc 1 to 0x4
     def_cfa_offset 320
     offset r29 (x29) at cfa-320
     offset r30 (x30) at cfa-312
     advance_loc 2 to 0xc
     def_cfa_register r29 (x29)
     advance_loc 1 to 0x10
     offset r19 (x19) at cfa-304
     offset r20 (x20) at cfa-296
And here is the eh_frame CFI from the libc that fails:
 [ 2b08] FDE length=28 cie=[ 0]
   CIE_pointer: 11020
   initial_location: +0x00000000000207d8 <abort> (offset: 0x207d8)
   address_range: 0x214 (end offset: 0x209ec)
   Program:
     advance_loc 1 to 0x207dc
     def_cfa_offset 320
     offset r29 (x29) at cfa-320
     offset r30 (x30) at cfa-312
     advance_loc 4 to 0x207ec
     offset r19 (x19) at cfa-304
     offset r20 (x20) at cfa-296
     nop
     nop
I guess it's the lack of the def_cfa_register r29 in the eh_frame data that is making the difference.
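To make the comparison concrete, here is a toy model of how the two programs above define the CFA. It is only a sketch: it handles just def_cfa_offset and def_cfa_register, ignores advance_loc and the register offset rules, and is not how libdw evaluates CFI. All type and function names are made up for illustration. The initial rule "sp + 0" is an assumption about the CIE, which is not shown in this report; 29 and 31 are the DWARF register numbers of x29 and sp on aarch64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model, hypothetical names.  Only the CFA-defining ops are kept.  */
enum cfi_op { OP_DEF_CFA_OFFSET, OP_DEF_CFA_REGISTER };
struct cfi_insn { enum cfi_op op; uint64_t arg; };
struct cfa_rule { unsigned reg; uint64_t offset; };	/* CFA = reg + offset */

enum { REG_X29 = 29, REG_SP = 31 };	/* aarch64 DWARF register numbers */

/* CFA-affecting instructions of the two FDEs for abort quoted above.  */
static const struct cfi_insn debug_frame_abort[] = {
  { OP_DEF_CFA_OFFSET, 320 },
  { OP_DEF_CFA_REGISTER, REG_X29 },
};
static const struct cfi_insn eh_frame_abort[] = {
  { OP_DEF_CFA_OFFSET, 320 },
};

/* Apply N instructions of PROG to INITIAL and return the final rule.  */
struct cfa_rule
run_cfa_program (struct cfa_rule initial, const struct cfi_insn *prog,
                 size_t n)
{
  assert (prog != NULL || n == 0);
  struct cfa_rule rule = initial;
  for (size_t i = 0; i < n; i++)
    switch (prog[i].op)
      {
      case OP_DEF_CFA_OFFSET:
        rule.offset = prog[i].arg;
        break;
      case OP_DEF_CFA_REGISTER:
        rule.reg = (unsigned) prog[i].arg;
        break;
      }
  return rule;
}

/* Evaluate RULE against a register file indexed by DWARF number.  */
uint64_t
compute_cfa (struct cfa_rule rule, const uint64_t regs[32])
{
  return regs[rule.reg] + rule.offset;
}
```

Starting from sp + 0, the debug_frame program ends with the rule x29 + 320 and the eh_frame program with sp + 320. The two give the same CFA whenever x29 and sp hold the same value, and differ only if sp moves after the prologue. So this shows how the rules differ, not that the difference is what broke the unwind; the report does not establish that.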
(In reply to Michael Hudson-Doyle from comment #21)
> (In reply to Mark Wielaard from comment #20)
> > (In reply to Michael Hudson-Doyle from comment #19)
> > > I see a similar looking failure on arm64 on Ubuntu 18.10:
> > >
> > > https://launchpadlibrarian.net/391377304/buildlog_ubuntu-cosmic-arm64.
> > > elfutils_0.170-0.5_BUILDING.txt.gz
> >
> > So, if possible could you build with current git or 0.174 + the patch from
> > comment #14 or commit 69d6e67eee30c483ba53a8e1da1b3568033e3dde
>
> Oh hmm current git passes! Sorry for the noise.
>
> Oh and obviously f881459ffc95b6fad51aa055a158ee14814073aa fixes this
Cool. So this is different from the s390x issue. Which we sadly don't yet understand. But if that happens again on s390x an inspection of the CFI and whether it comes from .eh_frame or .debug_frame might be helpful.
Just for the record, as of version 0.175 the test works fine on all targets I can test (including s390x).
(In reply to Martin Liska from comment #23)
> Just for the record, as of version 0.175 the test works fine on all targets
> I can test (including s390x).
Let's close this for now. It can be reopened if we have a new test failure.