Buildbot failure in Wildebeest Builder on whole buildset
Mark Wielaard
mark@klomp.org
Thu Dec 16 17:05:04 GMT 2021
Hi,
On Thu, 2021-12-16 at 01:10 +0000, buildbot@builder.wildebeest.org
wrote:
> The Buildbot has detected a new failure on builder
> elfutils-centos-x86_64 while building elfutils.
> Full details are available at:
> https://builder.wildebeest.org/buildbot/#builders/1/builds/884
>
> Buildbot URL: https://builder.wildebeest.org/buildbot/
>
> Worker for this Build: centos-x86_64
>
> Build Reason: <unknown>
> Blamelist: Alexander Kanavin <alex@linutronix.de>
>
> BUILD FAILED: failed test (failure)
>
> Sincerely,
> -The Buildbot
>
> The Buildbot has detected a new failure on builder
> elfutils-fedora-x86_64 while building elfutils.
> Full details are available at:
> https://builder.wildebeest.org/buildbot/#builders/3/builds/876
>
> Buildbot URL: https://builder.wildebeest.org/buildbot/
>
> Worker for this Build: fedora-x86_64
>
> Build Reason: <unknown>
> Blamelist: Alexander Kanavin <alex@linutronix.de>
>
> BUILD FAILED: failed test (failure)
So this is really unfortunate and has nothing to do with the patch from
Alexander.
These are two different, but related failures.
On centos-x86_64 this is:
FAIL: run-backtrace-native-core-biarch.sh
=========================================
/usr/bin/coredumpctl
0xf77ac000 0xf77ad000 linux-gate.so.1
0xf77ad000 0xf77d08fc ld-linux.so.2
0xf75b4000 0xf777ea1c libc.so.6
0xf777f000 0xf7799248 libpthread.so.0
0x5658e000 0x56591050 backtrace-child-biarch
TID 24658:
# 0 0xf77ac430 __kernel_vsyscall
# 1 0xf778dd16 - 1 raise
# 2 0x5658eafc - 1 sigusr2
# 3 0x5658ebeb - 1 stdarg
# 4 0x5658ec2f - 1 backtracegen
# 5 0x5658ec38 - 1 start
# 6 0xf7785bbc - 1 start_thread
# 7 0xf76b227e - 1 __clone
TID 24656:
# 0 0xf76b2268 __clone
/srv/buildbot/worker/elfutils-centos-x86_64/build/tests/backtrace: dwfl_thread_getframes: No DWARF information found
backtrace: backtrace.c:81: callback_verify: Assertion `seen_main' failed.
./test-subr.sh: line 84: 24682 Aborted (core dumped) LD_LIBRARY_PATH="${built_library_path}${LD_LIBRARY_PATH:+:}$LD_LIBRARY_PATH" $VALGRIND_CMD "$@"
backtrace-child-biarch-core.24656: no main
Note that this is an i386 process being backtraced on x86_64.
On fedora-x86_64 this is:
FAIL: run-backtrace-native-core.sh
==================================
/usr/bin/coredumpctl
0x7ffd3d934000 0x7ffd3d935000 linux-vdso.so.1
0x7f4ccbf99000 0x7f4ccbfcd200 ld-linux-x86-64.so.2
0x7f4ccbd7b000 0x7f4ccbf84ad0 libc.so.6
0x56038dbfc000 0x56038dc000a8 backtrace-child
TID 3043057:
# 0 0x7f4ccbe0a89c __pthread_kill_implementation
# 1 0x7f4ccbdbd6b6 - 1 raise
# 2 0x56038dbfd3fd - 1 sigusr2
# 3 0x56038dbfd4ca - 1 stdarg
# 4 0x56038dbfd4e0 - 1 backtracegen
# 5 0x56038dbfd4e9 - 1 start
# 6 0x7f4ccbe08ad7 - 1 start_thread
# 7 0x7f4ccbe8d770 - 1 __clone3
TID 3043052:
# 0 0x7f4ccbe8d75d __clone3
/srv/buildbot/worker/elfutils-fedora-x86_64/build/tests/backtrace: dwfl_thread_getframes: address out of range
backtrace: backtrace.c:81: callback_verify: Assertion `seen_main' failed.
./test-subr.sh: line 84: 3043062 Aborted (core dumped) LD_LIBRARY_PATH="${built_library_path}${LD_LIBRARY_PATH:+:}$LD_LIBRARY_PATH" $VALGRIND_CMD "$@"
backtrace-child-core.3043052: no main
rmdir: failed to remove 'test-3043029': Directory not empty
FAIL run-backtrace-native-core.sh (exit status: 1)
This is an x86_64 process core being backtraced on x86_64.
The problem in both cases is that the parent cannot unwind from the
exact pc it is stuck at. With eu-stack -v --core we can see (for the
parent TID):
TID 3043052:
#0 0x00007f4ccbe8d75d __clone3 - libc.so.6
../sysdeps/unix/sysv/linux/x86_64/clone3.S:62
eu-stack: dwfl_thread_getframes tid 3043052 at 0x7f4ccbe8d75d in
libc.so.6: address out of range
That is this source code:
ENTRY (__clone3)
/* Sanity check arguments. */
movl $-EINVAL, %eax
test %RDI_LP, %RDI_LP /* No NULL cl_args pointer. */
jz SYSCALL_ERROR_LABEL
test %RDX_LP, %RDX_LP /* No NULL function pointer. */
jz SYSCALL_ERROR_LABEL
/* Save the cl_args pointer in R8 which is preserved by the
syscall. */
mov %RCX_LP, %R8_LP
/* Do the system call. */
movl $SYS_ify(clone3), %eax
/* End FDE now, because in the child the unwind info will be
wrong. */
cfi_endproc
syscall
=> test %RAX_LP, %RAX_LP
jl SYSCALL_ERROR_LABEL
jz L(thread_start)
ret
L(thread_start):
cfi_startproc
/* Clearing frame pointer is insufficient, use CFI. */
cfi_undefined (rip)
/* Clear the frame pointer. The ABI suggests this be done, to mark
the outermost frame obviously. */
xorl %ebp, %ebp
/* Align stack to 16 bytes per the x86-64 psABI. */
and $-16, %RSP_LP
[...]
So the PC is right after the syscall, where, as the code comment says,
there is no CFI. Apparently the child ran first and quickly got to the
terminating kill, while the parent was still stuck in the syscall (or
just out of it, but not yet returned from the clone3 call).
I think some synchronization is missing between the parent and child.
But the test code is fairly complex.
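To illustrate the kind of synchronization meant here, below is a minimal sketch in C. It is not the actual test code: the names run_synchronized, child_start and parent_returned are made up for this example, and it assumes the thread is created with pthread_create. The child waits on a semaphore that the parent only posts after pthread_create has returned, so the child cannot reach its terminating signal while the parent is still inside the clone/clone3 stub.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>

/* Hypothetical names, for illustration only.  */

/* Posted by the parent once pthread_create has returned.  */
static sem_t parent_returned;
/* Set by the parent just before it posts the semaphore.  */
static atomic_int parent_out_of_clone;
/* What the child observed at the point where it would raise.  */
static atomic_int child_saw_parent_out;

static void *
child_start (void *arg)
{
  (void) arg;
  /* Block until the parent is back from thread creation;
     retry if the wait is interrupted by a signal.  */
  while (sem_wait (&parent_returned) == -1 && errno == EINTR)
    continue;
  atomic_store (&child_saw_parent_out,
                atomic_load (&parent_out_of_clone));
  /* A real test would raise its terminating signal here.  */
  return NULL;
}

/* Returns 1 if the child only proceeded after the parent had
   returned from pthread_create, 0 otherwise.  */
int
run_synchronized (void)
{
  pthread_t thread;
  int err;

  atomic_store (&parent_out_of_clone, 0);
  atomic_store (&child_saw_parent_out, 0);

  err = sem_init (&parent_returned, 0, 0);
  assert (err == 0);
  err = pthread_create (&thread, NULL, child_start, NULL);
  assert (err == 0);

  /* The clone/clone3 syscall and its return are behind us now.  */
  atomic_store (&parent_out_of_clone, 1);
  err = sem_post (&parent_returned);
  assert (err == 0);

  err = pthread_join (thread, NULL);
  assert (err == 0);
  sem_destroy (&parent_returned);
  return atomic_load (&child_saw_parent_out);
}
```

With something like this (built with -pthread), the parent would be sitting in pthread_join or later code when the core is taken, where unwind information should be available. Whether this fits the existing backtrace-child test is a separate question.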
Cheers,
Mark