Buildbot failure in Wildebeest Builder on whole buildset

Mark Wielaard mark@klomp.org
Thu Dec 16 17:05:04 GMT 2021


Hi,

On Thu, 2021-12-16 at 01:10 +0000, buildbot@builder.wildebeest.org
wrote:
> The Buildbot has detected a new failure on builder
> elfutils-centos-x86_64 while building elfutils.
> Full details are available at:
>     https://builder.wildebeest.org/buildbot/#builders/1/builds/884
> 
> Buildbot URL: https://builder.wildebeest.org/buildbot/
> 
> Worker for this Build: centos-x86_64
> 
> Build Reason: <unknown>
> Blamelist: Alexander Kanavin <alex@linutronix.de>
> 
> BUILD FAILED: failed test (failure)
> 
> Sincerely,
>  -The Buildbot
> 
> The Buildbot has detected a new failure on builder
> elfutils-fedora-x86_64 while building elfutils.
> Full details are available at:
>     https://builder.wildebeest.org/buildbot/#builders/3/builds/876
> 
> Buildbot URL: https://builder.wildebeest.org/buildbot/
> 
> Worker for this Build: fedora-x86_64
> 
> Build Reason: <unknown>
> Blamelist: Alexander Kanavin <alex@linutronix.de>
> 
> BUILD FAILED: failed test (failure)

So this is really unfortunate and has nothing to do with the patch from
Alexander.

These are two different but related failures.

On centos-x86_64 this is:

FAIL: run-backtrace-native-core-biarch.sh
=========================================

/usr/bin/coredumpctl
0xf77ac000	0xf77ad000	linux-gate.so.1
0xf77ad000	0xf77d08fc	ld-linux.so.2
0xf75b4000	0xf777ea1c	libc.so.6
0xf777f000	0xf7799248	libpthread.so.0
0x5658e000	0x56591050	backtrace-child-biarch
TID 24658:
# 0 0xf77ac430    	__kernel_vsyscall
# 1 0xf778dd16 - 1	raise
# 2 0x5658eafc - 1	sigusr2
# 3 0x5658ebeb - 1	stdarg
# 4 0x5658ec2f - 1	backtracegen
# 5 0x5658ec38 - 1	start
# 6 0xf7785bbc - 1	start_thread
# 7 0xf76b227e - 1	__clone
TID 24656:
# 0 0xf76b2268    	__clone
/srv/buildbot/worker/elfutils-centos-x86_64/build/tests/backtrace: dwfl_thread_getframes: No DWARF information found
backtrace: backtrace.c:81: callback_verify: Assertion `seen_main' failed.
./test-subr.sh: line 84: 24682 Aborted                 (core dumped) LD_LIBRARY_PATH="${built_library_path}${LD_LIBRARY_PATH:+:}$LD_LIBRARY_PATH" $VALGRIND_CMD "$@"
backtrace-child-biarch-core.24656: no main

Note that this is an i386 process being backtraced on x86_64.

On fedora-x86_64 this is:


FAIL: run-backtrace-native-core.sh
==================================

/usr/bin/coredumpctl
0x7ffd3d934000	0x7ffd3d935000	linux-vdso.so.1
0x7f4ccbf99000	0x7f4ccbfcd200	ld-linux-x86-64.so.2
0x7f4ccbd7b000	0x7f4ccbf84ad0	libc.so.6
0x56038dbfc000	0x56038dc000a8	backtrace-child
TID 3043057:
# 0 0x7f4ccbe0a89c    	__pthread_kill_implementation
# 1 0x7f4ccbdbd6b6 - 1	raise
# 2 0x56038dbfd3fd - 1	sigusr2
# 3 0x56038dbfd4ca - 1	stdarg
# 4 0x56038dbfd4e0 - 1	backtracegen
# 5 0x56038dbfd4e9 - 1	start
# 6 0x7f4ccbe08ad7 - 1	start_thread
# 7 0x7f4ccbe8d770 - 1	__clone3
TID 3043052:
# 0 0x7f4ccbe8d75d    	__clone3
/srv/buildbot/worker/elfutils-fedora-x86_64/build/tests/backtrace: dwfl_thread_getframes: address out of range
backtrace: backtrace.c:81: callback_verify: Assertion `seen_main' failed.
./test-subr.sh: line 84: 3043062 Aborted                 (core dumped) LD_LIBRARY_PATH="${built_library_path}${LD_LIBRARY_PATH:+:}$LD_LIBRARY_PATH" $VALGRIND_CMD "$@"
backtrace-child-core.3043052: no main
rmdir: failed to remove 'test-3043029': Directory not empty
FAIL run-backtrace-native-core.sh (exit status: 1)

This is an x86_64 process core being backtraced on x86_64.

The problem in both cases is that the parent cannot unwind from the
exact pc it is stuck at. With eu-stack -v --core we can see (for the
parent TID):

TID 3043052:
#0  0x00007f4ccbe8d75d     __clone3 - libc.so.6
    ../sysdeps/unix/sysv/linux/x86_64/clone3.S:62
eu-stack: dwfl_thread_getframes tid 3043052 at 0x7f4ccbe8d75d in libc.so.6: address out of range

That is this source code:

ENTRY (__clone3)
        /* Sanity check arguments.  */
        movl    $-EINVAL, %eax
        test    %RDI_LP, %RDI_LP        /* No NULL cl_args pointer.  */
        jz      SYSCALL_ERROR_LABEL
        test    %RDX_LP, %RDX_LP        /* No NULL function pointer.  */
        jz      SYSCALL_ERROR_LABEL

        /* Save the cl_args pointer in R8 which is preserved by the
           syscall.  */
        mov     %RCX_LP, %R8_LP

        /* Do the system call.  */
        movl    $SYS_ify(clone3), %eax

        /* End FDE now, because in the child the unwind info will be
           wrong.  */
        cfi_endproc
        syscall

=>      test    %RAX_LP, %RAX_LP
        jl      SYSCALL_ERROR_LABEL
        jz      L(thread_start)

        ret

L(thread_start):
        cfi_startproc
        /* Clearing frame pointer is insufficient, use CFI.  */
        cfi_undefined (rip)
        /* Clear the frame pointer.  The ABI suggests this be done, to mark
           the outermost frame obviously.  */
        xorl    %ebp, %ebp

        /* Align stack to 16 bytes per the x86-64 psABI.  */
        and     $-16, %RSP_LP
[...]

So the PC is right after the syscall, where, as the code says, there is
no CFI. Apparently the child ran first and quickly got to the
terminating kill, while the parent was still stuck in the syscall (or
just out of it, but not yet returned from the clone3 call).

I think some synchronization is missing between the parent and child.
But the test code is fairly complex.
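
One possible shape for such synchronization, as a rough sketch only: the
child waits on a semaphore that the parent posts after pthread_create
has returned, so the parent can no longer be at the CFI-less pc right
after the clone3 syscall when the child sends its signal. All names
below (parent_returned, child_start, run_synchronized_child) are
hypothetical and are not taken from the elfutils test sources; the
real test raises a signal where this sketch only sets a flag.

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

/* Hypothetical names for illustration; not the elfutils test code. */
static sem_t parent_returned;
static int child_reached_signal_point;

static void *child_start(void *arg)
{
    (void) arg;

    /* Block until the parent reports that pthread_create has returned,
       i.e. that it is past the instructions following the clone3
       syscall.  Retry if a signal handler interrupts the wait.  */
    while (sem_wait(&parent_returned) == -1 && errno == EINTR)
        continue;

    /* The real test would send its terminating signal here.  This
       sketch only records that the point was reached.  */
    child_reached_signal_point = 1;
    return NULL;
}

/* Returns 0 if the child only proceeded after the parent left
   pthread_create, -1 on any setup failure.  */
int run_synchronized_child(void)
{
    pthread_t thread;

    child_reached_signal_point = 0;
    if (sem_init(&parent_returned, 0, 0) != 0)
        return -1;

    if (pthread_create(&thread, NULL, child_start, NULL) != 0) {
        sem_destroy(&parent_returned);
        return -1;
    }

    /* pthread_create has returned, so this thread is no longer inside
       the clone wrapper.  Let the child continue.  */
    sem_post(&parent_returned);

    /* The parent now waits in pthread_join.  */
    pthread_join(thread, NULL);
    sem_destroy(&parent_returned);

    return child_reached_signal_point ? 0 : -1;
}
```

Whether a semaphore is the right primitive for the actual test is an
open question; the sketch only shows the ordering that seems to be
missing, and it needs to be built with -pthread on older glibc.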

Cheers,

Mark 

