Summary: syslog-ng loads and then unloads a few shared libraries at runtime, and on ppc64le, we have intermittent segfault error. I enabled LD_DEBUG and the last few line in log is: calling fini: /usr/lib64/syslog-ng/librate-limit-filter.so [0]. More detailed bug report are as below. glibc version: 2.36 (the latest release/2.36/master has the same issue) build host: x86-64; target: ppc64le options given to configure (except a few cross-compilation specific options): --enable-kernel=3.10.0 \ --disable-profile \ --disable-debug \ --without-gd \ --enable-clocale=gnu \ --without-selinux \ --enable-tunables \ --enable-bind-now \ --enable-stack-protector=strong \ --disable-crypt \ --with-default-link \ --disable-werror \ --with-cpu=power9 \ --disable-static \ --enable-memory-tagging \ --enable-nscd kernel: 6.1.5 gcc version: 12.2.0 More detailed description: Running syslog-ng on qemuppc64 gives us intermittent segfault error. The easiest way to reproduce the bug is running `syslog-ng -V' a few times. I checked the source codes of syslog-ng, what it does is doing loading & unloading of shared libraries under /usr/lib64/syslog-ng to get some basic information of the libraries. What it really invokes is 'g_module_open' and 'g_module_close'. Below are the error messages: [ 41.117221] syslog-ng[187]: segfault (11) at 7fffcb0cd9a0 nip 7fffcb0cd9a0 lr 7fffcb0cd9a0 code 2 [ 41.118589] syslog-ng[187]: code: cb0cd9a0 00007fff 24d0cca0 00000001 b41b3200 00007fff b539b770 00007fff [ 41.118734] syslog-ng[187]: code: 00000001 00000000 24d0bd10 00000001 00007fff 24d24370 00000001 Segmentation fault (core dumped) backtrace (note that it could segfault on any .so file, the libkvformat.so is just one example): Core was generated by `/usr/sbin/syslog-ng -F --enable-core --cfgfile /etc/syslog-ng/syslog-ng.conf --'. Program terminated with signal SIGSEGV, Segmentation fault. --Type for more, q to quit, c to continue without paging-- #0 0x00007fff9ace34c8 in __do_global_dtors_aux () from /usr/lib64/syslog-ng/libkvformat.so (gdb) bt #0 0x00007fff9ace34c8 in __do_global_dtors_aux () from /usr/lib64/syslog-ng/libkvformat.so #1 0x00007fff9ace34b8 in __do_global_dtors_aux () from /usr/lib64/syslog-ng/libkvformat.so #2 0x00007fff9ba401d4 in __GI__dl_catch_exception (exception=0x7fff9acff458, operate=0x0, args=0x11cc50ff0) at dl-error-skeleton.c:182 #3 0x00007fff9be12d3c in _dl_close_worker (force=false, map=) at dl-close.c:292 #4 _dl_close (_map=) at dl-close.c:818 #5 0x00007fff9ba40174 in __GI__dl_catch_exception ( exception=exception@entry=0x7ffff25d7230, operate=0x7fff9be12930 <_dl_close>, args=0x11cc50ff0) at dl-error-skeleton.c:208 #6 0x00007fff9ba4025c in __GI__dl_catch_error (objname=0x7ffff25d72b8, errstring=0x7ffff25d72b0, mallocedp=0x7ffff25d72c0, operate=, args=) at dl-error-skeleton.c:227 #7 0x00007fff9be41888 in _rtld_catch_error (objname=, errstring=, mallocedp=, operate=, args=) at dl-error-skeleton.c:260 #8 0x00007fff9b928378 in _dlerror_run (operate=, args=) at dlerror.c:138 #9 0x00007fff9b927f3c in __dlclose (handle=) at dlclose.c:31 #10 0x00007fff9b4f18f8 in _g_module_close (handle=) --Type for more, q to quit, c to continue without paging-- at /usr/src/debug/glib-2.0/1_2.74.4-r0/gmodule/gmodule-dl.c:190 #11 g_module_close (module=0x11cc4e8a0) at /usr/src/debug/glib-2.0/1_2.74.4-r0/gmodule/gmodule.c:753 #12 0x00007fff9bcf6350 in plugin_discover_candidate_modules ( context=0x11cc46630) at /usr/src/debug/syslog-ng/3.38.1-r0/lib/plugin.c:532 #13 0x00007fff9bcc6828 in cfg_discover_candidate_modules (self=) at /usr/src/debug/syslog-ng/3.38.1-r0/lib/cfg.c:208 #14 cfg_discover_candidate_modules (self=) at /usr/src/debug/syslog-ng/3.38.1-r0/lib/cfg.c:203 #15 0x00007fff9bcc787c in cfg_read_config (self=, fname=, preprocess_into=) at /usr/src/debug/syslog-ng/3.38.1-r0/lib/cfg.c:655 #16 0x00007fff9bceb2b0 in main_loop_read_and_init_config ( self=0x7fff9bdc35b0 <main_loop>) at /usr/src/debug/syslog-ng/3.38.1-r0/lib/mainloop.c:618 #17 0x000000011cbe21e0 in main (argc=, argv=) at /usr/src/debug/syslog-ng/3.38.1-r0/syslog-ng/main.c:284 The last few lines of the LD_DEBUG_OUTPUT when segfault happens after LD_DEBUG is enabled (as said above, it could segfault on any .so file, this time it's librate-limit-filter.so: 306: symbol=g_module_check_init; lookup in file=/usr/lib64/libssl.so.3 [0] 306: symbol=g_module_check_init; lookup in file=/usr/lib64/libcrypto.so.3 [0] 306: symbol=g_module_check_init; lookup in file=/lib64/libc.so.6 [0] 306: symbol=g_module_check_init; lookup in file=/lib64/ld64.so.2 [0] 306: symbol=g_module_check_init; lookup in file=/lib64/libcap.so.2 [0] 306: symbol=g_module_check_init; lookup in file=/usr/lib64/libzstd.so.1 [0] 306: symbol=g_module_check_init; lookup in file=/usr/lib64/libpcre2-8.so.0 [0] 306: /usr/lib64/syslog-ng/librate-limit-filter.so: error: symbol lookup error: undefined symbol: g_module_check_init (fatal) 306: symbol=g_module_unload; lookup in file=/usr/lib64/syslog-ng/librate-limit-filter.so [0] 306: symbol=g_module_unload; lookup in file=/usr/lib64/libsyslog-ng-3.38.so.0 [0] 306: symbol=g_module_unload; lookup in file=/lib64/libm.so.6 [0] 306: symbol=g_module_unload; lookup in file=/lib64/librt.so.1 [0] 306: symbol=g_module_unload; lookup in file=/usr/lib64/libevtlog-3.38.so.0 [0] 306: symbol=g_module_unload; lookup in file=/usr/lib64/libpcre.so.1 [0] 306: symbol=g_module_unload; lookup in file=/lib64/libsystemd.so.0 [0] 306: symbol=g_module_unload; lookup in file=/usr/lib64/libsecret-storage.so.0 [0] 306: symbol=g_module_unload; lookup in file=/usr/lib64/libgmodule-2.0.so.0 [0] 306: symbol=g_module_unload; lookup in file=/usr/lib64/libgthread-2.0.so.0 [0] 306: symbol=g_module_unload; lookup in file=/usr/lib64/libglib-2.0.so.0 [0] 306: symbol=g_module_unload; lookup in file=/usr/lib64/libssl.so.3 [0] 306: symbol=g_module_unload; lookup in file=/usr/lib64/libcrypto.so.3 [0] 306: symbol=g_module_unload; lookup in file=/lib64/libc.so.6 [0] 306: symbol=g_module_unload; lookup in file=/lib64/ld64.so.2 [0] 306: symbol=g_module_unload; lookup in file=/lib64/libcap.so.2 [0] 306: symbol=g_module_unload; lookup in file=/usr/lib64/libzstd.so.1 [0] 306: symbol=g_module_unload; lookup in file=/usr/lib64/libpcre2-8.so.0 [0] 306: /usr/lib64/syslog-ng/librate-limit-filter.so: error: symbol lookup error: undefined symbol: g_module_unload (fatal) 306: symbol=module_info; lookup in file=/usr/lib64/syslog-ng/librate-limit-filter.so [0] 306: 306: calling fini: /usr/lib64/syslog-ng/librate-limit-filter.so [0] 306: As a comparison, here are a few lines when no segfault happens: 305: symbol=g_module_check_init; lookup in file=/usr/lib64/libglib-2.0.so.0 [0] 305: symbol=g_module_check_init; lookup in file=/usr/lib64/libssl.so.3 [0] 305: symbol=g_module_check_init; lookup in file=/usr/lib64/libcrypto.so.3 [0] 305: symbol=g_module_check_init; lookup in file=/lib64/libc.so.6 [0] 305: symbol=g_module_check_init; lookup in file=/lib64/ld64.so.2 [0] 305: symbol=g_module_check_init; lookup in file=/lib64/libcap.so.2 [0] 305: symbol=g_module_check_init; lookup in file=/usr/lib64/libzstd.so.1 [0] 305: symbol=g_module_check_init; lookup in file=/usr/lib64/libpcre2-8.so.0 [0] 305: /usr/lib64/syslog-ng/libappmodel.so: error: symbol lookup error: undefined symbol: g_module_check_init (fatal) 305: symbol=g_module_unload; lookup in file=/usr/lib64/syslog-ng/libappmodel.so [0] 305: symbol=g_module_unload; lookup in file=/usr/lib64/libsyslog-ng-3.38.so.0 [0] 305: symbol=g_module_unload; lookup in file=/lib64/libm.so.6 [0] 305: symbol=g_module_unload; lookup in file=/lib64/librt.so.1 [0] 305: symbol=g_module_unload; lookup in file=/usr/lib64/libevtlog-3.38.so.0 [0] 305: symbol=g_module_unload; lookup in file=/usr/lib64/libpcre.so.1 [0] 305: symbol=g_module_unload; lookup in file=/lib64/libsystemd.so.0 [0] 305: symbol=g_module_unload; lookup in file=/usr/lib64/libsecret-storage.so.0 [0] 305: symbol=g_module_unload; lookup in file=/usr/lib64/libgmodule-2.0.so.0 [0] 305: symbol=g_module_unload; lookup in file=/usr/lib64/libgthread-2.0.so.0 [0] 305: symbol=g_module_unload; lookup in file=/usr/lib64/libglib-2.0.so.0 [0] 305: symbol=g_module_unload; lookup in file=/usr/lib64/libssl.so.3 [0] 305: symbol=g_module_unload; lookup in file=/usr/lib64/libcrypto.so.3 [0] 305: symbol=g_module_unload; lookup in file=/lib64/libc.so.6 [0] 305: symbol=g_module_unload; lookup in file=/lib64/ld64.so.2 [0] 305: symbol=g_module_unload; lookup in file=/lib64/libcap.so.2 [0] 305: symbol=g_module_unload; lookup in file=/usr/lib64/libzstd.so.1 [0] 305: symbol=g_module_unload; lookup in file=/usr/lib64/libpcre2-8.so.0 [0] 305: /usr/lib64/syslog-ng/libappmodel.so: error: symbol lookup error: undefined symbol: g_module_unload (fatal) 305: symbol=module_info; lookup in file=/usr/lib64/syslog-ng/libappmodel.so [0] 305: 305: calling fini: /usr/lib64/syslog-ng/libappmodel.so [0] 305: 305: 305: file=/usr/lib64/syslog-ng/libappmodel.so [0]; destroying link map 305: 305: 305: file=/usr/lib64/syslog-ng/libappmodel.so [0]; destroying link map 305: 305: calling fini: syslog-ng [0] 305: 305: 305: calling fini: /usr/lib64/libsyslog-ng-3.38.so.0 [0] 305: 305: 305: calling fini: /lib64/libm.so.6 [0] 305: 305: 305: calling fini: /lib64/librt.so.1 [0] 305: 305: 305: calling fini: /usr/lib64/libevtlog-3.38.so.0 [0] 305: 305: 305: calling fini: /usr/lib64/libpcre.so.1 [0] 305: 305: 305: calling fini: /lib64/libsystemd.so.0 [0] 305: 305: 305: calling fini: /usr/lib64/libsecret-storage.so.0 [0] 305: 305: 305: calling fini: /usr/lib64/libgmodule-2.0.so.0 [0] 305: 305: 305: calling fini: /usr/lib64/libgthread-2.0.so.0 [0] 305: 305: 305: calling fini: /usr/lib64/libglib-2.0.so.0 [0] 305: 305: 305: calling fini: /usr/lib64/libssl.so.3 [0] 305: 305: 305: calling fini: /usr/lib64/libcrypto.so.3 [0] 305: 305: 305: calling fini: /usr/lib64/libpcre2-8.so.0 [0] 305: 305: 305: calling fini: /lib64/libcap.so.2 [0] 305: 305: 305: calling fini: /usr/lib64/libzstd.so.1 [0] 305: The .fini contents: /usr/lib64/syslog-ng/librate-limit-filter.so: file format elf64-powerpcle Disassembly of section .fini: 0000000000004a4c <.fini>: 4a4c: 02 00 4c 3c addis r2,r12,2 4a50: b4 34 42 38 addi r2,r2,13492 4a54: a6 02 08 7c mflr r0 4a58: 10 00 01 f8 std r0,16(r1) 4a5c: a1 ff 21 f8 stdu r1,-96(r1) 4a60: 60 00 21 38 addi r1,r1,96 4a64: 10 00 01 e8 ld r0,16(r1) 4a68: a6 03 08 7c mtlr r0 4a6c: 20 00 80 4e blr root@qemuppc64:~# Reproduce steps: step 1: install necessary packages Use ubuntu as an example: sudo apt install gawk wget git diffstat unzip texinfo gcc build-essential chrpath socat cpio python3 python3-pip python3-pexpect xz-utils debianutils iputils-ping python3-git python3-jinja2 libegl1-mesa libsdl1.2-dev pylint3 xterm python3-subunit mesa-common-dev zstd liblz4-tool For other hosts, refer to https://docs.yoctoproject.org/ref-manual/system-requirements.html#required-packages-for-the-build-host step 2: clone repo, build image, start image via qemu mkdir workdir && cd workdir git clone git://git.yoctoproject.org/poky git clone git://git.openembedded.org/meta-openembedded . poky/oe-init-build-env build bitbake-layers add-layer ../meta-openembedded/meta-* cat >> conf/local.conf <<EOF MACHINE = "qemuppc64" IMAGE_INSTALL:append = " strace iproute2 kernel-modules syslog-ng gdb packagegroup-core-buildessential valgrind" IMAGE_FEATURES:append = " dbg-pkgs src-pkgs dev-pkgs" IMAGE_FEATURES:append = " package-management ssh-server-openssh" INIT_MANAGER:forcevariable = "systemd" PACKAGECONFIG:append:pn-systemd = " coredump" DEBUG_BUILD = "1" EOF bitbake core-image-minimal runqemu qemuppc64 nographic qemuparams="-m 4096 -smp 2" step 3: reproduce the issue on target method 1: systemctl status syslog-ng@default systemctl restart syslog-ng@default (run this and the issue is reproduced, if not, run it a few more times) method 2: syslog-ng -V (run this command and the issue is reproduced, if not, run it a few more times) Analysis that I have done: Except the above backtrace and LD_DEBUG testing, I also done one more testing which made me believe the bug is in glibc. I changed glibc to musl to check if it's the glibc's issue or not. The result is: when using musl, there's no segfault error. That's why I filed a bug here. related syslog-ng issue link: https://github.com/syslog-ng/syslog-ng/issues/4285
(In reply to ChenQi from comment #0) > Except the above backtrace and LD_DEBUG testing, I also done one more > testing which made me believe the bug is in glibc. The best bet here is to run with Valgrind and try to find the invalid access. > I changed glibc to musl to check if it's the glibc's issue or not. The > result is: when using musl, there's no segfault error. That's why I filed a > bug here. Musl's dynamic loader will never unload a library and calling dlclose() does not run the library desctructors. glibc's dynamic loader will attempt to run destructors and unload and unmap the library. Under glibc you must accurately account for any dangling references to the library after unloading. Thus I expect that this is related to the libraries you are loading and unloading and their specific constructors and destructors. I haven't seen any reports of this in Fedora 37, which uses glibc 2.36, and has syslog-ng compiled for ppc64le. > related syslog-ng issue link: > https://github.com/syslog-ng/syslog-ng/issues/4285 Thanks. I commented there. Next steps: - Please run your reproducer under valgrind and see if you can determine the root cause of the crash.
Created attachment 14646 [details] Output of 'valgrind --leak-check=full --show-leak-kinds=all syslog-ng -V' on qemuppc64 The attachment is the output of 'valgrind --leak-check=full --show-leak-kinds=all syslog-ng -V' on qemuppc64
Created attachment 14647 [details] Output of 'valgrind --leak-check=full --show-leak-kinds=all syslog-ng -V' on qemux86-64 The attachment is the output of 'valgrind --leak-check=full --show-leak-kinds=all syslog-ng -V'.
I don't see anything obviously bad in the valgrind logs. This may be a race condition in the DSOs you load and the data that they access, or in the threads they create, and later join. There isn't any conclusive proof here that we have a dyanmic loader race. Please update this when you have more information from debugging the reproducer. I asked in the github issue, and I'll ask here for the record, what hardware are you using?
I'm running it on qemuppc64. I'll create a simple script and try to reproduce the issue under valgrind.
(In reply to ChenQi from comment #5) > I'm running it on qemuppc64. I'll create a simple script and try to > reproduce the issue under valgrind. Could you please expand on this? Does this mean you're running qemu-system-ppc64 on x86_64 hardware? Or are you running qemu-kvm on POWER hardware? Just for clarity, all the Fedora uses I quoted are running virtualized with qemu, but on real ppc64le hardware.
(In reply to Carlos O'Donell from comment #6) > (In reply to ChenQi from comment #5) > > I'm running it on qemuppc64. I'll create a simple script and try to > > reproduce the issue under valgrind. > > Could you please expand on this? Does this mean you're running > qemu-system-ppc64 on x86_64 hardware? Or are you running qemu-kvm on POWER > hardware? Just for clarity, all the Fedora uses I quoted are running > virtualized with qemu, but on real ppc64le hardware. Yes, I'm running qemu-system-ppc64 on x86-64 hardware. The OS is Ubuntu18.04. The qemu commandline is: qemu-system-ppc64 -device virtio-net-pci,netdev=net0,mac=52:54:00:12:35:06 -netdev user,id=net0,hostfwd=tcp:127.0.0.1:2224-:22,hostfwd=tcp:127.0.0.1:2325-:23,tftp=/buildarea2/chenqi/builds/build-poky-sysvinit/tmp/deploy/images/qemuppc64 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -drive file=/buildarea2/chenqi/builds/build-poky-sysvinit/tmp/deploy/images/qemuppc64/core-image-minimal-qemuppc64-20230201020834.rootfs.ext4,if=virtio,format=raw -usb -device usb-tablet -machine pseries -cpu POWER9 -smp 2 -m 4096 -m 4096 -smp 2 -serial mon:stdio -serial null -nographic -kernel /buildarea2/chenqi/builds/build-poky-sysvinit/tmp/deploy/images/qemuppc64/vmlinux -append root=/dev/vda rw mem=4096M ip=dhcp console=hvc0 console=hvc0 nohugevmalloc
(In reply to ChenQi from comment #7) > Yes, I'm running qemu-system-ppc64 on x86-64 hardware. Thanks. So aside from a syslog-ng (and associated DSOs), kernel or glibc bug... there could also be a POWER9 emulation bug in qemu-system. Segfaulting in any DSO is odd (does point to a common-mode fault), and something you'll have to track down by debugging the exact faulting instruction and what causes it to fault, why and which mappings are involved in that memory region.