Bug 30062 - glibc 2.36 loader has intermittent segfault error on ppc64le
Summary: glibc 2.36 loader has intermittent segfault error on ppc64le
Status: NEW
Alias: None
Product: glibc
Classification: Unclassified
Component: dynamic-link (show other bugs)
Version: 2.36
: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-01-31 05:45 UTC by ChenQi
Modified: 2023-02-02 04:11 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed: 2023-01-31 00:00:00


Attachments
Output of 'valgrind --leak-check=full --show-leak-kinds=all syslog-ng -V' on qemuppc64 (2.64 KB, text/x-log)
2023-02-01 04:46 UTC, ChenQi
Details
Output of 'valgrind --leak-check=full --show-leak-kinds=all syslog-ng -V' on qemux86-64 (2.67 KB, text/x-log)
2023-02-01 04:48 UTC, ChenQi
Details

Note You need to log in before you can comment on or make changes to this bug.
Description ChenQi 2023-01-31 05:45:06 UTC
Summary:
syslog-ng loads and then unloads a few shared libraries at runtime, and on ppc64le, we have intermittent segfault error. I enabled LD_DEBUG and the last few line in log is: calling fini: /usr/lib64/syslog-ng/librate-limit-filter.so [0]. More detailed bug report are as below.


glibc version: 2.36 (the latest release/2.36/master has the same issue)

build host: x86-64; target: ppc64le

options given to configure (except a few cross-compilation specific options): 
--enable-kernel=3.10.0 \
--disable-profile \
--disable-debug \
--without-gd \
--enable-clocale=gnu \
--without-selinux   \
--enable-tunables  \
--enable-bind-now  \
--enable-stack-protector=strong  \
--disable-crypt    \
--with-default-link  \
--disable-werror     \
--with-cpu=power9  \
--disable-static \
--enable-memory-tagging \
--enable-nscd

kernel: 6.1.5

gcc version: 12.2.0

More detailed description:
Running syslog-ng on qemuppc64 gives us intermittent segfault error.
The easiest way to reproduce the bug is running `syslog-ng -V' a few times.
I checked the source codes of syslog-ng, what it does is doing loading & unloading of shared libraries under /usr/lib64/syslog-ng to get some basic information of the libraries. What it really invokes is 'g_module_open' and 'g_module_close'. Below are the error messages:
[ 41.117221] syslog-ng[187]: segfault (11) at 7fffcb0cd9a0 nip 7fffcb0cd9a0 lr 7fffcb0cd9a0 code 2
[ 41.118589] syslog-ng[187]: code: cb0cd9a0 00007fff 24d0cca0 00000001 b41b3200 00007fff b539b770 00007fff
[ 41.118734] syslog-ng[187]: code: 00000001 00000000 24d0bd10 00000001 00007fff 24d24370 00000001
Segmentation fault (core dumped)

backtrace (note that it could segfault on any .so file, the libkvformat.so is just one example):
Core was generated by `/usr/sbin/syslog-ng -F --enable-core --cfgfile /etc/syslog-ng/syslog-ng.conf --'.
Program terminated with signal SIGSEGV, Segmentation fault.
--Type for more, q to quit, c to continue without paging--
#0 0x00007fff9ace34c8 in __do_global_dtors_aux ()
from /usr/lib64/syslog-ng/libkvformat.so
(gdb) bt
#0 0x00007fff9ace34c8 in __do_global_dtors_aux ()
from /usr/lib64/syslog-ng/libkvformat.so
#1 0x00007fff9ace34b8 in __do_global_dtors_aux ()
from /usr/lib64/syslog-ng/libkvformat.so
#2 0x00007fff9ba401d4 in __GI__dl_catch_exception (exception=0x7fff9acff458,
operate=0x0, args=0x11cc50ff0) at dl-error-skeleton.c:182
#3 0x00007fff9be12d3c in _dl_close_worker (force=false, map=)
at dl-close.c:292
#4 _dl_close (_map=) at dl-close.c:818
#5 0x00007fff9ba40174 in __GI__dl_catch_exception (
exception=exception@entry=0x7ffff25d7230,
operate=0x7fff9be12930 <_dl_close>, args=0x11cc50ff0)
at dl-error-skeleton.c:208
#6 0x00007fff9ba4025c in __GI__dl_catch_error (objname=0x7ffff25d72b8,
errstring=0x7ffff25d72b0, mallocedp=0x7ffff25d72c0,
operate=, args=) at dl-error-skeleton.c:227
#7 0x00007fff9be41888 in _rtld_catch_error (objname=,
errstring=, mallocedp=,
operate=, args=) at dl-error-skeleton.c:260
#8 0x00007fff9b928378 in _dlerror_run (operate=,
args=) at dlerror.c:138
#9 0x00007fff9b927f3c in __dlclose (handle=) at dlclose.c:31
#10 0x00007fff9b4f18f8 in _g_module_close (handle=)
--Type for more, q to quit, c to continue without paging--
at /usr/src/debug/glib-2.0/1_2.74.4-r0/gmodule/gmodule-dl.c:190
#11 g_module_close (module=0x11cc4e8a0)
at /usr/src/debug/glib-2.0/1_2.74.4-r0/gmodule/gmodule.c:753
#12 0x00007fff9bcf6350 in plugin_discover_candidate_modules (
context=0x11cc46630) at /usr/src/debug/syslog-ng/3.38.1-r0/lib/plugin.c:532
#13 0x00007fff9bcc6828 in cfg_discover_candidate_modules (self=)
at /usr/src/debug/syslog-ng/3.38.1-r0/lib/cfg.c:208
#14 cfg_discover_candidate_modules (self=)
at /usr/src/debug/syslog-ng/3.38.1-r0/lib/cfg.c:203
#15 0x00007fff9bcc787c in cfg_read_config (self=,
fname=, preprocess_into=)
at /usr/src/debug/syslog-ng/3.38.1-r0/lib/cfg.c:655
#16 0x00007fff9bceb2b0 in main_loop_read_and_init_config (
self=0x7fff9bdc35b0 <main_loop>)
at /usr/src/debug/syslog-ng/3.38.1-r0/lib/mainloop.c:618
#17 0x000000011cbe21e0 in main (argc=, argv=)
at /usr/src/debug/syslog-ng/3.38.1-r0/syslog-ng/main.c:284

The last few lines of the LD_DEBUG_OUTPUT when segfault happens after LD_DEBUG is enabled (as said above, it could segfault on any .so file, this time it's librate-limit-filter.so:
306: symbol=g_module_check_init; lookup in file=/usr/lib64/libssl.so.3 [0]
306: symbol=g_module_check_init; lookup in file=/usr/lib64/libcrypto.so.3 [0]
306: symbol=g_module_check_init; lookup in file=/lib64/libc.so.6 [0]
306: symbol=g_module_check_init; lookup in file=/lib64/ld64.so.2 [0]
306: symbol=g_module_check_init; lookup in file=/lib64/libcap.so.2 [0]
306: symbol=g_module_check_init; lookup in file=/usr/lib64/libzstd.so.1 [0]
306: symbol=g_module_check_init; lookup in file=/usr/lib64/libpcre2-8.so.0 [0]
306: /usr/lib64/syslog-ng/librate-limit-filter.so: error: symbol lookup error: undefined symbol: g_module_check_init (fatal)
306: symbol=g_module_unload; lookup in file=/usr/lib64/syslog-ng/librate-limit-filter.so [0]
306: symbol=g_module_unload; lookup in file=/usr/lib64/libsyslog-ng-3.38.so.0 [0]
306: symbol=g_module_unload; lookup in file=/lib64/libm.so.6 [0]
306: symbol=g_module_unload; lookup in file=/lib64/librt.so.1 [0]
306: symbol=g_module_unload; lookup in file=/usr/lib64/libevtlog-3.38.so.0 [0]
306: symbol=g_module_unload; lookup in file=/usr/lib64/libpcre.so.1 [0]
306: symbol=g_module_unload; lookup in file=/lib64/libsystemd.so.0 [0]
306: symbol=g_module_unload; lookup in file=/usr/lib64/libsecret-storage.so.0 [0]
306: symbol=g_module_unload; lookup in file=/usr/lib64/libgmodule-2.0.so.0 [0]
306: symbol=g_module_unload; lookup in file=/usr/lib64/libgthread-2.0.so.0 [0]
306: symbol=g_module_unload; lookup in file=/usr/lib64/libglib-2.0.so.0 [0]
306: symbol=g_module_unload; lookup in file=/usr/lib64/libssl.so.3 [0]
306: symbol=g_module_unload; lookup in file=/usr/lib64/libcrypto.so.3 [0]
306: symbol=g_module_unload; lookup in file=/lib64/libc.so.6 [0]
306: symbol=g_module_unload; lookup in file=/lib64/ld64.so.2 [0]
306: symbol=g_module_unload; lookup in file=/lib64/libcap.so.2 [0]
306: symbol=g_module_unload; lookup in file=/usr/lib64/libzstd.so.1 [0]
306: symbol=g_module_unload; lookup in file=/usr/lib64/libpcre2-8.so.0 [0]
306: /usr/lib64/syslog-ng/librate-limit-filter.so: error: symbol lookup error: undefined symbol: g_module_unload (fatal)
306: symbol=module_info; lookup in file=/usr/lib64/syslog-ng/librate-limit-filter.so [0]
306:
306: calling fini: /usr/lib64/syslog-ng/librate-limit-filter.so [0]
306:

As a comparison, here are a few lines when no segfault happens:
305: symbol=g_module_check_init; lookup in file=/usr/lib64/libglib-2.0.so.0 [0]
305: symbol=g_module_check_init; lookup in file=/usr/lib64/libssl.so.3 [0]
305: symbol=g_module_check_init; lookup in file=/usr/lib64/libcrypto.so.3 [0]
305: symbol=g_module_check_init; lookup in file=/lib64/libc.so.6 [0]
305: symbol=g_module_check_init; lookup in file=/lib64/ld64.so.2 [0]
305: symbol=g_module_check_init; lookup in file=/lib64/libcap.so.2 [0]
305: symbol=g_module_check_init; lookup in file=/usr/lib64/libzstd.so.1 [0]
305: symbol=g_module_check_init; lookup in file=/usr/lib64/libpcre2-8.so.0 [0]
305: /usr/lib64/syslog-ng/libappmodel.so: error: symbol lookup error: undefined symbol: g_module_check_init (fatal)
305: symbol=g_module_unload; lookup in file=/usr/lib64/syslog-ng/libappmodel.so [0]
305: symbol=g_module_unload; lookup in file=/usr/lib64/libsyslog-ng-3.38.so.0 [0]
305: symbol=g_module_unload; lookup in file=/lib64/libm.so.6 [0]
305: symbol=g_module_unload; lookup in file=/lib64/librt.so.1 [0]
305: symbol=g_module_unload; lookup in file=/usr/lib64/libevtlog-3.38.so.0 [0]
305: symbol=g_module_unload; lookup in file=/usr/lib64/libpcre.so.1 [0]
305: symbol=g_module_unload; lookup in file=/lib64/libsystemd.so.0 [0]
305: symbol=g_module_unload; lookup in file=/usr/lib64/libsecret-storage.so.0 [0]
305: symbol=g_module_unload; lookup in file=/usr/lib64/libgmodule-2.0.so.0 [0]
305: symbol=g_module_unload; lookup in file=/usr/lib64/libgthread-2.0.so.0 [0]
305: symbol=g_module_unload; lookup in file=/usr/lib64/libglib-2.0.so.0 [0]
305: symbol=g_module_unload; lookup in file=/usr/lib64/libssl.so.3 [0]
305: symbol=g_module_unload; lookup in file=/usr/lib64/libcrypto.so.3 [0]
305: symbol=g_module_unload; lookup in file=/lib64/libc.so.6 [0]
305: symbol=g_module_unload; lookup in file=/lib64/ld64.so.2 [0]
305: symbol=g_module_unload; lookup in file=/lib64/libcap.so.2 [0]
305: symbol=g_module_unload; lookup in file=/usr/lib64/libzstd.so.1 [0]
305: symbol=g_module_unload; lookup in file=/usr/lib64/libpcre2-8.so.0 [0]
305: /usr/lib64/syslog-ng/libappmodel.so: error: symbol lookup error: undefined symbol: g_module_unload (fatal)
305: symbol=module_info; lookup in file=/usr/lib64/syslog-ng/libappmodel.so [0]
305:
305: calling fini: /usr/lib64/syslog-ng/libappmodel.so [0]
305:
305:
305: file=/usr/lib64/syslog-ng/libappmodel.so [0]; destroying link map
305:
305:
305: file=/usr/lib64/syslog-ng/libappmodel.so [0]; destroying link map
305:
305: calling fini: syslog-ng [0]
305:
305:
305: calling fini: /usr/lib64/libsyslog-ng-3.38.so.0 [0]
305:
305:
305: calling fini: /lib64/libm.so.6 [0]
305:
305:
305: calling fini: /lib64/librt.so.1 [0]
305:
305:
305: calling fini: /usr/lib64/libevtlog-3.38.so.0 [0]
305:
305:
305: calling fini: /usr/lib64/libpcre.so.1 [0]
305:
305:
305: calling fini: /lib64/libsystemd.so.0 [0]
305:
305:
305: calling fini: /usr/lib64/libsecret-storage.so.0 [0]
305:
305:
305: calling fini: /usr/lib64/libgmodule-2.0.so.0 [0]
305:
305:
305: calling fini: /usr/lib64/libgthread-2.0.so.0 [0]
305:
305:
305: calling fini: /usr/lib64/libglib-2.0.so.0 [0]
305:
305:
305: calling fini: /usr/lib64/libssl.so.3 [0]
305:
305:
305: calling fini: /usr/lib64/libcrypto.so.3 [0]
305:
305:
305: calling fini: /usr/lib64/libpcre2-8.so.0 [0]
305:
305:
305: calling fini: /lib64/libcap.so.2 [0]
305:
305:
305: calling fini: /usr/lib64/libzstd.so.1 [0]
305:


The .fini contents:
/usr/lib64/syslog-ng/librate-limit-filter.so:     file format elf64-powerpcle


Disassembly of section .fini:

0000000000004a4c <.fini>:
    4a4c:       02 00 4c 3c     addis   r2,r12,2
    4a50:       b4 34 42 38     addi    r2,r2,13492
    4a54:       a6 02 08 7c     mflr    r0
    4a58:       10 00 01 f8     std     r0,16(r1)
    4a5c:       a1 ff 21 f8     stdu    r1,-96(r1)
    4a60:       60 00 21 38     addi    r1,r1,96
    4a64:       10 00 01 e8     ld      r0,16(r1)
    4a68:       a6 03 08 7c     mtlr    r0
    4a6c:       20 00 80 4e     blr
root@qemuppc64:~#


Reproduce steps:
step 1: install necessary packages
Use ubuntu as an example:
sudo apt install gawk wget git diffstat unzip texinfo gcc build-essential chrpath socat cpio
python3 python3-pip python3-pexpect xz-utils debianutils iputils-ping python3-git
python3-jinja2 libegl1-mesa libsdl1.2-dev pylint3 xterm python3-subunit mesa-common-dev zstd liblz4-tool

For other hosts, refer to https://docs.yoctoproject.org/ref-manual/system-requirements.html#required-packages-for-the-build-host

step 2: clone repo, build image, start image via qemu
mkdir workdir && cd workdir
git clone git://git.yoctoproject.org/poky
git clone git://git.openembedded.org/meta-openembedded
. poky/oe-init-build-env build
bitbake-layers add-layer ../meta-openembedded/meta-*
cat >> conf/local.conf <<EOF
MACHINE = "qemuppc64"
IMAGE_INSTALL:append = " strace iproute2 kernel-modules syslog-ng gdb packagegroup-core-buildessential valgrind"
IMAGE_FEATURES:append = " dbg-pkgs src-pkgs dev-pkgs"
IMAGE_FEATURES:append = " package-management ssh-server-openssh"
INIT_MANAGER:forcevariable = "systemd"
PACKAGECONFIG:append:pn-systemd = " coredump"
DEBUG_BUILD = "1"
EOF
bitbake core-image-minimal
runqemu qemuppc64 nographic qemuparams="-m 4096 -smp 2"

step 3: reproduce the issue on target
method 1:
systemctl status syslog-ng@default
systemctl restart syslog-ng@default (run this and the issue is reproduced, if not, run it a few more times)

method 2:
syslog-ng -V (run this command and the issue is reproduced, if not, run it a few more times)

Analysis that I have done:
Except the above backtrace and LD_DEBUG testing, I also done one more testing which made me believe the bug is in glibc.
I changed glibc to musl to check if it's the glibc's issue or not. The result is: when using musl, there's no segfault error. That's why I filed a bug here.


related syslog-ng issue link:
https://github.com/syslog-ng/syslog-ng/issues/4285
Comment 1 Carlos O'Donell 2023-01-31 20:56:51 UTC
(In reply to ChenQi from comment #0)
> Except the above backtrace and LD_DEBUG testing, I also done one more
> testing which made me believe the bug is in glibc.

The best bet here is to run with Valgrind and try to find the invalid access.

> I changed glibc to musl to check if it's the glibc's issue or not. The
> result is: when using musl, there's no segfault error. That's why I filed a
> bug here.
 
Musl's dynamic loader will never unload a library and calling dlclose() does not run the library desctructors.

glibc's dynamic loader will attempt to run destructors and unload and unmap the library.

Under glibc you must accurately account for any dangling references to the library after unloading.

Thus I expect that this is related to the libraries you are loading and unloading and their specific constructors and destructors.

I haven't seen any reports of this in Fedora 37, which uses glibc 2.36, and has syslog-ng compiled for ppc64le.
 
> related syslog-ng issue link:
> https://github.com/syslog-ng/syslog-ng/issues/4285

Thanks. I commented there.

Next steps:
- Please run your reproducer under valgrind and see if you can determine the root cause of the crash.
Comment 2 ChenQi 2023-02-01 04:46:50 UTC
Created attachment 14646 [details]
Output of 'valgrind --leak-check=full --show-leak-kinds=all syslog-ng -V' on qemuppc64

The attachment is the output of 'valgrind --leak-check=full --show-leak-kinds=all syslog-ng -V' on qemuppc64
Comment 3 ChenQi 2023-02-01 04:48:18 UTC
Created attachment 14647 [details]
Output of 'valgrind --leak-check=full --show-leak-kinds=all syslog-ng -V' on qemux86-64

The attachment is the output of 'valgrind --leak-check=full --show-leak-kinds=all syslog-ng -V'.
Comment 4 Carlos O'Donell 2023-02-01 16:40:59 UTC
I don't see anything obviously bad in the valgrind logs.

This may be a race condition in the DSOs you load and the data that they access, or in the threads they create, and later join.

There isn't any conclusive proof here that we have a dyanmic loader race.

Please update this when you have more information from debugging the reproducer.

I asked in the github issue, and I'll ask here for the record, what hardware are you using?
Comment 5 ChenQi 2023-02-02 02:18:16 UTC
I'm running it on qemuppc64. I'll create a simple script and try to reproduce the issue under valgrind.
Comment 6 Carlos O'Donell 2023-02-02 02:49:31 UTC
(In reply to ChenQi from comment #5)
> I'm running it on qemuppc64. I'll create a simple script and try to
> reproduce the issue under valgrind.

Could you please expand on this? Does this mean you're running qemu-system-ppc64 on x86_64 hardware? Or are you running qemu-kvm on POWER hardware? Just for clarity, all the Fedora uses I quoted are running virtualized with qemu, but on real ppc64le hardware.
Comment 7 ChenQi 2023-02-02 02:53:10 UTC
(In reply to Carlos O'Donell from comment #6)
> (In reply to ChenQi from comment #5)
> > I'm running it on qemuppc64. I'll create a simple script and try to
> > reproduce the issue under valgrind.
> 
> Could you please expand on this? Does this mean you're running
> qemu-system-ppc64 on x86_64 hardware? Or are you running qemu-kvm on POWER
> hardware? Just for clarity, all the Fedora uses I quoted are running
> virtualized with qemu, but on real ppc64le hardware.

Yes, I'm running qemu-system-ppc64 on x86-64 hardware. The OS is Ubuntu18.04.
The qemu commandline is:
qemu-system-ppc64 -device virtio-net-pci,netdev=net0,mac=52:54:00:12:35:06 -netdev user,id=net0,hostfwd=tcp:127.0.0.1:2224-:22,hostfwd=tcp:127.0.0.1:2325-:23,tftp=/buildarea2/chenqi/builds/build-poky-sysvinit/tmp/deploy/images/qemuppc64 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -drive file=/buildarea2/chenqi/builds/build-poky-sysvinit/tmp/deploy/images/qemuppc64/core-image-minimal-qemuppc64-20230201020834.rootfs.ext4,if=virtio,format=raw -usb -device usb-tablet -machine pseries -cpu POWER9 -smp 2 -m 4096 -m 4096 -smp 2 -serial mon:stdio -serial null -nographic -kernel /buildarea2/chenqi/builds/build-poky-sysvinit/tmp/deploy/images/qemuppc64/vmlinux -append root=/dev/vda rw  mem=4096M ip=dhcp console=hvc0 console=hvc0 nohugevmalloc
Comment 8 Carlos O'Donell 2023-02-02 04:11:31 UTC
(In reply to ChenQi from comment #7)
> Yes, I'm running qemu-system-ppc64 on x86-64 hardware.

Thanks. So aside from a syslog-ng (and associated DSOs), kernel or glibc bug... there could also be a POWER9 emulation bug in qemu-system.

Segfaulting in any DSO is odd (does point to a common-mode fault), and something you'll have to track down by debugging the exact faulting instruction and what causes it to fault, why and which mappings are involved in that memory region.