We have a .so file that cannot be loaded on a certain machine due to missing dependencies. In this case, the first dlopen() correctly returns NULL. However, the module stays in the address space, and a second dlopen() returns non-NULL. Calling functions in the module then results in a segfault.

Here's a short test program:

#include <stdlib.h>
#include <stdio.h>
#include <dlfcn.h>

/*
  build: gcc -Wall dlopen.c -ldl -o dlopen
  run:   LD_BIND_NOW=1 ./dlopen ./foo.so
*/

int main(int argc, char **argv)
{
    void *lib;

    lib = dlopen(argv[1], RTLD_LAZY);
    fprintf(stderr, "Lib: %p\n", lib);

    lib = dlopen(argv[1], RTLD_LAZY);
    fprintf(stderr, "Lib: %p\n", lib);

    return 0;
}

Execution example:

$ LD_BIND_NOW=1 ./dlopen ./module-alsa-sink.so
Lib: (nil)
Lib: 0x1295030

The .so file is admittedly bad, but it seems strange that dlopen() should report success the second time.

Output from ldd:

$ ldd -r ./module-alsa-sink.so
        linux-vdso.so.1 => (0x00007fff8ffff000)
        libpulsecore-UNKNOWN.UNKNOWN.so => /home/astrand/ctc/client/pulseaudio-new/src/.libs/libpulsecore-UNKNOWN.UNKNOWN.so (0x00007f324db90000)
        libpulsecommon-UNKNOWN.UNKNOWN.so => /home/astrand/ctc/client/pulseaudio-new/src/.libs/libpulsecommon-UNKNOWN.UNKNOWN.so (0x00007f324d8c2000)
        libpulse.so.0 => /home/astrand/ctc/client/pulseaudio-new/src/.libs/libpulse.so.0 (0x00007f324d665000)
        libalsa-util.so => /home/astrand/ctc/client/pulseaudio-new/src/.libs/libalsa-util.so (0x00007f324d420000)
        libasound.so.2 => /lib64/libasound.so.2 (0x00007f324d120000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f324cf03000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f324ccfb000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f324caf6000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f324c872000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f324c4df000)
        /lib64/ld-linux-x86-64.so.2 (0x00000035ee800000)
symbol snd_pcm_hw_params_can_disable_period_wakeup, version ALSA_0.9 not defined in file libasound.so.2 with link time reference (/home/astrand/ctc/client/pulseaudio-new/src/.libs/libalsa-util.so)
symbol snd_pcm_hw_params_set_period_wakeup, version ALSA_0.9 not defined in file libasound.so.2 with link time reference (/home/astrand/ctc/client/pulseaudio-new/src/.libs/libalsa-util.so)
symbol snd_pcm_hw_params_get_period_wakeup, version ALSA_0.9 not defined in file libasound.so.2 with link time reference (/home/astrand/ctc/client/pulseaudio-new/src/.libs/libalsa-util.so)

The system is CentOS 6 x86_64 with all updates installed.
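For anyone reproducing this, here is a minimal sketch of a variant of the test program above that also captures the dlerror() text after each attempt, so the disagreement between the two calls is visible directly. The names `open_result`, `try_open_twice` and `attempts_disagree` are hypothetical and not part of the original test program.

```c
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper type: outcome of two back-to-back dlopen() calls. */
struct open_result {
    void *first;             /* handle from the first dlopen(), or NULL  */
    void *second;            /* handle from the second dlopen(), or NULL */
    char first_error[256];   /* dlerror() text after the first call, "" if none  */
    char second_error[256];  /* dlerror() text after the second call, "" if none */
};

/* Copy the pending dlerror() message (if any) into dst; this also clears it. */
static void save_error(char *dst, size_t len)
{
    const char *msg = dlerror();
    snprintf(dst, len, "%s", msg ? msg : "");
}

/* Open the same path twice with the same flags and record both outcomes.
   Handles are intentionally not closed: the point is only to compare results. */
struct open_result try_open_twice(const char *path, int flags)
{
    struct open_result r;
    memset(&r, 0, sizeof r);

    dlerror();                                  /* discard any stale error */
    r.first = dlopen(path, flags);
    save_error(r.first_error, sizeof r.first_error);

    r.second = dlopen(path, flags);
    save_error(r.second_error, sizeof r.second_error);
    return r;
}

/* Non-zero when one attempt failed and the other succeeded. */
int attempts_disagree(const struct open_result *r)
{
    return (r->first == NULL) != (r->second == NULL);
}
```

To mirror the report, this would be called with the module path and RTLD_LAZY while LD_BIND_NOW=1 is set in the environment; on an affected system `attempts_disagree()` should then return non-zero.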
There is something special about this specific .so file that triggers the issue. We've constructed a simple test case with the same scenario, and the problem doesn't happen there. The failed module is properly unloaded from the address space, and every dlopen() behaves the same.

Running with LD_DEBUG=all, there is a difference between the two:

Test case (OK):

9339: symbol=library_function; lookup in file=./program [0]
9339: symbol=library_function; lookup in file=/lib64/libdl.so.2 [0]
9339: symbol=library_function; lookup in file=/lib64/libc.so.6 [0]
9339: symbol=library_function; lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
9339: symbol=library_function; lookup in file=./module.so [0]
9339: symbol=library_function; lookup in file=./library.so [0]
9339: symbol=library_function; lookup in file=/lib64/libc.so.6 [0]
9339: symbol=library_function; lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
9339: ./module.so: error: symbol lookup error: undefined symbol: library_function (fatal)
9339:
9339: file=./module.so [0]; destroying link map
9339:
9339: file=./library.so [0]; destroying link map

Real case (fail):

16829: symbol=snd_pcm_hw_params_can_disable_period_wakeup; lookup in file=./dlopen [0]
16829: symbol=snd_pcm_hw_params_can_disable_period_wakeup; lookup in file=/lib64/libdl.so.2 [0]
16829: symbol=snd_pcm_hw_params_can_disable_period_wakeup; lookup in file=/lib64/libc.so.6 [0]
16829: symbol=snd_pcm_hw_params_can_disable_period_wakeup; lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
16829: symbol=snd_pcm_hw_params_can_disable_period_wakeup; lookup in file=/home/astrand/ctc/client/pulseaudio-new/src/.libs/module-alsa-sink.so [0]
16829: symbol=snd_pcm_hw_params_can_disable_period_wakeup; lookup in file=/home/astrand/ctc/client/pulseaudio-new/src/.libs/libpulsecore-UNKNOWN.UNKNOWN.so [0]
16829: symbol=snd_pcm_hw_params_can_disable_period_wakeup; lookup in file=/home/astrand/ctc/client/pulseaudio-new/src/.libs/libpulsecommon-UNKNOWN.UNKNOWN.so [0]
16829: symbol=snd_pcm_hw_params_can_disable_period_wakeup; lookup in file=/home/astrand/ctc/client/pulseaudio-new/src/.libs/libpulse.so.0 [0]
16829: symbol=snd_pcm_hw_params_can_disable_period_wakeup; lookup in file=/home/astrand/ctc/client/pulseaudio-new/src/.libs/libalsa-util.so [0]
16829: symbol=snd_pcm_hw_params_can_disable_period_wakeup; lookup in file=/lib64/libasound.so.2 [0]
16829: /home/astrand/ctc/client/pulseaudio-new/src/.libs/libalsa-util.so: error: relocation error: symbol snd_pcm_hw_params_can_disable_period_wakeup, version ALSA_0.9 not defined in file libasound.so.2 with link time reference (fatal)

There is no "destroying link map" for the failing case.
Problem identified. The .so file that confuses glibc is marked as NODELETE, which interferes with the unloading. So the fix should probably be to make sure NODELETE isn't honored for files that haven't been fully loaded yet.
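To check whether a given .so carries this marking without loading it, something like the following sketch could be used. It assumes that "marked as NODELETE" refers to the DF_1_NODELETE bit in the DT_FLAGS_1 entry of the dynamic section (the bit the linker sets for -z nodelete), and that the file is an ELF64 object in the host's byte order. The function names are hypothetical.

```c
#include <elf.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Scan an in-memory ELF64 image for DF_1_NODELETE in DT_FLAGS_1.
   Returns 1 if set, 0 if not set, -1 if the image is not a usable ELF64 file.
   Assumption: the file uses the host's byte order. */
static int scan_image(const unsigned char *base, size_t size)
{
    const Elf64_Ehdr *eh = (const Elf64_Ehdr *)base;

    if (size < sizeof *eh ||
        memcmp(eh->e_ident, ELFMAG, SELFMAG) != 0 ||
        eh->e_ident[EI_CLASS] != ELFCLASS64 ||
        eh->e_phentsize != sizeof(Elf64_Phdr))
        return -1;

    /* The whole program header table must lie inside the file. */
    if (eh->e_phoff > size ||
        (size - eh->e_phoff) / sizeof(Elf64_Phdr) < eh->e_phnum)
        return -1;

    const Elf64_Phdr *ph = (const Elf64_Phdr *)(base + eh->e_phoff);
    for (size_t i = 0; i < eh->e_phnum; i++) {
        if (ph[i].p_type != PT_DYNAMIC)
            continue;
        if (ph[i].p_offset > size || ph[i].p_filesz > size - ph[i].p_offset)
            return -1;

        /* Walk the dynamic entries until DT_NULL or the end of the segment. */
        const Elf64_Dyn *dyn = (const Elf64_Dyn *)(base + ph[i].p_offset);
        size_t count = ph[i].p_filesz / sizeof(Elf64_Dyn);
        for (size_t j = 0; j < count && dyn[j].d_tag != DT_NULL; j++)
            if (dyn[j].d_tag == DT_FLAGS_1)
                return (dyn[j].d_un.d_val & DF_1_NODELETE) ? 1 : 0;
        return 0;               /* dynamic section without DT_FLAGS_1 */
    }
    return 0;                   /* no dynamic section at all */
}

/* Map the file read-only and scan it; -1 on any I/O or format error. */
int elf64_has_nodelete(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }

    size_t size = (size_t)st.st_size;
    void *base = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (base == MAP_FAILED)
        return -1;

    int result = scan_image(base, size);
    munmap(base, size);
    return result;
}
```

Only the file passed in is inspected, so for the case reported here each object in the dependency chain would have to be checked separately.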
Created attachment 6633 [details]
test case

Test case that provokes this bug. Example run:

~/devel/dlfail [ossman@ossman]$ ./program
Opening module.so (first attempt)... Result: OK
Getting function... Result: OK
Calling function...
./program: symbol lookup error: ./module.so: undefined symbol: library_function

~/devel/dlfail [ossman@ossman]$ LD_BIND_NOW=1 ./program
Opening module.so (first attempt)... Result: fail
Opening module.so (second attempt)... Result: OK
Getting function... Result: OK
Calling function...
Segmentation fault (core dumped)

The structure is:

program === dlopen() ==> module.so === dynlink ==> library.so

But library.so is constructed so that it lacks a symbol that module.so expects.
We've tracked down an issue with very similar symptoms that is likely related to the issue documented here. In our case, the library that cannot load, and that then gets stuck in a half-loaded state, does not have the NODELETE flag set but does contain an STB_GNU_UNIQUE symbol.

The issue only occurs when a library (1) contains an STB_GNU_UNIQUE symbol that has not yet been loaded/resolved from some other library; and (2) fails to load for some reason, such as with RTLD_NOW and unresolved references. Under these conditions, ld.so loads and resolves the STB_GNU_UNIQUE symbol as the library is loaded and notes that dependency. However, when the load ultimately fails with an unresolved symbol, the linkage to the STB_GNU_UNIQUE symbol is not undone, which then prevents the cleanup after the failed load from fully unloading the library.

I've attached a test case which provokes this failure. The wrinkle when STB_GNU_UNIQUE is involved is that if a definition of the same STB_GNU_UNIQUE symbol is already present in the process, the load will fail and will properly clean up the partially-loaded library. The test case I've attached illustrates this by comparing the behavior of two different load orders:

(1) h = dlopen("bad.so", RTLD_NOW | RTLD_GLOBAL); /* fails -- unresolved */
(2) h = dlopen("bad.so", RTLD_NOW | RTLD_GLOBAL | RTLD_NOLOAD); /* succeeds?!? */
(3) if (h) dlclose(h); /* assertion from within dlclose! */

vs.

(1) h_good = dlopen("good.so", RTLD_NOW | RTLD_GLOBAL); /* succeeds */
(2) h_bad = dlopen("bad.so", RTLD_NOW | RTLD_GLOBAL); /* fails -- unresolved */
(3) h_bad = dlopen("bad.so", RTLD_NOW | RTLD_GLOBAL | RTLD_NOLOAD); /* fails again */

Here both bad.so and good.so contain instances of the same STB_GNU_UNIQUE symbol.

I'm not sure whether this warrants a separate bug report or not.
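The first load order above can be wrapped in a small probe. This is only a sketch: the function name `stuck_after_failed_open` is hypothetical, and it relies on RTLD_NOLOAD returning a handle only for objects the dynamic linker already considers resident.

```c
#define _GNU_SOURCE   /* harmless here; RTLD_NOLOAD is a glibc extension */
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical probe for the half-loaded state described above.
   Returns  1 if dlopen(path) fails but the object is still reported resident,
            0 if dlopen(path) fails and nothing is left behind,
           -1 if dlopen(path) succeeds, so there is nothing to test. */
int stuck_after_failed_open(const char *path)
{
    void *h = dlopen(path, RTLD_NOW | RTLD_GLOBAL);
    if (h != NULL) {
        dlclose(h);
        return -1;
    }

    /* RTLD_NOLOAD never loads anything: a non-NULL result means the
       failed load above left the object registered with ld.so. */
    h = dlopen(path, RTLD_NOW | RTLD_GLOBAL | RTLD_NOLOAD);

    /* Deliberately no dlclose() on this handle: step (3) above reports
       an assertion from within dlclose() in exactly this situation. */
    return h != NULL ? 1 : 0;
}
```

With the attached test case, calling this on bad.so before good.so has been loaded would be expected to return 1 on an affected glibc, and 0 when good.so is loaded first.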
Created attachment 7547 [details] Illustrates ld.so confusion w/ STB_GNU_UNIQUE and failed dlopen
We have a further refinement of the observations of cnewbold in comment 4. In this case, we have been able to get code to execute that resides in a similarly not-quite-loaded file. We have also found that this occurs even with the RTLD_LOCAL flag.

In the attached example we create an executable and two .so files (sh -x output below):

+ gcc --version
gcc (GCC) 4.7.2
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ STDFLAG=-std=c++11
+ g++ -o objs/test_plugger -Wl,--export-dynamic -std=c++11 test_plugger.cxx -ldl
+ g++ -o objs/plugin0.so -shared -fPIC -std=c++11 plugin0.cxx
+ g++ -o objs/plugin1.so -shared -fPIC -std=c++11 plugin1.cxx

The executable will attempt to dlopen (with RTLD_NOW | RTLD_LOCAL) its arguments in order.

plugin1.so cannot be loaded due to a missing symbol, but contains an STB_GNU_UNIQUE symbol which happens to be a function pointer.

plugin0.so can be loaded, and contains an identically named STB_GNU_UNIQUE symbol. In a static initialization section it sets an extern function pointer in the host program to that symbol. However, the symbol resolves to the function pointer in plugin1.so, not the one in plugin0.so.

The host dereferences the pointer and calls it, which executes the definition from plugin1.so:

+ ./objs/test_plugger ./objs/plugin1.so ./objs/plugin0.so
test_plugger.cxx:14 loading ./objs/plugin1.so
test_plugger.cxx:16 failed! dlerror = ./objs/plugin1.so: undefined symbol: _Z21host_doesnt_have_thisv
test_plugger.cxx:14 loading ./objs/plugin0.so
plugin0.cxx:19 _init()
test_plugger.cxx:18 success!
plugin1.cxx:8 execute() # this should be plugin0.cxx

Source files are attached as standalone.tgz. The example can be compiled and run by

$ sh compile_and_run.sh
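One way to confirm from inside the host which object a function pointer actually lands in is dladdr(). The helper below is a sketch and not part of standalone.tgz; the name `owning_object` is hypothetical.

```c
#define _GNU_SOURCE   /* dladdr() and Dl_info are GNU extensions in glibc */
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper: return the path of the loaded object whose mapping
   contains addr, or NULL if no loaded object covers that address.
   The returned string is owned by the dynamic linker; do not free it. */
const char *owning_object(const void *addr)
{
    Dl_info info;

    if (addr == NULL || dladdr(addr, &info) == 0)
        return NULL;
    return info.dli_fname;
}
```

In the scenario above, passing the host's function pointer to this helper after plugin0.so has been loaded should name plugin1.so on an affected system, which matches the plugin1.cxx line in the output.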
Created attachment 7656 [details] test case "standalone.tgz"
Adding Florian Weimer to the CC since he fixed a few of the cases where objects that failed to load were not properly unloaded.
I believe we have fixed this issue under bug 25112, which went into glibc 2.31 (but some distributions have backported the relevant commit and related dynamic linker corrections into earlier versions they maintain). Does this issue persist?
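For test programs that want to note whether they are running on a glibc at or after the release mentioned above, a rough sketch follows. It compares the string from gnu_get_libc_version() against 2.31. As noted, distributions may have backported the fix, so a lower version number does not prove the bug is present. The helper names are hypothetical.

```c
#include <gnu/libc-version.h>
#include <stdio.h>

/* Hypothetical helper: non-zero if a "MAJOR.MINOR..." version string is
   at least major.minor. Returns 0 if the string cannot be parsed. */
int version_at_least(const char *version, int major, int minor)
{
    int have_major = 0, have_minor = 0;

    if (sscanf(version, "%d.%d", &have_major, &have_minor) != 2)
        return 0;
    return have_major > major ||
           (have_major == major && have_minor >= minor);
}

/* Non-zero if the running glibc reports version 2.31 or later.
   This says nothing about backported fixes in older distribution builds. */
int running_glibc_is_2_31_or_later(void)
{
    return version_at_least(gnu_get_libc_version(), 2, 31);
}
```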
Thanks, Florian. I tested standalone.tgz on Rocky Linux 9 (glibc 2.34) and it works as expected:

+ ./objs/test_plugger ./objs/plugin1.so ./objs/plugin0.so
test_plugger.cxx:14 loading ./objs/plugin1.so
test_plugger.cxx:16 failed! dlerror = ./objs/plugin1.so: undefined symbol: _Z21host_doesnt_have_thisv
test_plugger.cxx:14 loading ./objs/plugin0.so
plugin0.cxx:19 _init()
test_plugger.cxx:18 success!
plugin0.cxx:8 execute() # correctly executed code from plugin0
Resolving per comment 10. Thanks.

*** This bug has been marked as a duplicate of bug 25112 ***