Bug 14577 - dlopen does not unload failed module - second dlopen succeeds
Status: RESOLVED DUPLICATE of bug 25112
Alias: None
Product: glibc
Classification: Unclassified
Component: dynamic-link
Version: 2.12
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2012-09-13 15:11 UTC by Peter Åstrand
Modified: 2024-05-07 12:43 UTC
fweimer: security-


Attachments
test case (842 bytes, application/x-tar)
2012-09-14 10:15 UTC, Pierre Ossman
Illustrates ld.so confusion w/ STB_GNU_UNIQUE and failed dlopen (1.50 KB, application/x-tar)
2014-04-10 19:49 UTC, cnewbold
test case "standalone.tgz" (1021 bytes, application/x-gzip)
2014-06-24 16:24 UTC, Michael Fenn

Description Peter Åstrand 2012-09-13 15:11:00 UTC
We have a .so file that cannot be loaded on a certain machine because of missing dependencies. In this case, dlopen() correctly returns NULL. However, the module stays in the address space, and a second dlopen() returns non-NULL. Trying to call functions in the module then results in a segfault. Here's a short test program:

#include <stdlib.h>
#include <stdio.h>
#include <dlfcn.h>

/* 
   build:
   
   gcc -Wall dlopen.c -ldl -o dlopen 

   run:
   
   LD_BIND_NOW=1 ./dlopen ./foo.so 
*/

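int main(int argc, char **argv)
{
        void *lib;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <module.so>\n", argv[0]);
                return 1;
        }

        /* First attempt: expected to fail and return NULL. */
        lib = dlopen(argv[1], RTLD_LAZY);
        fprintf(stderr, "Lib: %p\n", lib);

        /* Second attempt: should fail the same way, but returns non-NULL. */
        lib = dlopen(argv[1], RTLD_LAZY);
        fprintf(stderr, "Lib: %p\n", lib);

        return 0;
}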

Execution example:


$ LD_BIND_NOW=1 ./dlopen ./module-alsa-sink.so 
Lib: (nil)
Lib: 0x1295030

The .so file is admittedly bad, but it seems strange that dlopen() should report success the second time.
Output from ldd:

$ ldd -r ./module-alsa-sink.so 
	linux-vdso.so.1 =>  (0x00007fff8ffff000)
	libpulsecore-UNKNOWN.UNKNOWN.so => /home/astrand/ctc/client/pulseaudio-new/src/.libs/libpulsecore-UNKNOWN.UNKNOWN.so (0x00007f324db90000)
	libpulsecommon-UNKNOWN.UNKNOWN.so => /home/astrand/ctc/client/pulseaudio-new/src/.libs/libpulsecommon-UNKNOWN.UNKNOWN.so (0x00007f324d8c2000)
	libpulse.so.0 => /home/astrand/ctc/client/pulseaudio-new/src/.libs/libpulse.so.0 (0x00007f324d665000)
	libalsa-util.so => /home/astrand/ctc/client/pulseaudio-new/src/.libs/libalsa-util.so (0x00007f324d420000)
	libasound.so.2 => /lib64/libasound.so.2 (0x00007f324d120000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f324cf03000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f324ccfb000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007f324caf6000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f324c872000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f324c4df000)
	/lib64/ld-linux-x86-64.so.2 (0x00000035ee800000)
symbol snd_pcm_hw_params_can_disable_period_wakeup, version ALSA_0.9 not defined in file libasound.so.2 with link time reference	(/home/astrand/ctc/client/pulseaudio-new/src/.libs/libalsa-util.so)
symbol snd_pcm_hw_params_set_period_wakeup, version ALSA_0.9 not defined in file libasound.so.2 with link time reference	(/home/astrand/ctc/client/pulseaudio-new/src/.libs/libalsa-util.so)
symbol snd_pcm_hw_params_get_period_wakeup, version ALSA_0.9 not defined in file libasound.so.2 with link time reference	(/home/astrand/ctc/client/pulseaudio-new/src/.libs/libalsa-util.so)

The system is a CentOS 6 x86_64 with all updates installed.
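
A variant of the test program that records why each attempt failed may make the inconsistency easier to spot. The sketch below is only an illustration: the helper name count_successful_opens is hypothetical, and it passes RTLD_NOW directly instead of relying on LD_BIND_NOW=1. On glibc older than 2.34 it needs to be linked with -ldl.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper (not part of the original report): tries
   dlopen(path, RTLD_NOW) `attempts` times and returns how many attempts
   succeeded. Each failure reason is printed via dlerror(). Handles from
   successful attempts are closed again right away. */
int count_successful_opens(const char *path, int attempts)
{
    int ok = 0;

    for (int i = 0; i < attempts; i++) {
        void *lib = dlopen(path, RTLD_NOW);

        if (lib == NULL) {
            const char *err = dlerror();
            fprintf(stderr, "attempt %d failed: %s\n",
                    i + 1, err ? err : "(no message)");
            continue;
        }

        fprintf(stderr, "attempt %d succeeded: %p\n", i + 1, lib);
        dlclose(lib);
        ok++;
    }

    return ok;
}
```

For a module with unresolved symbols, a loader that cleans up properly should fail every attempt in the same way, so count_successful_opens(path, 2) would be expected to return 0. A result of 1 would match the behaviour shown in the execution example above.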
Comment 1 Pierre Ossman 2012-09-14 08:13:34 UTC
There is something special about this specific .so file that triggers the issue. We've constructed a simple test case with the same scenario, and the problem doesn't happen there. The failed module is properly unloaded from the address space, and every dlopen() behaves the same.

Running with LD_DEBUG=all, there is a difference between the two:

Test case (OK):

      9339:	symbol=library_function;  lookup in file=./program [0]
      9339:	symbol=library_function;  lookup in file=/lib64/libdl.so.2 [0]
      9339:	symbol=library_function;  lookup in file=/lib64/libc.so.6 [0]
      9339:	symbol=library_function;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
      9339:	symbol=library_function;  lookup in file=./module.so [0]
      9339:	symbol=library_function;  lookup in file=./library.so [0]
      9339:	symbol=library_function;  lookup in file=/lib64/libc.so.6 [0]
      9339:	symbol=library_function;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
      9339:	./module.so: error: symbol lookup error: undefined symbol: library_function (fatal)
      9339:	
      9339:	file=./module.so [0];  destroying link map
      9339:	
      9339:	file=./library.so [0];  destroying link map

Real case (fail):

     16829:	symbol=snd_pcm_hw_params_can_disable_period_wakeup;  lookup in file=./dlopen [0]
     16829:	symbol=snd_pcm_hw_params_can_disable_period_wakeup;  lookup in file=/lib64/libdl.so.2 [0]
     16829:	symbol=snd_pcm_hw_params_can_disable_period_wakeup;  lookup in file=/lib64/libc.so.6 [0]
     16829:	symbol=snd_pcm_hw_params_can_disable_period_wakeup;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
     16829:	symbol=snd_pcm_hw_params_can_disable_period_wakeup;  lookup in file=/home/astrand/ctc/client/pulseaudio-new/src/.libs/module-alsa-sink.so [0]
     16829:	symbol=snd_pcm_hw_params_can_disable_period_wakeup;  lookup in file=/home/astrand/ctc/client/pulseaudio-new/src/.libs/libpulsecore-UNKNOWN.UNKNOWN.so [0]
     16829:	symbol=snd_pcm_hw_params_can_disable_period_wakeup;  lookup in file=/home/astrand/ctc/client/pulseaudio-new/src/.libs/libpulsecommon-UNKNOWN.UNKNOWN.so [0]
     16829:	symbol=snd_pcm_hw_params_can_disable_period_wakeup;  lookup in file=/home/astrand/ctc/client/pulseaudio-new/src/.libs/libpulse.so.0 [0]
     16829:	symbol=snd_pcm_hw_params_can_disable_period_wakeup;  lookup in file=/home/astrand/ctc/client/pulseaudio-new/src/.libs/libalsa-util.so [0]
     16829:	symbol=snd_pcm_hw_params_can_disable_period_wakeup;  lookup in file=/lib64/libasound.so.2 [0]
     16829:	/home/astrand/ctc/client/pulseaudio-new/src/.libs/libalsa-util.so: error: relocation error: symbol snd_pcm_hw_params_can_disable_period_wakeup, version ALSA_0.9 not defined in file libasound.so.2 with link time reference (fatal)

There is no "destroying link map" for the failing case.
Comment 2 Pierre Ossman 2012-09-14 10:11:22 UTC
Problem identified. The .so file confusing glibc is marked as NODELETE, which interferes with the unloading. So the fix should probably be to make sure NODELETE isn't respected for files that haven't been fully loaded yet.
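
One way to see which objects in a running process carry the NODELETE flag is to walk the loaded objects with dl_iterate_phdr() and look for DF_1_NODELETE in each DT_FLAGS_1 dynamic entry. This is a minimal sketch, not something taken from the attached test case; the names report_nodelete and count_nodelete_objects are hypothetical, and whether a half-loaded module still appears in this list after a failed dlopen() is an assumption that would have to be checked on the affected system.

```c
#define _GNU_SOURCE
#include <elf.h>
#include <link.h>
#include <stdio.h>

/* Callback for dl_iterate_phdr(): prints every loaded object whose dynamic
   section has DF_1_NODELETE set in DT_FLAGS_1 and counts it in *data. */
static int report_nodelete(struct dl_phdr_info *info, size_t size, void *data)
{
    int *count = data;
    (void)size;

    for (int i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];

        if (ph->p_type != PT_DYNAMIC)
            continue;

        /* The dynamic section is mapped at load base + p_vaddr. */
        const ElfW(Dyn) *dyn =
            (const ElfW(Dyn) *)(info->dlpi_addr + ph->p_vaddr);

        for (; dyn->d_tag != DT_NULL; dyn++) {
            if (dyn->d_tag == DT_FLAGS_1 &&
                (dyn->d_un.d_val & DF_1_NODELETE)) {
                fprintf(stderr, "NODELETE: %s\n",
                        info->dlpi_name[0] ? info->dlpi_name : "(unnamed)");
                (*count)++;
            }
        }
    }

    return 0; /* returning 0 keeps the iteration going */
}

/* Hypothetical helper: number of loaded objects marked NODELETE. */
int count_nodelete_objects(void)
{
    int count = 0;

    dl_iterate_phdr(report_nodelete, &count);
    return count;
}
```

Calling count_nodelete_objects() before and after the failing dlopen() and comparing the printed names could show whether the failed module was left behind.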
Comment 3 Pierre Ossman 2012-09-14 10:15:19 UTC
Created attachment 6633
test case

Test case that provokes this bug. Example run:

~/devel/dlfail
[ossman@ossman]$ ./program 
Opening module.so (first attempt)...
Result: OK
Getting function...
Result: OK
Calling function...
./program: symbol lookup error: ./module.so: undefined symbol: library_function

~/devel/dlfail
[ossman@ossman]$ LD_BIND_NOW=1 ./program 
Opening module.so (first attempt)...
Result: fail
Opening module.so (second attempt)...
Result: OK
Getting function...
Result: OK
Calling function...
Segmentation fault (core dumped)


The structure is:

 program === dlopen() ==> module.so === dynlink ==> library.so

But library.so is constructed so it lacks a symbol that module.so expects.
Comment 4 cnewbold 2014-04-10 19:48:16 UTC
We've tracked down an issue with very similar symptoms that is likely related to the issue documented here. In our case, the library which cannot load and which then gets stuck in a half-loaded state does not have the NODELETE flag set but does contain an STB_GNU_UNIQUE symbol.

The issue only occurs when a library (1) contains an STB_GNU_UNIQUE symbol that has not yet been loaded/resolved from some other library; and (2) fails to load for some reason, such as with RTLD_NOW and unresolved references. Under these conditions, ld.so loads and resolves the STB_GNU_UNIQUE symbol as the library is loaded and notes that dependency. However, when the load ultimately fails with an unresolved symbol, the linkage to the STB_GNU_UNIQUE symbol is not undone--which then prevents the cleanup from the failed load from fully unloading the library.

I've attached a test case which provokes this failure.

The wrinkle when STB_GNU_UNIQUE is involved is that if a definition of the same STB_GNU_UNIQUE symbol is already present in the process, the load will fail and will properly clean up the partially-loaded library.

The test case I've attached illustrates this by comparing the behavior of two different load orders:

    (1) h = dlopen("bad.so", RTLD_NOW | RTLD_GLOBAL); /* fails -- unresolved */
    (2) h = dlopen("bad.so", RTLD_NOW | RTLD_GLOBAL | RTLD_NOLOAD); /* succeeds?!? */
    (3) if (h) dlclose(h); /* assertion from within dlclose! */

vs.

    (1) h_good = dlopen("good.so", RTLD_NOW | RTLD_GLOBAL); /* succeeds */
    (2) h_bad = dlopen("bad.so", RTLD_NOW | RTLD_GLOBAL); /* fails -- unresolved */
    (3) h_bad = dlopen("bad.so", RTLD_NOW | RTLD_GLOBAL | RTLD_NOLOAD); /* fails again */

Where both bad.so and good.so contain instances of the same STB_GNU_UNIQUE symbol.
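
The first load order can be wrapped in a small probe. This is a sketch under stated assumptions rather than the attached test case: the function name stale_after_failed_load and its return convention are hypothetical. It deliberately does not call dlclose() on a stale handle, because step (3) above reports an assertion inside dlclose in that situation.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper. Returns:
     0  the load failed and nothing was left behind (expected behaviour),
     1  the load failed but RTLD_NOLOAD still finds the object (stale),
    -1  the load succeeded, so there is nothing to probe. */
int stale_after_failed_load(const char *path)
{
    void *h = dlopen(path, RTLD_NOW | RTLD_GLOBAL);

    if (h != NULL) {
        dlclose(h);
        return -1;
    }

    const char *err = dlerror();
    fprintf(stderr, "first dlopen failed: %s\n", err ? err : "(no message)");

    /* RTLD_NOLOAD never loads anything; it only returns a handle if the
       object is already resident in the process. */
    h = dlopen(path, RTLD_NOW | RTLD_GLOBAL | RTLD_NOLOAD);

    /* A non-NULL handle is intentionally not closed here, see above. */
    return h != NULL;
}
```

On a loader that cleans up correctly, stale_after_failed_load("bad.so") should return 0 regardless of whether good.so was loaded first.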

I'm not sure whether this warrants a separate bug report or not.
Comment 5 cnewbold 2014-04-10 19:49:33 UTC
Created attachment 7547
Illustrates ld.so confusion w/ STB_GNU_UNIQUE and failed dlopen
Comment 6 Michael Fenn 2014-06-24 16:22:40 UTC
We have a further refinement of the observations of cnewbold in #4.  In this case, we have been able to get code to execute that resides in a similarly not-quite loaded file.  We have also found that this occurs even with the RTLD_LOCAL flag.  In the attached example we create an executable and two .so files (sh -x output below):

+ gcc --version
gcc (GCC) 4.7.2
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

+ STDFLAG=-std=c++11
+ g++ -o objs/test_plugger -Wl,--export-dynamic -std=c++11 test_plugger.cxx -ldl
+ g++ -o objs/plugin0.so -shared -fPIC -std=c++11 plugin0.cxx
+ g++ -o objs/plugin1.so -shared -fPIC -std=c++11 plugin1.cxx

The executable will attempt to dlopen (with RTLD_NOW | RTLD_LOCAL) its arguments in order.

plugin1.so cannot be loaded due to a missing symbol, but contains an STB_GNU_UNIQUE symbol which happens to be a function pointer.

plugin0.so can be loaded, but contains an identically named STB_GNU_UNIQUE symbol. In a static initialization section it sets an extern function pointer in the host program to that symbol, but the symbol resolves to the function pointer in plugin1.so, not the one in plugin0.so.

The host dereferences the pointer and calls, getting the definition from plugin1.so to execute.

+ ./objs/test_plugger ./objs/plugin1.so ./objs/plugin0.so
test_plugger.cxx:14 loading ./objs/plugin1.so
test_plugger.cxx:16 failed! dlerror = ./objs/plugin1.so: undefined symbol: _Z21host_doesnt_have_thisv
test_plugger.cxx:14 loading ./objs/plugin0.so
plugin0.cxx:19 _init()
test_plugger.cxx:18 success!
plugin1.cxx:8 execute()  # this should be plugin0.cxx

Source files are attached as standalone.tgz.  The example can be compiled and run by

$ sh compile_and_run.sh
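
To confirm which shared object a function pointer really points into before calling it, the host could ask the dynamic linker with dladdr(). The helper below is a hypothetical illustration and is not part of standalone.tgz; the name object_containing is an assumption.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper: returns the path of the loaded object that contains
   `addr`, or NULL if the address does not belong to any loaded object. */
const char *object_containing(const void *addr)
{
    Dl_info info;

    if (dladdr(addr, &info) == 0 || info.dli_fname == NULL)
        return NULL;

    return info.dli_fname;
}
```

In the scenario described above, passing the value of the host's function pointer would be expected to name plugin1.so on an affected glibc and plugin0.so on a fixed one.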
Comment 7 Michael Fenn 2014-06-24 16:24:29 UTC
Created attachment 7656
test case "standalone.tgz"
Comment 8 Carlos O'Donell 2016-12-22 16:38:33 UTC
Adding Florian Weimer to the CC since he fixed a few of the cases where objects that failed to load were not properly unloaded.
Comment 9 Florian Weimer 2024-05-07 11:38:44 UTC
I believe we have fixed this issue under bug 25112, which went into glibc 2.31 (but some distributions have backported the relevant commit and related dynamic linker corrections into earlier versions they maintain).

Does this issue persist?
Comment 10 Michael Fenn 2024-05-07 12:42:04 UTC
Thanks, Florian.  I tested standalone.tgz on Rocky Linux 9 (glibc 2.34) and it works as expected:

+ ./objs/test_plugger ./objs/plugin1.so ./objs/plugin0.so
test_plugger.cxx:14 loading ./objs/plugin1.so
test_plugger.cxx:16 failed! dlerror = ./objs/plugin1.so: undefined symbol: _Z21host_doesnt_have_thisv
test_plugger.cxx:14 loading ./objs/plugin0.so
plugin0.cxx:19 _init()
test_plugger.cxx:18 success!
plugin0.cxx:8 execute()   # correctly executed code from plugin0
Comment 11 Florian Weimer 2024-05-07 12:43:32 UTC
Resolving per comment 10. Thanks.

*** This bug has been marked as a duplicate of bug 25112 ***