Bug 31986 - Loading the same library within an audit library and within an application can cause ld.so to crash with an assert.
Status: ASSIGNED
Alias: None
Product: glibc
Classification: Unclassified
Component: dynamic-link
Version: 2.39
Importance: P2 normal
Target Milestone: ---
Assignee: Florian Weimer
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-07-17 19:00 UTC by Ben Woodard
Modified: 2024-09-19 23:23 UTC

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments
reproducer (2.90 KB, application/gzip)
2024-07-17 19:00 UTC, Ben Woodard

Description Ben Woodard 2024-07-17 19:00:37 UTC
Created attachment 15630
reproducer

This is a particularly serious issue for auditor-based tools that need to interface with binaries within the application namespace. Tools often need to make calls to a library immediately when it is loaded, before application code starts to use the library. It is not safe to call into the library prior to its init constructors, and the auditor interface does not provide a callback after init constructors have run. Thus the only alternative is to "promote" the init constructors through a recursive call to dl*open during la_activity(CONSISTENT).
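
For readers who have not seen the pattern, here is a minimal sketch of such an auditor. It is not the code from the attached reproducer: the is_target helper, the fixed-size path buffer, and the use of dlmopen(LM_ID_BASE, ...) to reach the application namespace are assumptions made for illustration. Only la_version, la_objopen, la_activity, and the LA_ACT_CONSISTENT flag come from the rtld-audit interface.

```c
/* Sketch of an LD_AUDIT library that "promotes" a library's constructors.
   "libinit.so" is the library name used by the reproducer's output.  */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <link.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static char pending_path[4096];  /* seen in la_objopen, not yet promoted */
static bool promoting;           /* guards against re-entering la_activity */

/* Hypothetical helper: is this the library the tool cares about?  */
static bool is_target(const char *path)
{
    return path != NULL && strstr(path, "libinit.so") != NULL;
}

unsigned int la_version(unsigned int version)
{
    (void) version;
    return LAV_CURRENT;
}

unsigned int la_objopen(struct link_map *map, Lmid_t lmid, uintptr_t *cookie)
{
    (void) cookie;
    /* The object is mapped at this point, but its constructors have not run.  */
    if (lmid == LM_ID_BASE && is_target(map->l_name))
        snprintf(pending_path, sizeof pending_path, "%s", map->l_name);
    return 0;  /* no symbol-binding auditing requested */
}

void la_activity(uintptr_t *cookie, unsigned int flag)
{
    (void) cookie;
    if (flag != LA_ACT_CONSISTENT || promoting || pending_path[0] == '\0')
        return;
    promoting = true;
    /* Recursive open into the application namespace (assumed approach).  */
    void *handle = dlmopen(LM_ID_BASE, pending_path, RTLD_NOW);
    if (handle == NULL)
        fprintf(stderr, "[audit] dlmopen failed: %s\n", dlerror());
    pending_path[0] = '\0';
    promoting = false;
}
```

Built as a shared object (for example with gcc -shared -fPIC) and run via LD_AUDIT=./auditor.so ./main, a library of this shape performs the recursive open from inside the outer dlopen, which is the situation this report describes.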

In particular cases the dynamic linker asserts with:

Inconsistency detected by ld.so: dl-open.c: 627: dl_open_worker_begin: Assertion `r_state == RT_CONSISTENT' failed!
make: [Makefile:19: test] Error 127 (ignored)

To run the attached reproducer, simply:

tar xvzf recursive-dlopen-crashes.tar.gz
cd recursive-dlopen-crashes
make

The two test cases which fail are at the end of the output:

Outer dlopen(libinit), inner dlopen(libinit):
LD_AUDIT=./auditor.so ./main
[main] Dlopening libinit...
[audit] libinit has been loaded (but not initialized)
[audit] First CONSISTENT with libinit, dlopening libinit...
Inconsistency detected by ld.so: dl-open.c: 627: dl_open_worker_begin: Assertion `r_state == RT_CONSISTENT' failed!
make: [Makefile:19: test] Error 127 (ignored)

Outer dlopen(libwrap), inner dlopen(libwrap):
LD_AUDIT=./auditor-wrap.so ./main-wrap
[main] Dlopening libwrap...
[audit] libinit has been loaded (but not initialized)
[audit] First CONSISTENT with libinit, dlopening libwrap...
Inconsistency detected by ld.so: dl-open.c: 627: dl_open_worker_begin: Assertion `r_state == RT_CONSISTENT' failed!
make: [Makefile:22: test] Error 127 (ignored)

This use case is particularly problematic for LD_AUDIT-based tools that work with GPU frameworks. For instance, as part of its initialization HPCToolkit calls into libcuda.so to set up callbacks for monitoring CUDA operations. We use dlopen/dlsym to access the libcuda.so API without creating a direct dependency (to avoid loading libcuda.so for non-CUDA applications). However, some application frameworks initiate CUDA operations during their init constructors. To capture these operations, along with other operations of interest such as thread creation, we initialize when libcuda.so is loaded. If the first action of an application framework's init constructor is a dlopen(libcuda.so) (seen in IBM’s XL OpenMP runtime when used by Clang for OpenMP offloading), we initialize during this call, recursively dlopen(libcuda.so), and subsequently crash due to this bug.
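
As a rough illustration of the dlopen/dlsym access pattern described above (not HPCToolkit's actual code), the sketch below looks up a driver entry point at run time so that the tool carries no link-time dependency on the library. The helper name init_driver_if_present is hypothetical, and the CUresult return type of cuInit is simplified to int.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Shape of cuInit from the CUDA driver API; CUresult is treated as int here.  */
typedef int (*cu_init_fn)(unsigned int flags);

/* Hypothetical helper: returns 0 if the driver was found and initialized,
   -1 if the library or the symbol is unavailable or initialization failed.  */
static int init_driver_if_present(const char *soname)
{
    void *handle = dlopen(soname, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        /* Non-CUDA system or application: nothing to monitor.  */
        fprintf(stderr, "not loaded: %s\n", dlerror());
        return -1;
    }

    cu_init_fn init = (cu_init_fn) dlsym(handle, "cuInit");
    if (init == NULL) {
        dlclose(handle);
        return -1;
    }

    /* The handle is intentionally kept open while the tool uses the API.  */
    return init(0) == 0 ? 0 : -1;
}
```

When a call like this runs inside la_activity while the application's own dlopen of the same library is still in progress, it becomes the recursive open that triggers the assertion.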
Comment 1 Florian Weimer 2024-08-06 09:13:47 UTC
The recursive dlmopen hits the assert in the already-open path in dl_open_worker_begin:

  /* It was already open.  */
  if (__glibc_unlikely (new->l_searchlist.r_list != NULL))
    {
      /* Let the user know about the opencount.  */
      if (__glibc_unlikely (GLRO(dl_debug_mask) & DL_DEBUG_FILES))
	_dl_debug_printf ("opening file=%s [%lu]; direct_opencount=%u\n\n",
			  new->l_name, new->l_ns, new->l_direct_opencount);

      /* If the user requested the object to be in the global
	 namespace but it is not so far, prepare to add it now.  This
	 can raise an exception to do a malloc failure.  */
      if ((mode & RTLD_GLOBAL) && new->l_global == 0)
	add_to_global_resize (new);

      /* Mark the object as not deletable if the RTLD_NODELETE flags
	 was passed.  */
      if (__glibc_unlikely (mode & RTLD_NODELETE))
	{
	  if (__glibc_unlikely (GLRO (dl_debug_mask) & DL_DEBUG_FILES)
	      && !new->l_nodelete_active)
	    _dl_debug_printf ("marking %s [%lu] as NODELETE\n",
			      new->l_name, new->l_ns);
	  new->l_nodelete_active = true;
	}

      /* Finalize the addition to the global scope.  */
      if ((mode & RTLD_GLOBAL) && new->l_global == 0)
	add_to_global_update (new);

      const int r_state __attribute__ ((unused))
        = _dl_debug_update (args->nsid)->r_state;
      assert (r_state == RT_CONSISTENT);

I think we need to look at new->l_init_called and re-run the constructors along new->l_searchlist.r_list if new->l_init_called is false. Not sure if we want to switch to RT_CONSISTENT before that, or leave it (potentially) at RT_ADD.
Comment 2 Florian Weimer 2024-08-06 17:50:24 UTC
I've got a reproducer of the missing constructor call that doesn't even need an auditor.
Comment 3 Ben Woodard 2024-08-06 23:15:50 UTC
I talked to the original problem reporter and his opinion is that it should transition through RT_CONSISTENT rather than staying in RT_ADD. One of the reasons is there is an assumption in tools that user code such as library constructors should not be running while the linker state is in RT_ADD.
Comment 4 Florian Weimer 2024-08-07 09:01:26 UTC
(In reply to Ben Woodard from comment #3)
> I talked to the original problem reporter and his opinion is that it should
> transition through RT_CONSISTENT rather than staying in RT_ADD. One of the
> reasons is there is an assumption in tools that user code such as library
> constructors should not be running while the linker state is in RT_ADD.

My patches swap the order of the la_activity calls and the switch back to RT_CONSISTENT. This takes care of the asserts (after some other fixes …), and I think it makes sense from a conceptual point of view, too. Introducing more la_activity calls is problematic because even in our limited tests, this can introduce infinite recursion that wasn't there before.
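
The effect of the ordering can be shown with a toy model. This is not glibc code, and every name in it is made up for illustration. It assumes that, before the swap, the auditor was notified while the state was still RT_ADD, so a recursive open of an already-loaded library reached the already-open path in an inconsistent state. The real code paths carry more state than this.

```c
#include <stdbool.h>

enum state { CONSISTENT, ADD };  /* stand-ins for RT_CONSISTENT / RT_ADD */
enum order { NOTIFY_THEN_SWITCH, SWITCH_THEN_NOTIFY };

struct model {
    enum state r_state;
    bool loaded;      /* library is already in the namespace */
    bool assert_hit;  /* would the already-open assertion fire? */
};

static void model_open(struct model *m, enum order order, int depth);

/* Auditor callback: reopen the same library once, as in the failing tests.  */
static void notify_consistent(struct model *m, enum order order, int depth)
{
    if (depth == 0)
        model_open(m, order, depth + 1);
}

static void model_open(struct model *m, enum order order, int depth)
{
    if (m->loaded) {
        /* Already-open path: expects the namespace to be consistent.  */
        if (m->r_state != CONSISTENT)
            m->assert_hit = true;
        return;
    }
    m->r_state = ADD;
    m->loaded = true;
    if (order == NOTIFY_THEN_SWITCH) {
        notify_consistent(m, order, depth);
        m->r_state = CONSISTENT;
    } else {
        m->r_state = CONSISTENT;
        notify_consistent(m, order, depth);
    }
}

/* Returns true if the recursive open would trip the assertion.  */
static bool recursive_open_asserts(enum order order)
{
    struct model m = { CONSISTENT, false, false };
    model_open(&m, order, 0);
    return m.assert_hit;
}
```

In this model only the NOTIFY_THEN_SWITCH order reaches the already-open path with the state still set to ADD.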
Comment 5 Florian Weimer 2024-08-07 10:07:25 UTC
Patches posted:

[PATCH 0/4] Fixes for recursive dlopen (bug 31986)
<https://inbox.sourceware.org/libc-alpha/cover.1723024001.git.fweimer@redhat.com/>
Comment 6 Carlos O'Donell 2024-09-06 13:55:58 UTC
v2 posted by Florian:
https://patchwork.sourceware.org/project/glibc/list/?series=37208
Comment 7 Ben Woodard 2024-09-19 19:53:15 UTC
I think that I may need a bit more for this to be a complete fix of the problem.

I built a local version of the trunk with:
fae459a273 (HEAD -> fw-fixes) elf: Signal RT_CONSISTENT after relocation processing in dlopen (bug 31986)
be59ac60e3 elf: Signal LA_ACT_CONSISTENT to auditors after RT_CONSISTENT switch
00cdcdfe1a elf: Run constructors on cyclic recursive dlopen (bug 31986)
edf36ee9ab elf: Reorder audit events in dlcose to match _dl_fini (bug 32066)
d5167014b6 elf: Call la_objclose for proxy link maps in _dl_fini (bug 32065)
7bd0d8585d elf: Signal la_objopen for the proxy link map in dlmopen (bug 31985)
e36412841b elf: Update DSO list, write audit log to elf/tst-audit23.out
e64a1e81aa (origin/master, origin/HEAD, master) tst: Extend cross-test-ssh.sh to support passing glibc tunables

Although it makes it through the first few test cases and gets farther than when I originally filed the bug, it doesn't make it through all of them.

[ben@darkstar build]$ ./testrun.sh /usr/bin/make -C ../../test/auditor-tests/tier2/recursive-dlopen-crashes
make: Entering directory '/home/ben/Shared/Work/test/auditor-tests/tier2/recursive-dlopen-crashes'

All tests below require lines end in OK (not FAIL and no error)

Outer dlopen(libwrap), inner dlopen(libinit):
LD_AUDIT=./auditor.so ./main-wrap
[main] Dlopening libwrap...
[audit] libinit has been loaded (but not initialized)
[audit] First CONSISTENT with libinit, dlopening libinit...
  [libinit] Initializing... OK
[audit -> libinit] Validating libinit has initialized...
  [libinit] Checking if initialized... OK
[main -> libwrap] Validating libinit has initialized...
[libwrap -> libinit] Passing validation down to libinit...
  [libinit] Checking if initialized... OK

Outer libinit preloaded, inner dlopen(libinit):
LD_PRELOAD=./libinit.so LD_AUDIT=./auditor.so ./main
[audit] libinit has been loaded (but not initialized)
[audit] First CONSISTENT with libinit, dlopening libinit...
  [libinit] Initializing... OK
[audit -> libinit] Validating libinit has initialized...
  [libinit] Checking if initialized... OK
[main] Dlopening libinit...
[main -> libinit] Validating libinit has initialized...
  [libinit] Checking if initialized... OK

Outer libinit loaded by main dependency, inner dlopen(libinit):
LD_AUDIT=./auditor.so ./main-init
[audit] libinit has been loaded (but not initialized)
[audit] First CONSISTENT with libinit, dlopening libinit...
  [libinit] Initializing... OK
[audit -> libinit] Validating libinit has initialized...
  [libinit] Checking if initialized... OK
[main] Dlopening libinit...
[main -> libinit] Validating libinit has initialized...
  [libinit] Checking if initialized... OK

Outer dlopen(libinit), inner dlopen(libwrap):
LD_AUDIT=./auditor-wrap.so ./main
[main] Dlopening libinit...
[audit] libinit has been loaded (but not initialized)
[audit] First CONSISTENT with libinit, dlopening libwrap...
  [libinit] Initializing... OK
[audit -> libwrap] Validating libinit has initialized...
[libwrap -> libinit] Passing validation down to libinit...
  [libinit] Checking if initialized... OK
[main -> libinit] Validating libinit has initialized...
  [libinit] Checking if initialized... OK

Outer dlopen(libinit), inner dlopen(libinit):
LD_AUDIT=./auditor.so ./main
[main] Dlopening libinit...
[audit] libinit has been loaded (but not initialized)
[audit] First CONSISTENT with libinit, dlopening libinit...
Inconsistency detected by ld.so: dl-open.c: 624: dl_open_worker_begin: Assertion `_dl_debug_initialize (0, args->nsid)->r_state == RT_CONSISTENT' failed!
make: [Makefile:19: test] Error 127 (ignored)

Outer dlopen(libwrap), inner dlopen(libwrap):
LD_AUDIT=./auditor-wrap.so ./main-wrap
[main] Dlopening libwrap...
[audit] libinit has been loaded (but not initialized)
[audit] First CONSISTENT with libinit, dlopening libwrap...
Inconsistency detected by ld.so: dl-open.c: 624: dl_open_worker_begin: Assertion `_dl_debug_initialize (0, args->nsid)->r_state == RT_CONSISTENT' failed!
make: [Makefile:22: test] Error 127 (ignored)
make: Leaving directory '/home/ben/Shared/Work/test/auditor-tests/tier2/recursive-dlopen-crashes'

Is there a patch that I missed? Or some other set of patches that I need to apply first?
Comment 8 Ben Woodard 2024-09-19 23:23:09 UTC
A more precise way to run the specific test is:

[ben@darkstar build2]$ LD_AUDIT=../../test/auditor-tests/tier2/recursive-dlopen-crashes/auditor.so ./testrun.sh ../../test/auditor-tests/tier2/recursive-dlopen-crashes/main
[main] Dlopening libinit...
[main -> libinit] Validating libinit has initialized...
Segmentation fault (core dumped)

GDB doesn't give us any deeper insights. 

$ LD_LIBRARY_PATH=./nptl_db gdb -ex "set env GCONV_PATH=./iconvdata" -ex "set env LOCPATH=./localedata" -ex "set env LC_ALL=C" -ex "set arg --library-path .:./math:./elf:./dlfcn:./nss:./nis:./rt:./resolv:./mathvec:./support:./nptl ../../test/auditor-tests/tier2/recursive-dlopen-crashes/main" -ex "set env LD_AUDIT=../../test/auditor-tests/tier2/recursive-dlopen-crashes/auditor.so" ./elf/ld-linux-x86-64.so.2
GNU gdb (GDB) Red Hat Enterprise Linux 10.2-13.el9
<snip>
Reading symbols from ./elf/ld-linux-x86-64.so.2...
(gdb) r
Starting program: /home/ben/Shared/Work/glibc/build2/elf/ld-linux-x86-64.so.2 --library-path .:./math:./elf:./dlfcn:./nss:./nis:./rt:./resolv:./mathvec:./support:./nptl ../../test/auditor-tests/tier2/recursive-dlopen-crashes/main
warning: Corrupted shared library list: 0x0 != 0x7ffff7fc9000
warning: Corrupted shared library list: 0x0 != 0x7ffff7fc9000
warning: Corrupted shared library list: 0x0 != 0x7ffff7fc9000
warning: Corrupted shared library list: 0x0 != 0x7ffff7fc9000
[Thread debugging using libthread_db enabled]
Using host libthread_db library "./nptl_db/libthread_db.so.1".
[main] Dlopening libinit...
warning: Corrupted shared library list: 0x0 != 0x7ffff7fc9000
[main -> libinit] Validating libinit has initialized...

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00000000004011b5 in ?? ()
#2  0x0000000000000000 in ?? ()

A clue could be the "warning: Corrupted shared library list: 0x0 != 0x7ffff7fc9000"
As with PR31985, I accept that this could easily be user error on my part. I just wanted to try the patches with the reproducers before they were applied upstream.