Bug 28937 - New DSO dependency sorter does not put new map first if in a cycle
Summary: New DSO dependency sorter does not put new map first if in a cycle
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: dynamic-link (show other bugs)
Version: 2.35
: P2 normal
Target Milestone: 2.37
Assignee: Florian Weimer
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-03-03 12:48 UTC by Pieter-Jan Briers
Modified: 2023-08-22 10:49 UTC (History)
4 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed: 2022-08-08 00:00:00
fweimer: security-


Attachments
Both log files, since it seems I can't attach multiple files on bugzilla? (6.27 KB, application/x-gzip)
2022-03-03 12:48 UTC, Pieter-Jan Briers
Details
ldso-assert.patch (3.87 KB, patch)
2022-03-03 15:27 UTC, Florian Weimer
Details | Diff
Core dump from SIGSEGV in ld (669.78 KB, application/zstd)
2022-03-03 23:25 UTC, Pieter-Jan Briers
Details
possible reproducer (775 bytes, application/x-gzip)
2022-04-11 23:19 UTC, Ben Woodard
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Pieter-Jan Briers 2022-03-03 12:48:15 UTC
Created attachment 14001 [details]
Both log files, since it seems I can't attach multiple files on bugzilla?

(I previously reported this on the Arch Linux bug tracker and it was decided to be an upstream issue, https://bugs.archlinux.org/task/73967)

Description:

The new new dependency resolver for dynamic library loading seems to be causing some sort of internal corruption/buggyness when paired with usage of dlclose(). I traced some bug reports[1] we were getting in our game to it after a lot of debugging, and then retroactively we found a second problem one of our maintainers had been dealing with that was caused by this. These all come from Arch Linux (I think) because they're already on the latest glibc version.

The first problem we had is that dlclose()ing the result of dlopen(NULL) seemed to cause specifically libbz2.so to erroneously get unloaded (in LD_DEBUG=libs). This then caused future dynamic loading of other libs (specifically GTK) which needed libbz2.so to break, and that caused the initial bug reports.

The full sequence of events, for context, can be best described with the following C code and comments. Note that to trigger this I compiled it with fluidsynth/ALSA/PortAudio available on an Arch Linux system, so I'm not sure this will be easy to reproduce (I failed to produce a more minimal repro). Compile instructions are just gcc main.c && ./a.out.

#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

void main()
{
void* lib;

// Early on, we load freetype, which has a direct dependency on bzip2.
lib = dlopen("libfreetype.so.6", RTLD_NOW);
printf("#### freetype: %llX\n", lib);

// Then, much later (when the user opens the instrument menu), we load fluidsynth
lib = dlopen("libfluidsynth.so.3", RTLD_NOW);
printf("#### fluidsynth: %llX\n", lib);

// Using dlsym so we don't link directly against fluidsynth and it's dynamically loaded.
void* (*new_fluid_settings_fp)() = dlsym(lib, "new_fluid_settings");

// And initialize fluidsynth by calling new_fluid_settings()
printf("#### Initializing fluidsynth!\n");
void* settings = new_fluid_settings_fp();

// Fluidsynth will go through all its audio drivers, which ends up test-initializing all its audio drives (there's a lot), one of which being ALSA (through PortAudio when I was debugging this, didn't test if it happens directly to ALSA)
// alsa-libs/libasound.so ends up doing many sequences of dlopen(NULL) -> dlsym() -> dlclose() at config.c line 4017[2] running config hooks. This dlclose call, for some reason, results in libbz2.so getting erroneously fini'd.

printf("#### Initialized fluidsynth!\n");

// GTK3 will fail to load (be zero) if loaded after fluidsynth initialized, because libbz2.so was erroneously unloaded earlier.
// LD_DEBUG reports being unable to find the raw symbols provided by BZ2 (e.g. BZ2_hbMakeCodeLengths) which seems to imply it's only partially corrupted?
lib = dlopen("libgtk-3.so", RTLD_NOW);
printf("#### GTK3: %llX\n", lib);

return;
}

When I run this on my system right now (glibc 2.35), GTK will fail to be loaded, and LD_DEBUG=libs output shows libbz2.so getting unloaded during FluidSynth initialization. If I then run it with GLIBC_TUNABLES=glibc.rtld.dynamic_sort=1, this does not happen and GTK loads fine.

I have attached the log of me running this with LD_DEBUG=libs as mine.log. I understand there's tons of variables and distro-specific things at play to make this reliably repro, but I've had multiple anecdotes from Arch at least (my personal testing + 2 bug reports on our engine repo)

The second problem appears to be an explicit hard assert error: "Inconsistency detected by ld.so: dl-close.c: 277: _dl_close_worker: Assertion `imap->l_type == lt_loaded && !imap->l_nodelete_active' failed!". One of our maintainers started getting this "recently" (probably when Arch pushed the 2.35 glibc package) and it caused her game to fail to start most of the time (not 100% consistent). After I discovered the first thing she tried GLIBC_TUNABLES=glibc.rtld.dynamic_sort=1 and this problem also went away completely. Attached is her LD_DEBUG=libs log as vera.log in case it's useful. There is quite a lot of stuff happening in it because the game is .NET and it has to load graphics drivers and such though.

[1]: https://github.com/space-wizards/RobustToolbox/issues/2563
[2]: https://github.com/alsa-project/alsa-lib/blob/1454b5f118a3b92663923fe105daecfeb7e20f1b/src/conf.c#L3998-L4020
Comment 1 Florian Weimer 2022-03-03 15:27:08 UTC
Created attachment 14002 [details]
ldso-assert.patch

Could you please check if the assert added by this check triggers? Thanks.
Comment 2 Pieter-Jan Briers 2022-03-03 22:55:00 UTC
So I just rebooted into my Linux system to try compiling that patch only for my program to outright crash on a SIGSEGV in ld (and yes, disabling the new sort fixed it). Core dump attached but I was stupid and forgot to make a btrfs snapshot before running a pacman upgrade so I can't get this to happen again.
Comment 3 Pieter-Jan Briers 2022-03-03 23:25:20 UTC
Created attachment 14004 [details]
Core dump from SIGSEGV in ld
Comment 4 Pieter-Jan Briers 2022-03-04 01:07:22 UTC
The assert in that patch does not appear to be getting triggered, from what I can tell.
Comment 5 Ben Woodard 2022-04-11 23:19:44 UTC
Created attachment 14059 [details]
possible reproducer

I'm pretty sure that this reproducer is manifesting the same problem. So far we have observed this with the debian beta and with fedora 36 beta both of which use glibc 2.35

This only seems to happen with some libraries which have some dependencies. We have not yet been able to figure out a common denominator about which library dependencies seem to trigger the problem. However we did find that one library which is part of the distro which reliably seemed to be frequently involved in the failing test cases, hwloc.

LD_PRELOAD the hwloc library prevents the problem from happening.
Comment 6 Florian Weimer 2022-04-12 10:29:24 UTC
(In reply to Ben Woodard from comment #5)
> Created attachment 14059 [details]
> possible reproducer
> 
> I'm pretty sure that this reproducer is manifesting the same problem. So far
> we have observed this with the debian beta and with fedora 36 beta both of
> which use glibc 2.35

Thank you for helping with this bug. However, this reproducer is something else: it's an incorrect putenv call with a static string argument in hwloc. So it's not a glibc bug.

(Note that the reproducer needs to be linked with -Wl,--no-as-needed in case the link editor defaults to as-needed mode.)
Comment 7 Florian Weimer 2022-08-08 10:38:18 UTC
We still cannot reproduce this bug on Fedora, as far as I know. Has anyone made progress debugging this further?
Comment 8 Pieter-Jan Briers 2022-08-14 20:06:50 UTC
I made an Arch Linux VM image that reliably reproduces the bug, in the hope that it helps debugging. It's about 800 MB XZ-compressed, 3.7 GB without. I made the VM in Hyper-V and converted the .vhdx to .qcow2 in WSL2 though, so I hope that didn't break anything.

Here's the file: https://central.spacestation14.io/ArchText.qcow2.xz
Login is just root with password "root". Both the broken and working versions of the sample code (one with the fluidsynth load commented out) are included in the /root. 

One of the commenters on the downstream Arch issue mentioned they ran valgrind and it was *not* happy, and I can definitely confirm this. I didn't try running valgrind before making the VM image though, and I didn't want to sit through another 10 minutes of xz compressing a massive file, so here's the instructions to get valgrind set up:

pacman -S valgrind debuginfod
export DEBUGINFOD_URLS="https://debuginfod.archlinux.org" # Necessary because of https://bbs.archlinux.org/viewtopic.php?id=276422
valgrind --leak-check=yes ./broken

Valgrind does complain about some stuff in the audio libraries themselves, but there is still noticeably a couple errors inside dlclose() itself.
Comment 9 Florian Weimer 2022-08-15 09:23:51 UTC
I could get it to boot in UEFI mode, but the disk has to be attached as sata (not virtio). Thank you for putting together this reproducer, it is really helpful.

The situation matches the one described in comment 0. There is no dlclose call for libfreetype.so.6, so libbz2.so.1.0 really can't been unloaded.

A comment in _dl_close_worker alludes to the root cause:

	  /* We are always the zeroth entry, and since we don't include
	     ourselves in the dependency analysis start at 1.  */
	  struct link_map **lp = &l->l_initfini[1];

The assumption is not correct in this case, l_initfini[0] is actually libbz2.so.1.0 here. The reason is that the new algorithm does not implement skipping, and due to the cyclic dependency between libfreetype.so.6 and libharfbuzz.so.0, libfreetype.so.6 is not sorted first naturally.

I think it makes sense to always initialize the newly-loaded object first, but as the comment in _dl_sort_maps_dfs explains, that algorithm cannot skip the dependencies directly. But I think we should adjsut _dl_sort_maps_dfs because the initialization order property seems useful. I'll try to come up with a patch.
Comment 10 Florian Weimer 2022-09-20 09:07:45 UTC
Fixed for 2.37 via:

commit 1df71d32fe5f5905ffd5d100e5e9ca8ad6210891
Author: Florian Weimer <fweimer@redhat.com>
Date:   Tue Sep 20 11:00:42 2022 +0200

    elf: Implement force_first handling in _dl_sort_maps_dfs (bug 28937)
    
    The implementation in _dl_close_worker requires that the first
    element of l_initfini is always this very map (“We are always the
    zeroth entry, and since we don't include ourselves in the
    dependency analysis start at 1.”).  Rather than fixing that
    assumption, this commit adds an implementation of the force_first
    argument to the new dependency sorting algorithm.  This also means
    that the directly dlopen'ed shared object is always initialized last,
    which is the least surprising behavior in the presence of cycles.
    
    Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>

I'll backport this as applicable.