Bug 30765 - Recursive library loading problem when using glibc probes
Summary: Recursive library loading problem when using glibc probes
Status: NEW
Alias: None
Product: gdb
Classification: Unclassified
Component: shlibs
Version: HEAD
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on: 30766
Blocks:
 
Reported: 2023-08-15 13:43 UTC by Andrew Burgess
Modified: 2023-08-15 14:31 UTC (History)
0 users

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
GDB test case that exposes the issue described in this bug. (5.08 KB, patch)
2023-08-15 13:43 UTC, Andrew Burgess

Description Andrew Burgess 2023-08-15 13:43:55 UTC
Created attachment 15060
GDB test case that exposes the issue described in this bug.

This bug describes an issue with the mechanism GDB uses to detect shared library loading, specifically with glibc's probe interface.  I think the real problem is in glibc.  It may be possible to work around the issue in GDB, but I'm not sure how yet.

The attached patch applies to current(ish) HEAD of GDB (86dfe011797) and adds a test which shows the problem.  When run, I see results like this:

		=== gdb Summary ===

# of expected passes		4
# of known failures		3

Below is the description of the bug taken from the commit message included in the patch:

    gdb/testsuite: expose issue with recursive dlopen
    
    This commit exposes an issue with GDB's handling of recursive dlopen.
    The bug is actually an issue in glibc, but I'm creating this patch so
    that I can file a GDB bug, which I'll then reference from a glibc bug.
    
    The bug is actually in glibc's reloc_complete probe, which the glibc
    documentation describes like this:
    
      reloc_complete:
        The linker has relocated all objects in the specified namespace.
        The namespace's r_debug structure is consistent and may be
        inspected, and all objects in the namespace's link-map are
        guaranteed to have been relocated.
    
    In this test we create a situation where a recursive dlopen occurs.
    This is done by overriding malloc.
    
    Inside the overridden malloc we dlopen a library (libbar) and call a
    function from within that library, we then dlclose the library.  Care
    is taken so that we don't trigger this behaviour recursively, if the
    dlopen, call, dlclose sequence used within malloc triggers another
    malloc, then, in that case, we just forward the request straight
    through to malloc.
    
    Now, in the main() function we dlopen a different library (libfoo),
    call a function within it, and then dlclose the library.  There is no
    recursion protection here.  And so, the basic sequence of events is:
    
      In main, dlopen libfoo
        dlopen calls malloc
          In malloc, dlopen libbar
            dlopen calls malloc
              In malloc, allocate memory and return
            dlopen for libbar completes
          In malloc, call function from libbar
          In malloc, dlclose libbar
          In malloc, allocate memory and return
        dlopen for libfoo completes
      In main, call function from libfoo
      In main, dlclose libfoo
    
    It's not quite that simple: it turns out that dlopen calls malloc a
    number of times, and so we actually see repeated calls into malloc
    that each result in libbar being loaded, called, and closed.
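
The call sequence above can be modelled with a small, self-contained sketch.  This is a toy model only: `model_dlopen`, `model_malloc`, `real_malloc` and the `mallocs` parameter are hypothetical names invented for illustration, not glibc or GDB APIs, and the number of malloc calls made by a real dlopen is simply assumed here.

```python
# Toy model of the recursive dlopen sequence; nothing here is a real
# glibc or GDB interface.
trace = []
in_hook = False  # recursion guard, mirroring the overridden malloc


def real_malloc():
    trace.append("allocate memory and return")


def model_dlopen(lib, mallocs=1):
    # Assumption: the loader calls malloc `mallocs` times while loading.
    trace.append(f"dlopen {lib} starts")
    for _ in range(mallocs):
        model_malloc()
    trace.append(f"dlopen {lib} completes")


def model_malloc():
    global in_hook
    if in_hook:
        # Already inside the hook: forward straight to the allocator.
        real_malloc()
        return
    in_hook = True
    try:
        model_dlopen("libbar")
        trace.append("call function from libbar")
        trace.append("dlclose libbar")
    finally:
        in_hook = False
    real_malloc()


# main() has no recursion protection; assume two mallocs during dlopen.
model_dlopen("libfoo", mallocs=2)
trace.append("call function from libfoo")
trace.append("dlclose libfoo")

for line in trace:
    print(line)
```

With `mallocs=2` the trace shows libbar being loaded, called, and closed twice, each time before the dlopen of libfoo completes.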
    
    Within glibc, as each library is loaded, we pass through a number of
    probes:
    
      - map_start
      - map_complete
      - reloc_start
      - reloc_complete
    
    GDB only cares about the 'reloc_complete' probe, which is hit when all
    the libraries have been mapped and relocated.
    
    At some point after map_start the new library is added to the shared
    library list, but is not yet relocated.  Only when reloc_complete is
    hit are we guaranteed that all libraries have been relocated...
    
    The problem is, glibc calls malloc at some point between map_start and
    reloc_complete.  This call to malloc triggers the recursive dlopen.
    This recursive dlopen passes through all these probes, which means
    that GDB will be triggered by the reloc_complete probe.
    
    When the reloc_complete probe is hit the following things happen:
    
    First, GDB tries to only load information about the most recently
    added libraries.  To do this GDB tracks the known library list.  When
    reloc_complete is hit glibc passes GDB a pointer to the new library,
    which is part of a doubly linked list.
    
    GDB follows the back pointer for the new library and expects the
    previous library to be the last library that GDB knows was loaded.
    However, in our problem case this is not true.  The first
    library (libfoo) has already been added to the library list, but has
    not yet been announced (with reloc_complete) to GDB.  GDB is
    seeing the reloc_complete probe for libbar.  However, within glibc's
    data structure, the previous library is libfoo, and this is why we see
    the following warning from GDB:
    
      warning: Corrupted shared library list: 0x7ffff7ffd988 != 0x405ee0
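
The back-pointer check described above can be illustrated with a minimal sketch.  It is a simplified model under stated assumptions: `LinkMap`, `append`, `on_reloc_complete` and `walk` are hypothetical names, the list contents are invented, and GDB's real logic is more involved than a single pointer comparison.

```python
class LinkMap:
    """Toy stand-in for one entry in glibc's doubly linked library list."""

    def __init__(self, name):
        self.name = name
        self.prev = None
        self.next = None


def append(tail, name):
    """Append a new entry after `tail` and return it."""
    node = LinkMap(name)
    node.prev = tail
    if tail is not None:
        tail.next = node
    return node


def on_reloc_complete(new, last_known):
    """Model the check: is the entry before `new` the last one we know?"""
    if new.prev is last_known:
        return "incremental"
    return "full reload"  # the case that produces the warning


def walk(head):
    """Walk the whole list, as the fallback full reload does."""
    names = []
    while head is not None:
        names.append(head.name)
        head = head.next
    return names


main_exe = append(None, "main")
libc = append(main_exe, "libc")
last_known = libc                   # the debugger's view before any dlopen

libfoo = append(libc, "libfoo")     # on the list, reloc_complete not yet hit
libbar = append(libfoo, "libbar")   # added by the nested dlopen

print(on_reloc_complete(libbar, last_known))
print(walk(main_exe))
```

In this model the event for libbar fails the check because its predecessor is libfoo, and the fallback walk then returns libfoo along with everything else.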
    
    Now, when GDB emits that warning it falls back to performing a
    complete reload of all the shared libraries.  This is done by walking
    glibc's data structure to find all the libraries.  This will include
    libfoo, which has not yet been relocated.
    
    Unfortunately, there is nothing in glibc's data structure (that is
    visible to GDB) that can tell us that libfoo is not yet relocated, as
    a result, GDB will believe that libfoo has been fully relocated, and
    will announce the library to the user.
    
    This test shows that the library is not fully relocated by stopping on
    the solib event, watching for GDB to tell us that libfoo has been
    loaded, and then printing a global variable from within the library.
    
    The global variable happens to be initialised with a pointer value,
    and so will not be correct unless relocation has been performed.  As
    we see, GDB can observe the global in an uninitialised state.
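
Why a pointer-initialised global depends on relocation can be shown with a rough sketch of a base-relative relocation, where the final value is the load base plus a link-time addend.  The addresses, the `PointerSlot` class, and the choice of 0 as the pre-relocation content are all assumptions made for illustration; what a debugger actually reads before relocation depends on the relocation type and the linker.

```python
LOAD_BASE = 0x7FFFF7FC0000  # hypothetical load address of the library
TARGET_OFFSET = 0x4010      # hypothetical link-time offset of the target


class PointerSlot:
    """Toy model of a global variable that holds a pointer."""

    def __init__(self, addend):
        self.addend = addend
        self.value = 0          # assumption: content before relocation
        self.relocated = False

    def relocate(self, base):
        # Base-relative relocation: final value = load base + addend.
        self.value = base + self.addend
        self.relocated = True


slot = PointerSlot(TARGET_OFFSET)
before = slot.value          # what an early observer would read
slot.relocate(LOAD_BASE)
after = slot.value           # what the program expects to read

print(hex(before), hex(after))
```

Reading the slot between mapping and relocation yields the `before` value, which is the state the test observes.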
    
    I don't know if there are wider implications from GDB seeing the
    library load earlier than it should.  We can, for sure, load the debug
    information at this point, but could we get anything wrong as a result
    of relocation not having been completed yet?  We could potentially
    trigger the loading of Python extensions from the library, and this
    could certainly run into problems if the Python code reads any globals
    that it expects to be initialised.
    
    In terms of fixing this, the only options I see would require GDB to
    be _more_ trusting of glibc, and even then, I don't think the solution
    would be perfect.  We could track the reloc_start/reloc_complete
    pairs to try and track recursion, and thus ignore libraries that have
    not been relocated yet, but this would mean we could not fall back (as
    we currently do) to just "reload everything", when we see some
    unexpected state -- as "everything" can include libraries that are not
    relocated yet.
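
One possible shape for the pair-tracking idea is sketched below.  It is not a proposed GDB patch: `RelocTracker` and its methods are hypothetical, and the model assumes the outer reloc_start is seen before the nested dlopen begins, which may not hold if the triggering malloc happens earlier in the load.

```python
class RelocTracker:
    """Toy tracker for nested reloc_start/reloc_complete pairs."""

    def __init__(self):
        self.depth = 0
        self.announced = []

    def reloc_start(self):
        self.depth += 1

    def reloc_complete(self, new_lib):
        # Only the library reported with this event is treated as relocated.
        self.depth -= 1
        self.announced.append(new_lib)
        # A depth above zero means an outer load is still in progress, so
        # walking the whole library list could find unrelocated entries.
        return self.depth == 0


tracker = RelocTracker()
tracker.reloc_start()                                  # outer load: libfoo
tracker.reloc_start()                                  # nested load: libbar
safe_after_libbar = tracker.reloc_complete("libbar")
safe_after_libfoo = tracker.reloc_complete("libfoo")

print(tracker.announced, safe_after_libbar, safe_after_libfoo)
```

The return value marks when a full walk of the list would be safe in this model, which is exactly the fallback that has to be given up while a load is nested.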
    
    Also, if we attach to a process we're stuck: the only option is to
    walk the library list and "reload everything", but at that point we
    might end up finding a library that is not relocated yet.
    
    Ultimately, the right solution is for glibc to ensure that we really
    do only add the library to the library list just prior to hitting the
    reloc_complete probe.
    
    Well, to maintain the existing API, I think glibc would need to add
    the library to the list just prior to map_complete, then remove the
    library again just after reloc_start, before adding the library
    again at reloc_complete -- which really sucks.  Or maybe glibc needs
    to be smarter and "preallocate" its required memory ahead of time
    before mapping and relocating the library...
Comment 1 Andrew Burgess 2023-08-15 14:31:17 UTC
I created glibc bug: https://sourceware.org/bugzilla/show_bug.cgi?id=30766 for the glibc side of this issue.