Bug 31717 - TLS for library gets reallocated when loaded in two contexts.
Summary: TLS for library gets reallocated when loaded in two contexts.
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: dynamic-link (show other bugs)
Version: 2.39
: P2 normal
Target Milestone: 2.41
Assignee: Florian Weimer
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-05-09 13:44 UTC by Ben Woodard
Modified: 2024-09-06 14:00 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security?


Attachments
reproducer (1.28 KB, application/x-xz)
2024-05-09 13:44 UTC, Ben Woodard
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Ben Woodard 2024-05-09 13:44:35 UTC
Created attachment 15507 [details]
reproducer

When a library is loaded and relocated as part of an application and it has TLS memory that it updates and then the library is dynamically loaded with dlopen(), the second load can change the generation counter leaving the library's TLS area set as unallocated causing it to be reallocated the next time that it is used. This results in any information that had been stored in the TLS before the dlopen() of the library being lost.

This problem was reported by a customer when using libomp and librocprofiler and in that case libomp loses the mappings to its threads.

This problem seems to have existed for quite some time. I have verified that it exists as far back as glibc-2.28 in RHEL8 and it still exists in the latest glibc-2.39 found in rawhide. In other words it seems like practically all versions of glibc are affected. 

The sequence of operations is as follows:

    Libraries A and B are loaded and relocated
    A's init constructor is called:
        Inside, A calls a function that resolves to B
        B accesses and alters its TLS
    B is then dlopen()'d by "C" (which may be A or B or neither)
        Inside, Glibc advances the generation counter and marks B's TLS as "unallocated"
    B accesses its TLS again, changes from before are lost

In the failing case (audit + rocprof), B is libhpcrun.so, A is libomp.so and "C" is librocprofiler.so. In the suspicious case (no audit + rocprof), B is libomp.so, and A and "C" are libomptarget.so.

Sometimes this bug is masked by the fact that B's TLS is a static block and so even though its TLS gets "reallocated" in the middle it gets the same memory back and it isn't reinitialized in between, so it looks like nothing happened. In other cases, the library whose TLS gets reallocated is written robustly enough that it simply reinitializes its TLS data and the only apparent effect is a loss of allocated memory.
Comment 1 Florian Weimer 2024-05-13 08:20:54 UTC
The reproducer is a bit dodgy because it calls B_init in libB.so before the ELF constructor of libB.so has run. This exposed a bug in dlopen because it uses l_init_called as a proxy to decide whether to do TLS processing:

/* Call _dl_add_to_slotinfo with DO_ADD set to false, to allocate
   space in GL (dl_tls_dtv_slotinfo_list).  This can raise an
   exception.  The return value is true if any of the new objects use
   TLS.  */
static bool
resize_tls_slotinfo (struct link_map *new)
{
  bool any_tls = false;
  for (unsigned int i = 0; i < new->l_searchlist.r_nlist; ++i)
    {
      struct link_map *imap = new->l_searchlist.r_list[i];

      /* Only add TLS memory if this object is loaded now and
         therefore is not yet initialized.  */
      if (! imap->l_init_called && imap->l_tls_blocksize > 0)
        {
          _dl_add_to_slotinfo (imap, false);
          any_tls = true;
        }
    }
  return any_tls;
}

The recursive dlopen call runs this code again (before invoking the libB.so ELF constructor), overwriting the TLS slot for the object with a newer generation ID. This triggers reallocation of the TLS block in the next _tls_get_addr call.

Curiously, this is not a regression introduced by bug 20839, but maybe there are more cases now where this bug occurs.
Comment 2 Florian Weimer 2024-05-13 09:56:21 UTC
Technically, this is a use-after-free bug, although in the reproducer, the TLS block is comes back from malloc with the same address.
Comment 3 Florian Weimer 2024-05-13 17:40:27 UTC
Patch posted:

[PATCH] elf: Avoid re-initializing already allocated TLS in dlopen (bug 31717)
<https://inbox.sourceware.org/libc-alpha/878r0dmx0d.fsf@oldenburg.str.redhat.com/>
Comment 4 Florian Weimer 2024-09-06 14:00:02 UTC
Fixed for glibc 2.41 via:

commit 5097cd344fd243fb8deb6dec96e8073753f962f9
Author: Florian Weimer <fweimer@redhat.com>
Date:   Thu Aug 1 23:31:30 2024 +0200

    elf: Avoid re-initializing already allocated TLS in dlopen (bug 31717)
    
    The old code used l_init_called as an indicator for whether TLS
    initialization was complete.  However, it is possible that
    TLS for an object is initialized, written to, and then dlopen
    for this object is called again, and l_init_called is not true at
    this point.  Previously, this resulted in TLS being initialized
    twice, discarding any interim writes (technically introducing a
    use-after-free bug even).
    
    This commit introduces an explicit per-object flag, l_tls_in_slotinfo.
    It indicates whether _dl_add_to_slotinfo has been called for this
    object.  This flag is used to avoid double-initialization of TLS.
    In update_tls_slotinfo, the first_static_tls micro-optimization
    is removed because preserving the initalization flag for subsequent
    use by the second loop for static TLS is a bit complicated, and
    another per-object flag does not seem to be worth it.  Furthermore,
    the l_init_called flag is dropped from the second loop (for static
    TLS initialization) because l_need_tls_init on its own prevents
    double-initialization.
    
    The remaining l_init_called usage in resize_scopes and update_scopes
    is just an optimization due to the use of scope_has_map, so it is
    not changed in this commit.
    
    The isupper check ensures that libc.so.6 is TLS is not reverted.
    Such a revert happens if l_need_tls_init is not cleared in
    _dl_allocate_tls_init for the main_thread case, now that
    l_init_called is not checked anymore in update_tls_slotinfo
    in elf/dl-open.c.
    
    Reported-by: Jonathon Anderson <janderson@rice.edu>
    Reviewed-by: Carlos O'Donell <carlos@redhat.com>