Bug 29039 - Corrupt DTV after reuse of a TLS module ID following dlclose with unused TLS
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: dynamic-link
Version: 2.35
Importance: P2 normal
Target Milestone: 2.39
Assignee: Not yet assigned to anyone
URL: https://gitlab.freedesktop.org/mesa/m...
 
Reported: 2022-04-10 05:36 UTC by alex_y_xu
Modified: 2023-12-22 17:00 UTC

fweimer: security-


Description alex_y_xu 2022-04-10 05:36:08 UTC
Many users have reported that Mesa 22 causes GTK4 programs to segfault on start with the crocus driver under X. This is caused by _dl_tlsdesc_dynamic returning a large negative value, which when added to FS results in a value slightly above zero. Unfortunately, I have been unable to find any reduced test case, even after testing many sequences of dlopen and dlclose calls. Additionally, the issue reportedly does not affect Fedora Linux. The shortest reproduction steps I have found are:

1. Use a GPU supported by the Mesa crocus driver; all i915 through Haswell Intel GPUs should be supported.

2. podman run -it --privileged --net=host -v /tmp/.X11-unix:/tmp/.X11-unix --rm archlinux. It may be possible to use docker instead of podman; I did not test this. Alternatively, install Arch Linux.

3. Run:

pacman -Syu
pacman -U https://archive.archlinux.org/packages/m/mesa/mesa-22.0.1-2-x86_64.pkg.tar.zst
pacman -S gnome-chess
useradd -m -u 1000 -g 1000 user # set to your host UID/GID
su - user
DISPLAY=:0 gnome-chess

This should segfault in lookup_opcode_desc with a small pointer dereference which was computed in brw_opcode_desc from a call to _dl_tlsdesc_dynamic. Unfortunately, there are no symbols for Mesa, but there are dynamic symbols for glibc. Debug output:

$ DISPLAY=:0 gdb gnome-chess
[ ... ]
(gdb) b _dl_tlsdesc_dynamic
Breakpoint 1 at 0x7ffff7fdc600
(gdb) r
[ ... ]
Thread 1 "gnome-chess" hit Breakpoint 1, 0x00007ffff7fdc600 in _dl_tlsdesc_dynamic () from /lib64/ld-linux-x86-64.so.2
(gdb) info reg
rax            0x7fffebe36898      140737150937240
rbx            0x7fffffff7e60      140737488322144
rcx            0x0                 0
rdx            0x7fffffff81a0      140737488322976
rsi            0x555556426d00      93825007774976
rdi            0x7fffffff8190      140737488322960
rbp            0x7fffffff8310      0x7fffffff8310
rsp            0x7fffffff7e58      0x7fffffff7e58
r8             0x5555561fec90      93825005513872
r9             0x1                 1
r10            0x7fffebb248c0      140737147717824
r11            0x91df835f16e6916b  -7935479573974183573
r12            0x5555564c0a60      93825008405088
r13            0x0                 0
r14            0x7fffffff87b0      140737488324528
r15            0x7fffffff82c0      140737488323264
rip            0x7ffff7fdc600      0x7ffff7fdc600 <_dl_tlsdesc_dynamic>
eflags         0x202               [ IF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
(gdb) fin
Run till exit from #0  0x00007ffff7fdc600 in _dl_tlsdesc_dynamic () from /lib64/ld-linux-x86-64.so.2
[Thread 0x7fffe9571640 (LWP 921) exited]
[Thread 0x7fffe9d72640 (LWP 920) exited]
0x00007fffeb3d8979 in ?? () from /usr/lib/dri/crocus_dri.so
(gdb) p $rax
$2 = -140737272884288
(gdb) p $rax+$fs_base
$3 = 0

Exporting LD_PRELOAD=/usr/local/lib/dri/crocus_dri.so avoids the crash, as static TLS is used in that case. More interestingly, exporting LD_PRELOAD=/usr/lib/libLLVM.so also avoids the crash, despite still using dynamic TLS. Exporting LD_BIND_NOW=1 does not avoid the crash.
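
The arithmetic behind the debug output above can be sketched in C. This is a simplified model, assuming a 64-bit target and assuming that the resolver's fast path returns the variable's address minus the thread pointer; the function names and all numeric values are made up for illustration and are not taken from glibc or from this trace.

```c
#include <assert.h>
#include <stdint.h>

/* Fast-path result of a dynamic TLS descriptor resolver in this
   simplified model: the variable's address minus the thread pointer.  */
int64_t
resolver_result (uint64_t thread_pointer, uint64_t dtv_entry,
                 uint64_t offset_in_module)
{
  return (int64_t) (dtv_entry + offset_in_module - thread_pointer);
}

/* The caller adds the thread pointer (FS base on x86-64) back.  */
uint64_t
variable_address (uint64_t thread_pointer, int64_t result)
{
  return thread_pointer + (uint64_t) result;
}

/* Usage example with hypothetical numbers.  */
void
tlsdesc_arithmetic_example (void)
{
  const uint64_t tp = UINT64_C (0x7fffe8000000);     /* hypothetical thread pointer */
  const uint64_t block = UINT64_C (0x7fffe0001000);  /* hypothetical TLS block */
  const uint64_t off = 0x40;                         /* hypothetical variable offset */

  /* Healthy DTV entry: the sum lands inside the TLS block.  */
  assert (variable_address (tp, resolver_result (tp, block, off))
          == block + off);

  /* NULL DTV entry: the result is a large negative number, and adding
     the thread pointer leaves only the offset, a pointer just above zero.  */
  int64_t bad = resolver_result (tp, 0, off);
  assert (bad < 0);
  assert (variable_address (tp, bad) == off);
}
```

Under these assumptions, a large negative return value is normal for a healthy lookup too; the symptom is that the sum with the thread pointer is a tiny address rather than one inside a TLS block.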
Comment 1 Florian Weimer 2022-04-12 11:27:05 UTC
Is there a way to get glibc debugging information on archlinux? I enabled debuginfod, and it downloaded some debugging information, but not for glibc.

It looks like the fast path in _dl_tlsdesc_dynamic is taken, and I need to check what the data structures look like.
Comment 2 alex_y_xu 2022-04-12 15:18:31 UTC
(In reply to Florian Weimer from comment #1)
> Is there a way to get glibc debugging information on archlinux? I enabled
> debuginfod, and it downloaded some debugging information, but not for glibc.
> 
> It looks like the fast path in _dl_tlsdesc_dynamic is taken, and I need to
> check what the data structures look like.

Thanks for looking into this!

As far as I know, Arch Linux currently doesn't have any public debug symbols for the distro-packaged glibc. If you're more familiar with Ubuntu, that may be preferable. It was originally reported on Ubuntu, but I had some issues installing old Mesa packages on Ubuntu, whereas it is a single command on Arch. I think Ubuntu has debug symbols for glibc though. If you'd like to continue using Arch, following these steps should build and install a standard glibc package with debug symbols:

sed -i -e '/^BUILDENV=/s/check/!check/' -e '/^OPTIONS=/s/!debug/debug/' -e 's/^#MAKEFLAGS="-j2"$/MAKEFLAGS="-j'$(nproc)'"/' /etc/makepkg.conf
pacman -S base-devel asp sudo
sed -i -e 's/# %wheel ALL=(ALL:ALL) NOPASSWD: ALL/%wheel ALL=(ALL:ALL) NOPASSWD: ALL/' /etc/sudoers
su - user
asp checkout glibc
cd glibc/trunk
gpg --recv-keys 16792B4EA25340F8
makepkg -si

I tested approximately this method and was able to reproduce the issue on bare metal. Alternatively, it may be possible to manually install glibc with ./configure; make; make install. I didn't test this method; it may be necessary to source /etc/makepkg.conf; export CFLAGS LDFLAGS in order to reproduce the issue.
Comment 3 Sam James 2023-11-26 14:17:58 UTC
This has come up at https://bugzilla.redhat.com/show_bug.cgi?id=2251557 in the context of Asahi Linux (porting Linux and userland to Apple ARM Macs).

See https://gitlab.gnome.org/GNOME/gnome-shell/-/issues/7199 as well.
Comment 4 Sourceware Commits 2023-11-28 17:29:28 UTC
The master branch has been updated by Szabolcs Nagy <nsz@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=3921c5b40f293c57cb326f58713c924b0662ef59

commit 3921c5b40f293c57cb326f58713c924b0662ef59
Author: Hector Martin <marcan@marcan.st>
Date:   Tue Nov 28 15:23:07 2023 +0900

    elf: Fix TLS modid reuse generation assignment (BZ 29039)
    
    _dl_assign_tls_modid() assigns a slotinfo entry for a new module, but
    does *not* do anything to the generation counter. The first time this
    happens, the generation is zero and map_generation() returns the current
    generation to be used during relocation processing. However, if
    a slotinfo entry is later reused, it will already have a generation
    assigned. If this generation has fallen behind the current global max
    generation, then this causes an obsolete generation to be assigned
    during relocation processing, as map_generation() returns this
    generation if nonzero. _dl_add_to_slotinfo() eventually resets the
    generation, but by then it is too late. This causes DTV updates to be
    skipped, leading to NULL or broken TLS slot pointers and segfaults.
    
    Fix this by resetting the generation to zero in _dl_assign_tls_modid(),
    so it behaves the same as the first time a slot is assigned.
    _dl_add_to_slotinfo() will still assign the correct static generation
    later during module load, but relocation processing will no longer use
    an obsolete generation.
    
    Note that slotinfo entry (aka modid) reuse typically happens after a
    dlclose and only TLS access via dynamic tlsdesc is affected. Because
    tlsdesc is optimized to use the optional part of static TLS, dynamic
    tlsdesc can be avoided by increasing the glibc.rtld.optional_static_tls
    tunable to a large enough value, or by LD_PRELOAD-ing the affected
    modules.
    
    Fixes bug 29039.
    
    Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
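
The generation bookkeeping described in the commit message above can be illustrated with a toy model. This is a sketch only: the names assign_modid and map_generation are borrowed from the commit message, but the data layout, the helper functions finish_load and close_module, and the exact points where the global generation is bumped are simplifying assumptions, not glibc's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NSLOTS = 4 };

/* Toy slotinfo entry: which module owns the modid, and its generation.  */
struct slot { const void *map; size_t gen; };

static struct slot slots[NSLOTS];
static size_t global_generation;

/* Claim the first free slot.  reset_gen == false models the behaviour
   before the fix, where the slot's old generation is left in place.  */
static size_t
assign_modid (const void *map, bool reset_gen)
{
  for (size_t i = 0; i < NSLOTS; i++)
    if (slots[i].map == NULL)
      {
        slots[i].map = map;
        if (reset_gen)
          slots[i].gen = 0;
        return i;
      }
  assert (!"toy model ran out of slots");
  return NSLOTS;
}

/* Generation used for relocation processing: the stored one if it is
   nonzero, otherwise the one the pending load will receive.  */
static size_t
map_generation (size_t modid, const void *map)
{
  if (slots[modid].map == map && slots[modid].gen != 0)
    return slots[modid].gen;
  return global_generation + 1;
}

/* In this model, finishing a load or a close bumps the global
   generation and stamps the affected slot with it.  */
static void
finish_load (size_t modid)
{
  slots[modid].gen = ++global_generation;
}

static void
close_module (size_t modid)
{
  slots[modid].map = NULL;
  slots[modid].gen = ++global_generation;
}

/* Load A and B, close both, then start loading C into A's old slot.
   Returns the generation that relocation processing for C would see.  */
size_t
generation_seen_on_reuse (bool reset_gen)
{
  static const char mod_a, mod_b, mod_c;  /* addresses act as module identities */

  for (size_t i = 0; i < NSLOTS; i++)
    slots[i] = (struct slot) { NULL, 0 };
  global_generation = 0;

  size_t a = assign_modid (&mod_a, reset_gen);
  finish_load (a);                 /* global generation 1 */
  size_t b = assign_modid (&mod_b, reset_gen);
  finish_load (b);                 /* global generation 2 */
  close_module (a);                /* global generation 3, slot 0 is free */
  close_module (b);                /* global generation 4, slot 0 is now behind */

  size_t c = assign_modid (&mod_c, reset_gen);
  return map_generation (c, &mod_c);
}
```

In this model, generation_seen_on_reuse (false) yields 3, which is older than the global generation of 4, while generation_seen_on_reuse (true) yields 5, the generation the new load is about to receive.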
Comment 5 Sourceware Commits 2023-12-20 08:46:06 UTC
The master branch has been updated by Szabolcs Nagy <nsz@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=980450f12685326729d63ff72e93a996113bf073

commit 980450f12685326729d63ff72e93a996113bf073
Author: Szabolcs Nagy <szabolcs.nagy@arm.com>
Date:   Wed Nov 29 11:31:37 2023 +0000

    elf: Add TLS modid reuse test for bug 29039
    
    This is a minimal regression test for bug 29039 which only affects
    targets with TLSDESC and a reproducer requires that
    
    1) Have modid gaps (closed modules) with old generation.
    2) Update a DTV to a newer generation (needs a newer dlopen).
    3) But do not update the closed gap entry in that DTV.
    4) Reuse the modid gap for a new module (another dlopen).
    5) Use dynamic TLSDESC in that new module with old generation (bug).
    6) Access TLS via this TLSDESC and the now outdated DTV.
    
    However step (3) in practice rarely happens: during DTV update the
    entries for closed modids are initialized to "unallocated" and then
    dynamic TLSDESC calls __tls_get_addr independently of its generation.
    The only exception to this is DTV setup at thread creation (gaps are
    initialized to NULL instead of unallocated) or DTV resize where the
    gap entries are outside the previous DTV array (again NULL instead
    of unallocated, and this requires loading > DTV_SURPLUS modules).
    
    So the bug can only cause NULL (+ offset) dereference, not use after
    free. And the easiest way to get (3) is via thread creation.
    
    Note that step (5) requires that the newly loaded module has larger
    TLS than the remaining optional static TLS. And for (6) there cannot
    be other TLS access or dlopen in the thread that updates the DTV.
    
    Tested on aarch64-linux-gnu.
    
    Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
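
Why step (3) matters can be shown with a small model of the descriptor fast path. This is a hedged sketch: the structures, the SLOT_UNALLOCATED marker, and the slow_path stand-in for __tls_get_addr are assumptions made for illustration, and addresses are modelled as plain integers with the thread pointer left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { NMODS = 4 };
#define SLOT_UNALLOCATED UINTPTR_MAX  /* stands in for the "unallocated" marker */

struct dtv  { size_t generation; uintptr_t slot[NMODS]; };
struct desc { size_t gen_count; size_t modid; uintptr_t offset; };

/* Stand-in for the slow path: bring the DTV up to date and allocate
   the module's TLS block if the slot does not have one yet.  */
static uintptr_t
slow_path (struct dtv *dtv, const struct desc *d, size_t current_gen,
           uintptr_t fresh_block)
{
  dtv->generation = current_gen;
  if (dtv->slot[d->modid] == 0 || dtv->slot[d->modid] == SLOT_UNALLOCATED)
    dtv->slot[d->modid] = fresh_block;
  return dtv->slot[d->modid] + d->offset;
}

/* Dynamic descriptor lookup: use the DTV entry directly only when the
   DTV is at least as new as the descriptor and the slot is not marked
   unallocated; otherwise fall back to the slow path.  */
uintptr_t
resolve (struct dtv *dtv, const struct desc *d, size_t current_gen,
         uintptr_t fresh_block)
{
  assert (d->modid < NMODS);
  if (d->gen_count <= dtv->generation
      && dtv->slot[d->modid] != SLOT_UNALLOCATED)
    return dtv->slot[d->modid] + d->offset;
  return slow_path (dtv, d, current_gen, fresh_block);
}

/* Usage example with hypothetical generations and addresses.  */
void
tlsdesc_gap_example (void)
{
  const uintptr_t block = 0x10000;            /* hypothetical fresh TLS block */
  const struct desc stale = { 3, 0, 0x40 };   /* descriptor with an old generation */
  const struct desc fresh = { 5, 0, 0x40 };   /* descriptor with the right generation */

  /* Gap left as NULL (thread creation) plus stale descriptor: the fast
     path is taken and the result is NULL + offset.  */
  struct dtv created = { 4, { 0 } };
  assert (resolve (&created, &stale, 5, block) == 0x40);

  /* Same DTV, correct descriptor generation: slow path, valid address.  */
  struct dtv created2 = { 4, { 0 } };
  assert (resolve (&created2, &fresh, 5, block) == block + 0x40);

  /* Gap marked unallocated (DTV update): slow path even when stale.  */
  struct dtv updated = { 4, { SLOT_UNALLOCATED } };
  assert (resolve (&updated, &stale, 5, block) == block + 0x40);
}
```

The three cases correspond to the NULL gap with the bug, the NULL gap with a correctly assigned generation, and the "unallocated" gap that hides the bug in most real runs.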
Comment 6 Sam James 2023-12-22 14:43:37 UTC
Please cherry-pick to 2.38 at least (along with the test commit).

Anyway, closing given this is fixed for 2.39.
Comment 7 Sourceware Commits 2023-12-22 16:58:52 UTC
The release/2.38/master branch has been updated by Szabolcs Nagy <nsz@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ccdc4cba07684fe1397e1f5f134a0a827af98c04

commit ccdc4cba07684fe1397e1f5f134a0a827af98c04
Author: Hector Martin <marcan@marcan.st>
Date:   Tue Nov 28 15:23:07 2023 +0900

    elf: Fix TLS modid reuse generation assignment (BZ 29039)
    
    (cherry picked from commit 3921c5b40f293c57cb326f58713c924b0662ef59)
Comment 8 Sourceware Commits 2023-12-22 16:58:57 UTC
The release/2.38/master branch has been updated by Szabolcs Nagy <nsz@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0de9082ed8d8f149ca87d569a73692046e236c18

commit 0de9082ed8d8f149ca87d569a73692046e236c18
Author: Szabolcs Nagy <szabolcs.nagy@arm.com>
Date:   Wed Nov 29 11:31:37 2023 +0000

    elf: Add TLS modid reuse test for bug 29039
    
    (cherry picked from commit 980450f12685326729d63ff72e93a996113bf073)
Comment 9 Sourceware Commits 2023-12-22 16:59:36 UTC
The release/2.37/master branch has been updated by Szabolcs Nagy <nsz@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=874d4186975560fb79d5ebd46a4f378a2e3f7657

commit 874d4186975560fb79d5ebd46a4f378a2e3f7657
Author: Hector Martin <marcan@marcan.st>
Date:   Tue Nov 28 15:23:07 2023 +0900

    elf: Fix TLS modid reuse generation assignment (BZ 29039)
    
    (cherry picked from commit 3921c5b40f293c57cb326f58713c924b0662ef59)
Comment 10 Sourceware Commits 2023-12-22 16:59:56 UTC
The release/2.36/master branch has been updated by Szabolcs Nagy <nsz@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=882a991620fcf2ecb3f623e2d29ac551b33bd6ee

commit 882a991620fcf2ecb3f623e2d29ac551b33bd6ee
Author: Hector Martin <marcan@marcan.st>
Date:   Tue Nov 28 15:23:07 2023 +0900

    elf: Fix TLS modid reuse generation assignment (BZ 29039)
    
    (cherry picked from commit 3921c5b40f293c57cb326f58713c924b0662ef59)
Comment 11 Sourceware Commits 2023-12-22 17:00:11 UTC
The release/2.35/master branch has been updated by Szabolcs Nagy <nsz@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=5f08ec08d03930050befec16fcc6264fa00c66fe

commit 5f08ec08d03930050befec16fcc6264fa00c66fe
Author: Hector Martin <marcan@marcan.st>
Date:   Tue Nov 28 15:23:07 2023 +0900

    elf: Fix TLS modid reuse generation assignment (BZ 29039)
    
    (cherry picked from commit 3921c5b40f293c57cb326f58713c924b0662ef59)
Comment 12 Sourceware Commits 2023-12-22 17:00:37 UTC
The release/2.34/master branch has been updated by Szabolcs Nagy <nsz@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f95fe7060895bfe28ea5bdf8de240e01c1dea097

commit f95fe7060895bfe28ea5bdf8de240e01c1dea097
Author: Hector Martin <marcan@marcan.st>
Date:   Tue Nov 28 15:23:07 2023 +0900

    elf: Fix TLS modid reuse generation assignment (BZ 29039)
    
    (cherry picked from commit 3921c5b40f293c57cb326f58713c924b0662ef59)