Summary: | dl-tls.c assert failure at concurrent pthread_create and dlopen | ||
---|---|---|---|
Product: | glibc | Reporter: | Szabolcs Nagy <nszabolcs> |
Component: | dynamic-link | Assignee: | Szabolcs Nagy <nszabolcs> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | azhelev+sourceware, bogdan.ilchyshyn, carlos, chan45.lee, dongkyun.s, fweimer, hk0110.choi, jg, jrmuizel, lilydjwg, louis.benazet, lukasz.koniecki, lvying.system.thoughts, markus, mavant, mgulick, mmezeul, nsz, P, robert, sebastien, sh0924.hwang, shanzhikun, sjkingo88, slyich, sourceware.org, sourceware.org, stli, thomas.perret+glibc, toolchain, xujing99 |
Priority: | P2 | Flags: | fweimer:
security-
|
Version: | 2.22 | ||
Target Milestone: | 2.34 | ||
See Also: |
https://sourceware.org/bugzilla/show_bug.cgi?id=19924 https://sourceware.org/bugzilla/show_bug.cgi?id=28357 https://sourceware.org/bugzilla/show_bug.cgi?id=27135 |
||
Host: | Target: | ||
Build: | Last reconfirmed: | ||
Attachments: |
test case (main module)
test case (build script) |
Description
Szabolcs Nagy
2015-12-04 12:37:03 UTC
Hi, I've suggested a patch for this bug: https://sourceware.org/ml/libc-alpha/2015-12/msg00570.html Created attachment 8893 [details]
test case (main module)
Created attachment 8894 [details]
test case (build script)
assigned this to myself, will work on it for 2.24, the current latest patch is https://sourceware.org/ml/libc-alpha/2016-01/msg00480.html Is this patch still being reviewed? The last update I see is https://sourceware.org/ml/libc-alpha/2016-01/msg00620.html, but I'm not familiar with how issue tracking works for this project so I could easily have missed something... I sometimes see the same failure during make check: env GCONV_PATH=/var/tmp/glibc-build/iconvdata LOCPATH=/var/tmp/glibc-build/localedata LC_ALL=C /var/tmp/glibc-build/elf/ld-linux-x86-64.so.2 --library-path /var/tmp/glibc-build:/var/tmp/glibc-build/math:/var/tmp/glibc-build/elf:/var/tmp/glibc-build/dlfcn:/var/tmp/glibc-build/nss:/var/tmp/glibc-build/nis:/var/tmp/glibc-build/rt:/var/tmp/glibc-build/resolv:/var/tmp/glibc-build/crypt:/var/tmp/glibc-build/mathvec:/var/tmp/glibc-build/support:/var/tmp/glibc-build/nptl /var/tmp/glibc-build/nptl/tst-stack4 > /var/tmp/glibc-build/nptl/tst-stack4.out; \ ../scripts/evaluate-test.sh nptl/tst-stack4 $? false false > /var/tmp/glibc-build/nptl/tst-stack4.test-result Inconsistency detected by ld.so: dl-tls.c: 488: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed! This is unfortunately not reproducible. We were often hitting this issue with some multithreaded binaries with many shared libs. These patches referenced here, address the issue. Specifically: https://patches.linaro.org/patch/85007/ https://patches.linaro.org/patch/85008/ These have been _extensively_ tested here with glibc-2.23 with many binaries (In reply to Pádraig Brady from comment #7) > We were often hitting this issue with some multithreaded binaries with many > shared libs. These patches referenced here, address the issue. Specifically: > https://patches.linaro.org/patch/85007/ > https://patches.linaro.org/patch/85008/ > > These have been _extensively_ tested here with glibc-2.23 with many binaries Please repost those to libc-alpha so we can review. We found an off by one issue with this (with ASAN + certain number of shared libs). When the last vector in the _dl_allocate_tls_init list of vectors was of size one it would have been skipped. The fix is: diff --git a/elf/dl-tls.c b/elf/dl-tls.c index 073321c..2c9ad2a 100644 --- a/elf/dl-tls.c +++ b/elf/dl-tls.c @@ -571,7 +571,7 @@ _dl_allocate_tls_init (void *result) } total += cnt; - if (total >= dtv_slots) + if (total > dtv_slots) break; /* Synchronize with dl_add_to_slotinfo. */ Has there been any activity on this one lately? Does anyone know if a fix will be coming anytime soon? This has been _very_ well tested at facebook Note the additional fix in comment #9 It would be great to merge this. thanks! (In reply to Pádraig Brady from comment #11) > This has been _very_ well tested at facebook > Note the additional fix in comment #9 > It would be great to merge this. thanks! sorry i didnt have time to work on this in this release cycle, i'll try to look at it in the next one if others don't beat me to it (the comments can be improved, dtv_slots should be fixed so it has consistent meaning and one should reason about the consequences of removing the asserts, they might catch valid corruption that is still present via dlclose races that are not fixed). (In reply to Szabolcs Nagy from comment #12) > sorry i didnt have time to work on this in this release cycle, i'll try to > look at it in the next one if others don't beat me to it (the comments can > be improved, dtv_slots should be fixed so it has consistent meaning and one > should reason about the consequences of removing the asserts, they might > catch valid corruption that is still present via dlclose races that are not > fixed). Any update on this? It has been over a year since the last comment. Just want to add that the two patches posted here (and the off-by-one fix in the comments) have been running by my employer (MathWorks) on at least 1000 Debian 9 systems for the past 6 months without issue. It would be great if these patches could be accepted into glibc itself. (In reply to Mike Gulick from comment #14) > Just want to add that the two patches posted here (and the off-by-one fix in > the comments) have been running by my employer (MathWorks) on at least 1000 > Debian 9 systems for the past 6 months without issue. It would be great if > these patches could be accepted into glibc itself. None of these changes are easy to integrate because they fail to explain in detailed notes why they are correct. We take the conservative approach not to apply complete solutions. Someone seeing this problem has to take the position to champion the broader solution as positioned by Szabolcs from Arm. Alternatively someone needs to explain why the partial solution is better than no solution and champion that on libc-alpha. This testcase seems to reproduce the bug pretty reliably. https://github.com/jrmuizel/dl-open-race Sorry, it actually does not. And I see there was already a testcase posted here. https://sourceware.org/ml/libc-alpha/2016-11/msg00917.html In https://bugs.gentoo.org/719674#c12 gentoo sees nptl/tst-stack4 crashes somewhat reliably on arm64: # while :; do date; env GCONV_PATH=/var/tmp/portage/sys-libs/glibc-2.30-r8/work/build-arm64-aarch64-unknown-linux-gnu-nptl/iconvdata LOCPATH=/var/tmp/portage/sys-libs/glibc-2.30-r8/work/build-arm64-aarch64-unknown-linux-gnu-nptl/localedata LC_ALL=C /var/tmp/portage/sys-libs/glibc-2.30-r8/work/build-arm64-aarch64-unknown-linux-gnu-nptl/elf/ld-linux-aarch64.so.1 --library-path /var/tmp/portage/sys-libs/glibc-2.30-r8/work/build-arm64-aarch64-unknown-linux-gnu-nptl:/var/tmp/portage/sys-libs/glibc-2.30-r8/work/build-arm64-aarch64-unknown-linux-gnu-nptl/math:/var/tmp/portage/sys-libs/glibc-2.30-r8/work/build-arm64-aarch64-unknown-linux-gnu-nptl/elf:/var/tmp/portage/sys-libs/glibc-2.30-r8/work/build-arm64-aarch64-unknown-linux-gnu-nptl/dlfcn:/var/tmp/portage/sys-libs/glibc-2.30-r8/work/build-arm64-aarch64-unknown-linux-gnu-nptl/nss:/var/tmp/portage/sys-libs/glibc-2.30-r8/work/build-arm64-aarch64-unknown-linux-gnu-nptl/nis:/var/tmp/portage/sys-libs/glibc-2.30-r8/work/build-arm64-aarch64-unknown-linux-gnu-nptl/rt:/var/tmp/portage/sys-libs/glibc-2.30-r8/work/build-arm64-aarch64-unknown-linux-gnu-nptl/resolv:/var/tmp/portage/sys-libs/glibc-2.30-r8/work/build-arm64-aarch64-unknown-linux-gnu-nptl/mathvec:/var/tmp/portage/sys-libs/glibc-2.30-r8/work/build-arm64-aarch64-unknown-linux-gnu-nptl/support:/var/tmp/portage/sys-libs/glibc-2.30-r8/work/build-arm64-aarch64-unknown-linux-gnu-nptl/crypt:/var/tmp/portage/sys-libs/glibc-2.30-r8/work/build-arm64-aarch64-unknown-linux-gnu-nptl/nptl::/var/tmp/portage/sys-libs/glibc-2.30-r8/work/build-arm64-aarch64-unknown-linux-gnu-nptl//dlfcn /var/tmp/portage/sys-libs/glibc-2.30-r8/work/build-arm64-aarch64-unknown-linux-gnu-nptl/nptl/tst-stack4; done Sun 03 May 2020 10:42:08 AM UTC Sun 03 May 2020 10:42:21 AM UTC Sun 03 May 2020 10:42:34 AM UTC Didn't expect signal from child: got `Segmentation fault' ... Sun 03 May 2020 10:42:56 AM UTC malloc(): invalid size (unsorted) Didn't expect signal from child: got `Aborted' .. Sun 03 May 2020 10:46:21 AM UTC free(): corrupted unsorted chunks Didn't expect signal from child: got `Aborted' ... Sun 03 May 2020 10:46:55 AM UTC Didn't expect signal from child: got `Segmentation fault' Sun 03 May 2020 10:47:04 AM UTC double free or corruption (!prev) Didn't expect signal from child: got `Aborted' ... Sun 03 May 2020 10:50:54 AM UTC free(): invalid pointer Didn't expect signal from child: got `Aborted' ... Sun 03 May 2020 10:52:12 AM UTC tst-stack4: malloc.c:2379: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed. Didn't expect signal from child: got `Aborted' Does it look like the same issue described here? # lscpu Architecture: aarch64 Byte Order: Little Endian CPU(s): 96 On-line CPU(s) list: 0-95 Thread(s) per core: 1 Core(s) per socket: 48 Socket(s): 2 Vendor ID: Cavium Model: 1 Model name: ThunderX 88XX Stepping: 0x1 BogoMIPS: 200.00 L1d cache: 32K L1i cache: 78K L2 cache: 16384K Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 # gcc -v Using built-in specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/usr/libexec/gcc/aarch64-unknown-linux-gnu/9.3.0/lto-wrapper Target: aarch64-unknown-linux-gnu Configured with: /var/tmp/portage/sys-devel/gcc-9.3.0/work/gcc-9.3.0/configure --host=aarch64-unknown-linux-gnu --build=aarch64-unknown-linux-gnu --prefix=/usr --bindir=/usr/aarch64-unknown-linux-gnu/gcc-bin/9.3.0 --includedir=/usr/lib/gcc/aarch64-unknown-linux-gnu/9.3.0/include --datadir=/usr/share/gcc-data/aarch64-unknown-linux-gnu/9.3.0 --mandir=/usr/share/gcc-data/aarch64-unknown-linux-gnu/9.3.0/man --infodir=/usr/share/gcc-data/aarch64-unknown-linux-gnu/9.3.0/info --with-gxx-include-dir=/usr/lib/gcc/aarch64-unknown-linux-gnu/9.3.0/include/g++-v9 --with-python-dir=/share/gcc-data/aarch64-unknown-linux-gnu/9.3.0/python --enable-languages=c,c++,fortran --enable-obsolete --enable-secureplt --disable-werror --with-system-zlib --enable-nls --without-included-gettext --enable-checking=release --with-bugurl=https://bugs.gentoo.org/ --with-pkgversion='Gentoo 9.3.0 p2' --disable-esp --enable-libstdcxx-time --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --disable-multilib --disable-altivec --disable-fixed-point --enable-libgomp --disable-libmudflap --disable-libssp --disable-libada --disable-systemtap --enable-vtable-verify --enable-lto --without-isl --enable-default-pie --enable-default-ssp Thread model: posix gcc version 9.3.0 (Gentoo 9.3.0 p2) # uname -r 4.9.0-4-arm64 (In reply to Sergei Trofimovich from comment #18) > ... > Sun 03 May 2020 10:46:55 AM UTC > Didn't expect signal from child: got `Segmentation fault' > Sun 03 May 2020 10:47:04 AM UTC > double free or corruption (!prev) > Didn't expect signal from child: got `Aborted' > ... > Sun 03 May 2020 10:50:54 AM UTC > free(): invalid pointer > Didn't expect signal from child: got `Aborted' > ... > Sun 03 May 2020 10:52:12 AM UTC > tst-stack4: malloc.c:2379: sysmalloc: Assertion `(old_top == initial_top > (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && > prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' > failed. > Didn't expect signal from child: got `Aborted' > > Does it look like the same issue described here? it can be related, hard to tell. (your failures are consistently heap corruptions detected in malloc/free, instead of dynamic tls related state corruption) if you can rebuild glibc try the patches from comment 7 if they don't help then your issue is different. (if the issue disappears we don't know if the new barriers just masked your issue or fixed them). (In reply to Szabolcs Nagy from comment #19) > (In reply to Sergei Trofimovich from comment #18) > > ... > > Sun 03 May 2020 10:46:55 AM UTC > > Didn't expect signal from child: got `Segmentation fault' > > Sun 03 May 2020 10:47:04 AM UTC > > double free or corruption (!prev) > > Didn't expect signal from child: got `Aborted' > > ... > > Sun 03 May 2020 10:50:54 AM UTC > > free(): invalid pointer > > Didn't expect signal from child: got `Aborted' > > ... > > Sun 03 May 2020 10:52:12 AM UTC > > tst-stack4: malloc.c:2379: sysmalloc: Assertion `(old_top == initial_top > > (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && > > prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' > > failed. > > Didn't expect signal from child: got `Aborted' > > > > Does it look like the same issue described here? > > it can be related, hard to tell. > (your failures are consistently heap corruptions > detected in malloc/free, instead of dynamic tls > related state corruption) > > if you can rebuild glibc try the patches from > comment 7 if they don't help then your issue > is different. (if the issue disappears we don't > know if the new barriers just masked your issue > or fixed them). Tried patches from #comment7 on glibc-2.30. No failures after 100 test runs. Usually fails after 3-4 runs. running latest Ubuntu 20.04.1 LTS $ ldd --version ldd (Ubuntu GLIBC 2.31-0ubuntu9) 2.31 Just got this when launching google-chrome from the command line Inconsistency detected by ld.so: ../elf/dl-tls.c: 481: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed! Command exited with non-zero status 127 Could that assert be updated to give more information if you have another assert_str() macro that could be used. It's a bit surprising there is an assert() in a release build in Ubuntu. Usually assert() would be for a debug build. On my computer any C program compiled with assert(0) dumps a core file, but this glibc issue assert does not dump a core file. Is there an issue with this assert macro in glibc? The message output on the terminal is different from the standard macro $ ./a a: a.c:6: main: Assertion `0' failed. Aborted (core dumped) $ cat a.c // gcc -Wall -o a a.c #include <assert.h> int main() { assert(0); } (In reply to Jonny Grant from comment #23) > On my computer any C program compiled with assert(0) dumps a core file, but > this glibc issue assert does not dump a core file. Is there an issue with > this assert macro in glibc? The message output on the terminal is different > from the standard macro Please ask these questions on libc-help@sourceware.org where developers can help you with any issues you have trying to put together a reproducer. Hi, when I use Szabolcs Nagy's comment 2 comment 3, I got Assertion: Inconsistency detected by ld.so: dl-tls.c: 517: _dl_allocate_tls_init: Assertion `listp != NULL' failed! Also, I try this testcase pacth: https://patchwork.ozlabs.org/project/glibc/patch/5836CC80.9070101@arm.com/ After I put patch the patch into the glibc code, and run nptl testcase, I got: ../scripts/evaluate-test.sh nptl/tst-tls7 $? false false > /root/build/build/glibc/nptl/tst-tls7.test-result Inconsistency detected by ld.so: dl-tls.c: 517: _dl_allocate_tls_init: Assertion `listp != NULL' failed! Both of the testcases got different assertion, not the assertion: Inconsistency detected by ld.so: dl-tls.c: 493: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= _rtld_local._dl_tls_generation' failed! So, how to reproduce this problem to get the same failure. And is there a plan to solve this problem? This problem has been going on for a long time. Thanks! Update test result: I use glibc2.28 source code to reproduce this problem: 1. testcase: Szabolcs Nagy's comment 2 comment 3 result: Inconsistency detected by ld.so: dl-tls.c: 517: _dl_allocate_tls_init: Assertion `listp != NULL' failed! Not the same assertion 2. testcase: add Szabolcs Nagy's testcase v3 into nptl testcase: https://patchwork.ozlabs.org/project/glibc/patch/5836CC80.9070101@arm.com/ tst-tls7 result: same as No.1 3. testcase same as NO.2, However I add usleep(1000) between _dl_add_to_slotinfo and ++GL(dl_tls_generation) at file elf/dl-open.c. tst-tls7 result: same as No.1 tst-stack4 can reliably reproduce this problem i wrote down some more background before i resubmit my patches: https://sourceware.org/pipermail/libc-alpha/2020-December/121090.html I still see this issue in 2.33 testing. I saw it recently for ppc64 BE testing. i have a new patch set that includes a different fix for this bug: https://sourceware.org/pipermail/libc-alpha/2021-February/122626.html the new fix takes the dlopen lock at thread creation time instead of just using atomics (which cannot work for fixing the same race with dlclose: bug 27111). using atomics is still necessary for tls access. it will likely take a few review iterations to get this in glibc. (In reply to Szabolcs Nagy from comment #29) > i have a new patch set that includes a different fix for this bug: > https://sourceware.org/pipermail/libc-alpha/2021-February/122626.html > > the new fix takes the dlopen lock at thread creation time instead > of just using atomics (which cannot work for fixing the same race > with dlclose: bug 27111). > > using atomics is still necessary for tls access. > > it will likely take a few review iterations to get this in glibc. Hi,Szabolcs, Do you know when will these patches be reviewed? Their Delegate is still Nobody, https://patchwork.sourceware.org/project/glibc/list/?series=1673. The master branch has been updated by Szabolcs Nagy <nsz@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1387ad6225c2222f027790e3f460e31aa5dd2c54 commit 1387ad6225c2222f027790e3f460e31aa5dd2c54 Author: Szabolcs Nagy <szabolcs.nagy@arm.com> Date: Wed Dec 30 19:19:37 2020 +0000 elf: Fix data races in pthread_create and TLS access [BZ #19329] DTV setup at thread creation (_dl_allocate_tls_init) is changed to take the dlopen lock, GL(dl_load_lock). Avoiding data races here without locks would require design changes: the map that is accessed for static TLS initialization here may be concurrently freed by dlclose. That use after free may be solved by only locking around static TLS setup or by ensuring dlclose does not free modules with static TLS, however currently every link map with TLS has to be accessed at least to see if it needs static TLS. And even if that's solved, still a lot of atomics would be needed to synchronize DTV related globals without a lock. So fix both bug 19329 and bug 27111 with a lock that prevents DTV setup running concurrently with dlopen or dlclose. _dl_update_slotinfo at TLS access still does not use any locks so CONCURRENCY NOTES are added to explain the synchronization. The early exit from the slotinfo walk when max_modid is reached is not strictly necessary, but does not hurt either. An incorrect acquire load was removed from _dl_resize_dtv: it did not synchronize with any release store or fence and synchronization is now handled separately at thread creation and TLS access time. There are still a number of racy read accesses to globals that will be changed to relaxed MO atomics in a followup patch. This should not introduce regressions compared to existing behaviour and avoid cluttering the main part of the fix. Not all TLS access related data races got fixed here: there are additional races at lazy tlsdesc relocations see bug 27137. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> The master branch has been updated by Szabolcs Nagy <nsz@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=f4f8f4d4e0f92488431b268c8cd9555730b9afe9 commit f4f8f4d4e0f92488431b268c8cd9555730b9afe9 Author: Szabolcs Nagy <szabolcs.nagy@arm.com> Date: Wed Dec 30 19:19:37 2020 +0000 elf: Use relaxed atomics for racy accesses [BZ #19329] This is a follow up patch to the fix for bug 19329. This adds relaxed MO atomics to accesses that were previously data races but are now race conditions, and where relaxed MO is sufficient. The race conditions all follow the pattern that the write is behind the dlopen lock, but a read can happen concurrently (e.g. during tls access) without holding the lock. For slotinfo entries the read value only matters if it reads from a synchronized write in dlopen or dlclose, otherwise the related dtv entry is not valid to access so it is fine to leave it in an inconsistent state. The same applies for GL(dl_tls_max_dtv_idx) and GL(dl_tls_generation), but there the algorithm relies on the fact that the read of the last synchronized write is an increasing value. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> The master branch has been updated by Szabolcs Nagy <nsz@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=572bd547d57a39b6cf0ea072545dc4048921f4c3 commit 572bd547d57a39b6cf0ea072545dc4048921f4c3 Author: Szabolcs Nagy <szabolcs.nagy@arm.com> Date: Thu Dec 31 13:59:38 2020 +0000 elf: Fix DTV gap reuse logic [BZ #27135] For some reason only dlopen failure caused dtv gaps to be reused. It is possible that the intent was to never reuse modids for a different module, but after dlopen failure all gaps are reused not just the ones caused by the unfinished dlopened. So the code has to handle reused modids already which seems to work, however the data races at thread creation and tls access (see bug 19329 and bug 27111) may be more severe if slots are reused so this is scheduled after those fixes. I think fixing the races are not simpler if reuse is disallowed and reuse has other benefits, so set GL(dl_tls_dtv_gaps) whenever entries are removed from the middle of the slotinfo list. The value does not have to be correct: incorrect true value causes the next modid query to do a slotinfo walk, incorrect false will leave gaps and new entries are added at the end. Fixes bug 27135. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> fixed for 2.34 (In reply to cvs-commit@gcc.gnu.org from comment #31) > The master branch has been updated by Szabolcs Nagy <nsz@sourceware.org>: > > https://sourceware.org/git/gitweb.cgi?p=glibc.git; > h=1387ad6225c2222f027790e3f460e31aa5dd2c54 > > commit 1387ad6225c2222f027790e3f460e31aa5dd2c54 > Author: Szabolcs Nagy <szabolcs.nagy@arm.com> > Date: Wed Dec 30 19:19:37 2020 +0000 > > elf: Fix data races in pthread_create and TLS access [BZ #19329] > > DTV setup at thread creation (_dl_allocate_tls_init) is changed > to take the dlopen lock, GL(dl_load_lock). Avoiding data races > here without locks would require design changes: the map that is > accessed for static TLS initialization here may be concurrently > freed by dlclose. That use after free may be solved by only > locking around static TLS setup or by ensuring dlclose does not > free modules with static TLS, however currently every link map > with TLS has to be accessed at least to see if it needs static > TLS. And even if that's solved, still a lot of atomics would be > needed to synchronize DTV related globals without a lock. So fix > both bug 19329 and bug 27111 with a lock that prevents DTV setup > running concurrently with dlopen or dlclose. > > _dl_update_slotinfo at TLS access still does not use any locks > so CONCURRENCY NOTES are added to explain the synchronization. > The early exit from the slotinfo walk when max_modid is reached > is not strictly necessary, but does not hurt either. > > An incorrect acquire load was removed from _dl_resize_dtv: it > did not synchronize with any release store or fence and > synchronization is now handled separately at thread creation > and TLS access time. > > There are still a number of racy read accesses to globals that > will be changed to relaxed MO atomics in a followup patch. This > should not introduce regressions compared to existing behaviour > and avoid cluttering the main part of the fix. > > Not all TLS access related data races got fixed here: there are > additional races at lazy tlsdesc relocations see bug 27137. > > Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> this patch use dl_load_lock in _dl_allocate_tls_init, is there a problem when dlopen a dynamic library which will call pthread_create? I think it will cause dl_load_lock and dl_load_lock dead lock. (In reply to xujing from comment #35) > (In reply to cvs-commit@gcc.gnu.org from comment #31) > > The master branch has been updated by Szabolcs Nagy <nsz@sourceware.org>: > > > > https://sourceware.org/git/gitweb.cgi?p=glibc.git; > > h=1387ad6225c2222f027790e3f460e31aa5dd2c54 > > > > commit 1387ad6225c2222f027790e3f460e31aa5dd2c54 > > Author: Szabolcs Nagy <szabolcs.nagy@arm.com> > > Date: Wed Dec 30 19:19:37 2020 +0000 > > > > elf: Fix data races in pthread_create and TLS access [BZ #19329] > > > > DTV setup at thread creation (_dl_allocate_tls_init) is changed > > to take the dlopen lock, GL(dl_load_lock). Avoiding data races > > here without locks would require design changes: the map that is > > accessed for static TLS initialization here may be concurrently > > freed by dlclose. That use after free may be solved by only > > locking around static TLS setup or by ensuring dlclose does not > > free modules with static TLS, however currently every link map > > with TLS has to be accessed at least to see if it needs static > > TLS. And even if that's solved, still a lot of atomics would be > > needed to synchronize DTV related globals without a lock. So fix > > both bug 19329 and bug 27111 with a lock that prevents DTV setup > > running concurrently with dlopen or dlclose. > > > > _dl_update_slotinfo at TLS access still does not use any locks > > so CONCURRENCY NOTES are added to explain the synchronization. > > The early exit from the slotinfo walk when max_modid is reached > > is not strictly necessary, but does not hurt either. > > > > An incorrect acquire load was removed from _dl_resize_dtv: it > > did not synchronize with any release store or fence and > > synchronization is now handled separately at thread creation > > and TLS access time. > > > > There are still a number of racy read accesses to globals that > > will be changed to relaxed MO atomics in a followup patch. This > > should not introduce regressions compared to existing behaviour > > and avoid cluttering the main part of the fix. > > > > Not all TLS access related data races got fixed here: there are > > additional races at lazy tlsdesc relocations see bug 27137. > > > > Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> > > this patch use dl_load_lock in _dl_allocate_tls_init, is there a problem > when dlopen a dynamic library which will call pthread_create? I think it > will cause dl_load_lock and dl_load_lock dead lock. dlopen will hold on dl_load_lock, and pthread_create will call _dl_allocate_tls_init and then will hold on dl_load_lock. Finally, it will cause dead lock. (In reply to cvs-commit@gcc.gnu.org from comment #31) > The master branch has been updated by Szabolcs Nagy <nsz@sourceware.org>: > > https://sourceware.org/git/gitweb.cgi?p=glibc.git; > h=1387ad6225c2222f027790e3f460e31aa5dd2c54 > > commit 1387ad6225c2222f027790e3f460e31aa5dd2c54 > Author: Szabolcs Nagy <szabolcs.nagy@arm.com> > Date: Wed Dec 30 19:19:37 2020 +0000 > > elf: Fix data races in pthread_create and TLS access [BZ #19329] > > DTV setup at thread creation (_dl_allocate_tls_init) is changed > to take the dlopen lock, GL(dl_load_lock). Avoiding data races > here without locks would require design changes: the map that is > accessed for static TLS initialization here may be concurrently > freed by dlclose. That use after free may be solved by only > locking around static TLS setup or by ensuring dlclose does not > free modules with static TLS, however currently every link map > with TLS has to be accessed at least to see if it needs static > TLS. And even if that's solved, still a lot of atomics would be > needed to synchronize DTV related globals without a lock. So fix > both bug 19329 and bug 27111 with a lock that prevents DTV setup > running concurrently with dlopen or dlclose. > > _dl_update_slotinfo at TLS access still does not use any locks > so CONCURRENCY NOTES are added to explain the synchronization. > The early exit from the slotinfo walk when max_modid is reached > is not strictly necessary, but does not hurt either. > > An incorrect acquire load was removed from _dl_resize_dtv: it > did not synchronize with any release store or fence and > synchronization is now handled separately at thread creation > and TLS access time. > > There are still a number of racy read accesses to globals that > will be changed to relaxed MO atomics in a followup patch. This > should not introduce regressions compared to existing behaviour > and avoid cluttering the main part of the fix. > > Not all TLS access related data races got fixed here: there are > additional races at lazy tlsdesc relocations see bug 27137. > > Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> Hi! I think this commit may cause an ABBA deadlock problem which i mentioned in https://sourceware.org/bugzilla/show_bug.cgi?id=28331. (In reply to xujing from comment #35) > (In reply to cvs-commit@gcc.gnu.org from comment #31) > > commit 1387ad6225c2222f027790e3f460e31aa5dd2c54 > > Author: Szabolcs Nagy <szabolcs.nagy@arm.com> > > Date: Wed Dec 30 19:19:37 2020 +0000 > > > > elf: Fix data races in pthread_create and TLS access [BZ #19329] > > > this patch use dl_load_lock in _dl_allocate_tls_init, is there a problem > when dlopen a dynamic library which will call pthread_create? I think it > will cause dl_load_lock and dl_load_lock dead lock. the real bug is that ctors are run with the dlopen lock held. that can causes deadlocks anyway (a ctor can create threads and that thread can call dlopen). this is bug 15686 which is not easy to fix, but that's the right solution. (in general, running user callbacks while libc internal locks are held is wrong.) that bug is now more exposed because the lock is also taken at _dl_allocate_tls_init during thread creation. however i expect that to be called in the parent thread only, so there should be no deadlock when ctor calls pthread_create, only when the child thread calls it again (which i considered rare). if you have example code that you think should work but now deadlocks, then please report it. (In reply to Szabolcs Nagy from comment #38) > (In reply to xujing from comment #35) > > (In reply to cvs-commit@gcc.gnu.org from comment #31) > > > commit 1387ad6225c2222f027790e3f460e31aa5dd2c54 > > > Author: Szabolcs Nagy <szabolcs.nagy@arm.com> > > > Date: Wed Dec 30 19:19:37 2020 +0000 > > > > > > elf: Fix data races in pthread_create and TLS access [BZ #19329] > > > > > this patch use dl_load_lock in _dl_allocate_tls_init, is there a problem > > when dlopen a dynamic library which will call pthread_create? I think it > > will cause dl_load_lock and dl_load_lock dead lock. > > the real bug is that ctors are run with the dlopen lock held. > that can causes deadlocks anyway (a ctor can create threads > and that thread can call dlopen). this is bug 15686 which is not > easy to fix, but that's the right solution. (in general, running > user callbacks while libc internal locks are held is wrong.) > > that bug is now more exposed because the lock is also taken > at _dl_allocate_tls_init during thread creation. however i > expect that to be called in the parent thread only, so there > should be no deadlock when ctor calls pthread_create, only > when the child thread calls it again (which i considered rare). > > if you have example code that you think should work but now > deadlocks, then please report it. I'm sorry, I misled you. I think there is an ABBA deadlock issue in some scenarios. If I have a c++ dynamic library(named libA.so) that contains a global object, the global object will call the post-constructor at initialization and hold it's own lock(named A_lock) when dlopen loads libA.so. Assume that two threads execute the following process: Thread1:dlopen(libA.so) => hold dl_load_lock => load libA.so => init global object from libA.so => wait for hold A_lock Thread2:my own code hold A_lock => pthread_create => _dl_allocate_tls_init => wait for hold dl_load_lock In this case, an ABBA deadlock occurs. Is this a bug? My stack looks like this: Thread 1 (LWP 136013): #0 0x00007f57a108510d in ?? () from /usr/lib64/libpthread.so.0 #1 0x00007f57a107e4d1 in pthread_mutex_lock () from /usr/lib64/libpthread.so.0 #1 stack waiting for holding A_lock ... #6 0x00007f5781c1bb8b in LogProcess::Init (strProcName=..., nProcHandle=nProcHandle@entry=0) at ./service/biz_frame/code/server/src/logging/logprocess.cpp:107 ... #20 0x00007f57a0fef21f in _dl_catch_exception () from /usr/lib64/libc.so.6 #21 0x00007f57a786442b in ?? () from /lib64/ld-linux-x86-64.so.2 #22 0x00007f57a3de2296 in ?? () from /usr/lib64/libdl.so.2 #23 0x00007f57a0fef21f in _dl_catch_exception () from /usr/lib64/libc.so.6 #24 0x00007f57a0fef2af in _dl_catch_error () from /usr/lib64/libc.so.6 #25 0x00007f57a3de2985 in ?? () from /usr/lib64/libdl.so.2 #26 0x00007f57a3de2351 in dlopen () from /usr/lib64/libdl.so.2 ... ... #38 0x00007f57a0fb3520 in clone () from /usr/lib64/libc.so.6 Thread 2 (LWP 134627): #0 0x00007f57a108510d in ?? () from /usr/lib64/libpthread.so.0 #1 0x00007f57a107e580 in pthread_mutex_lock () from /usr/lib64/libpthread.so.0 #2 0x00007f57a7863835 in _dl_allocate_tls_init () from /lib64/ld-linux-x86-64.so.2 #3 0x00007f57a107cb7c in pthread_create () from /usr/lib64/libpthread.so.0 ... #10 Stack holding A_lock ... #14 0x0000561689e0d579 in main () (In reply to xujing from comment #39) > I'm sorry, I misled you. I think there is an ABBA deadlock issue in some > scenarios. > > If I have a c++ dynamic library(named libA.so) that contains a global > object, the global object will call the post-constructor at initialization > and hold it's own lock(named A_lock) when dlopen loads libA.so. Assume that > two threads execute the following process: > Thread1:dlopen(libA.so) => hold dl_load_lock => load libA.so => init > global > object from libA.so => wait for hold A_lock > Thread2:my own code hold A_lock => pthread_create => > _dl_allocate_tls_init > => wait for hold dl_load_lock > In this case, an ABBA deadlock occurs. Is this a bug? yes i think this should work (it is a lock ordering issue between a user and libc internal lock, which is only possible if user code is run while a libc lock is held) note that if you replace pthread_create with dlopen that deadlocks too. so it's still bug 15686. but it may be more common than i expected. i think we need to look at fixing that bug. (fixing the dynamic tls race of this bug without locks in pthread_create is very hard, so i don't think we can revert the quoted patch) i'm trying to fix the ABBA deadlock by introducing a new lock into dlopen that protects tls state and gets released before init functions are run. using that lock at thread creation would fix the issue. this only requires a small amount of changes, but it seems to be difficult to ensure that the new lock is released on all failure paths within _dl_open_worker (which sometimes uses longjmp for error handling). i opened bug 28357 for the ABBA deadlock in pthread_create. The master branch has been updated by Szabolcs Nagy <nsz@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=83b5323261bb72313bffcf37476c1b8f0847c736 commit 83b5323261bb72313bffcf37476c1b8f0847c736 Author: Szabolcs Nagy <szabolcs.nagy@arm.com> Date: Wed Sep 15 15:16:19 2021 +0100 elf: Avoid deadlock between pthread_create and ctors [BZ #28357] The fix for bug 19329 caused a regression such that pthread_create can deadlock when concurrent ctors from dlopen are waiting for it to finish. Use a new GL(dl_load_tls_lock) in pthread_create that is not taken around ctors in dlopen. The new lock is also used in __tls_get_addr instead of GL(dl_load_lock). The new lock is held in _dl_open_worker and _dl_close_worker around most of the logic before/after the init/fini routines. When init/fini routines are running then TLS is in a consistent, usable state. In _dl_open_worker the new lock requires catching and reraising dlopen failures that happen in the critical section. The new lock is reinitialized in a fork child, to keep the existing behaviour and it is kept recursive in case malloc interposition or TLS access from signal handlers can retake it. It is not obvious if this is necessary or helps, but avoids changing the preexisting behaviour. The new lock may be more appropriate for dl_iterate_phdr too than GL(dl_load_write_lock), since TLS state of an incompletely loaded module may be accessed. If the new lock can replace the old one, that can be a separate change. Fixes bug 28357. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> The release/2.34/master branch has been updated by Florian Weimer <fw@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=024a7640ab9ecea80e527f4e4d7f7a1868e952c5 commit 024a7640ab9ecea80e527f4e4d7f7a1868e952c5 Author: Szabolcs Nagy <szabolcs.nagy@arm.com> Date: Wed Sep 15 15:16:19 2021 +0100 elf: Avoid deadlock between pthread_create and ctors [BZ #28357] The fix for bug 19329 caused a regression such that pthread_create can deadlock when concurrent ctors from dlopen are waiting for it to finish. Use a new GL(dl_load_tls_lock) in pthread_create that is not taken around ctors in dlopen. The new lock is also used in __tls_get_addr instead of GL(dl_load_lock). The new lock is held in _dl_open_worker and _dl_close_worker around most of the logic before/after the init/fini routines. When init/fini routines are running then TLS is in a consistent, usable state. In _dl_open_worker the new lock requires catching and reraising dlopen failures that happen in the critical section. The new lock is reinitialized in a fork child, to keep the existing behaviour and it is kept recursive in case malloc interposition or TLS access from signal handlers can retake it. It is not obvious if this is necessary or helps, but avoids changing the preexisting behaviour. The new lock may be more appropriate for dl_iterate_phdr too than GL(dl_load_write_lock), since TLS state of an incompletely loaded module may be accessed. If the new lock can replace the old one, that can be a separate change. Fixes bug 28357. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit 83b5323261bb72313bffcf37476c1b8f0847c736) Hello, it seems this bugfix has reached Ubuntu 22.04, which is nice. However some of my users use earlier Ubuntu versions (18.04 and 20.04) where it is not available. How can I apply the fix on those versions? I tried checking out the glibc version available in Ubuntu 18.04 (2.27) and cherry-picking the fixes. But I get a conflict I don't know how to solve because the base files are very different. (In reply to Louis Benazet from comment #45) > Hello, it seems this bugfix has reached Ubuntu 22.04, which is nice. However > some of my users use earlier Ubuntu versions (18.04 and 20.04) where it is > not available. How can I apply the fix on those versions? I tried checking > out the glibc version available in Ubuntu 18.04 (2.27) and cherry-picking > the fixes. But I get a conflict I don't know how to solve because the base > files are very different. Perhaps you could ask Ubuntu to backport the fix. They are quite old releases, but for LTS they might consider your request. If you need something today, it's simpler just to upgrade. |