Created attachment 9165 [details] Patch that fixes the TLS performance degradation after dlopen Hello, we have noticed a performance degradation of TLS access in shared libraries. If another shared library that uses TLS is loaded via dlopen, __tls_get_addr takes significant more time. Once that shared library accesses it's TLS, the performance normalizes. Attached is a minimal example which gives the following output: time: 0.723744 after dlopen, time: 1.789016 after tls access in loaded lib, time: 0.672865 We do have a use-case where this is actually really significant. I believe this happens for instance if libstdc++ is loaded implicitly, but TLS features are not actively used. I strongly suspect this is the same issue as discussed in this post on the µClibc mailing list: https://lists.osuosl.org/pipermail/uclibc/2009-December/043375.html and therefore the patch provided mainly reuses the solution they've found for µClibc. Main development platform information (tested on multiple platforms): * latest glibc git version (pulled at Fri Apr 8 10:50:19 CEST 2016) * x86_64-pc-linux-gnu (but should be independent) * Linux vishnu 4.2.0-23-generic #28-Ubuntu SMP Sun Dec 27 17:47:31 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux * gcc version 5.2.1 20151010 (Ubuntu 5.2.1-22ubuntu2) * GNU ld (GNU Binutils for Ubuntu) 2.25.1 The patch provided touches code already altered (but not released) by the work on bug 19329. Regards, Philipp Trommler.
Created attachment 9166 [details] tls_mini.c
Created attachment 9167 [details] tls_access.c
Created attachment 9168 [details] Makefile
Created attachment 9169 [details] dummy.c
After a month of waiting, aren't there any thoughts about this? Do you need more information? Are there any caveats preventing the inclusion of the patch?
(In reply to Philipp Trommler from comment #5) > After a month of waiting, aren't there any thoughts about this? Do you need > more information? Are there any caveats preventing the inclusion of the > patch? Hi, patch proposal should be discussed in libc-alpha for review.
*** Bug 27248 has been marked as a duplicate of this bug. ***
We encountered the same problem. After using this patch, the performance has improved, but we have seen that this case has not been updated after 2018 and has not been merged upstream. Can we continue to push forward?
We need someone to rebase the previous patch [1], test if it really fixes the issue, provide a testcase that reproduces the issue. The last submission [1] still have the same 'Tilera' common that Chris Metacalf has asked to be removed and the commit message did not describe the issue not what it is aiming to fix. Also, as Szabolcs has added this issues related to the BZ#19329 and he suggested first track this down to fix the correctness of this issue before optimizing it. [1] https://sourceware.org/pipermail/libc-alpha/2018-January/090810.html
It is difficult to reproduce the problem of providing test cases. Besides, how can we push this problem forward? Can you give some suggestions and advance it further? For "rebase the previous patch",From the historical information, I think we can continue to rely on "Philipp Trommler".
Created attachment 13253 [details] tls performace Test step: I clone newest version for glibc and compile it. export glibc_install="$(pwd)/glibc/build/install" git clone git://sourceware.org/git/glibc.git cd glibc mkdir build cd build ../configure --prefix "$glibc_install" make Download test case and compile it. Generate tls_mini. $LANG=C /home/ksong/work/package/glibc/build/elf/ld-linux-x86-64.so.2 --library-path /home/ksong/work/test/tls:/home/ksong/work/package/glibc/build:/home/ksong/work/package/glibc/build/math:/home/ksong/work/package/glibc/build/elf:/home/ksong/work/package/glibc/build/dlfcn:/home/ksong/work/package/glibc/build/nss:/home/ksong/work/package/glibc/build/nis:/home/ksong/work/package/glibc/build/rt:/home/ksong/work/package/glibc/build/resolv:/home/ksong/work/package/glibc/build/crypt:/home/ksong/work/package/glibc/build/mathvec:/home/ksong/work/package/glibc/build/nptl:/lib64 --list ./tls_mini linux-vdso.so.1 (0x00007ffdf37ce000) libtls_access.so => /home/ksong/work/test/tls/libtls_access.so (0x00007f0d812ff000) libdl.so.2 => /home/ksong/work/package/glibc/build/dlfcn/libdl.so.2 (0x00007f0d810fb000) libc.so.6 => /home/ksong/work/package/glibc/build/libc.so.6 (0x00007f0d80d3b000) /lib64/ld-linux-x86-64.so.2 => /home/ksong/work/package/glibc/build/elf/ld-linux-x86-64.so.2 (0x00007f0d81704000) Exec tls_mini. $LANG=C /home/ksong/work/package/glibc/build/elf/ld-linux-x86-64.so.2 --library-path /home/ksong/work/test/tls:/home/ksong/work/package/glibc/build:/home/ksong/work/package/glibc/build/math:/home/ksong/work/package/glibc/build/elf:/home/ksong/work/package/glibc/build/dlfcn:/home/ksong/work/package/glibc/build/nss:/home/ksong/work/package/glibc/build/nis:/home/ksong/work/package/glibc/build/rt:/home/ksong/work/package/glibc/build/resolv:/home/ksong/work/package/glibc/build/crypt:/home/ksong/work/package/glibc/build/mathvec:/home/ksong/work/package/glibc/build/nptl:/lib64 ./tls_mini time: 0.510322 time: 1.468314 time: 0.501844 after "git am --reject 0001-Fix-TLS-performance-degradation-after-dlopen.patch" and recompile glibc. Exec tls_mini $LANG=C /home/ksong/work/package/glibc/build/elf/ld-linux-x86-64.so.2 --library-path /home/ksong/work/test/tls:/home/ksong/work/package/glibc/build:/home/ksong/work/package/glibc/build/math:/home/ksong/work/package/glibc/build/elf:/home/ksong/work/package/glibc/build/dlfcn:/home/ksong/work/package/glibc/build/nss:/home/ksong/work/package/glibc/build/nis:/home/ksong/work/package/glibc/build/rt:/home/ksong/work/package/glibc/build/resolv:/home/ksong/work/package/glibc/build/crypt:/home/ksong/work/package/glibc/build/mathvec:/home/ksong/work/package/glibc/build/nptl:/lib64 ./tls_mini time: 0.585535 time: 0.555755 time: 0.554991 We can see a big change in the performance of dlopen.
(In reply to ksong from comment #11) > We can see a big change in the performance of dlopen. It would be great if you could test Szabolcs series here and see if it has a positive impact for you: https://sourceware.org/pipermail/libc-alpha/2021-February/122626.html My suspicion is that this bug is the same as 19329 or other related TLS races.
Refer https://sourceware.org/pipermail/libc-alpha/2021-February/122626.html https://patchwork.ozlabs.org/project/glibc/list/?submitter=65667&page=4 Try backport patch form Szabolcs Nagy. https://patchwork.ozlabs.org/project/glibc/patch/583EBBA0.9060709@arm.com/ https://patchwork.ozlabs.org/project/glibc/patch/583EBB99.8040101@arm.com/ Rebuild glibc. Exec tls_mini test case. [ctu-ksong-d1:tls]$LANG=C /home/ksong/work/package/glibc/build/elf/ld-linux-x86-64.so.2 --library-path /home/ksong/work/test/tls:/home/ksong/work/package/glibc/build:/home/ksong/work/package/glibc/build/math:/home/ksong/work/package/glibc/build/elf:/home/ksong/work/package/glibc/build/dlfcn:/home/ksong/work/package/glibc/build/nss:/home/ksong/work/package/glibc/build/nis:/home/ksong/work/package/glibc/build/rt:/home/ksong/work/package/glibc/build/resolv:/home/ksong/work/package/glibc/build/crypt:/home/ksong/work/package/glibc/build/mathvec:/home/ksong/work/package/glibc/build/nptl:/lib64 ./tls_mini time: 0.583575 time: 1.244377 time: 0.553654 It has some performance improvements, but not as obvious as the previous solution.
Created attachment 13263 [details] 0001-This-fixes-a-subset-of-the-issues-described-in.patch
Created attachment 13264 [details] 0002-This-patch-is-not-necessary-for-the-bug-fix-just-mak.patch
Refer http://patches-tcwg.linaro.org/patch/49142/ I also try backport these patch. [01/15] aarch64: free tlsdesc data on dlclose [BZ #27403] [02/15] elf: Fix data race in _dl_name_match_p [BZ #21349] [03/15] Add test case for [BZ #19329] [04/15] Add a DTV setup test [BZ #27136] [05/15] elf: Fix a DTV setup issue [BZ #27136] [06/15] elf: Fix comments and logic in _dl_add_to_slotinfo [07/15] elf: Refactor _dl_update_slotinfo to avoid use after free [08/15] elf: Fix data races in pthread_create and TLS access [BZ #19329] [09/15] elf: Use relaxed atomics for racy accesses [BZ #19329] [10/15] elf: Fix DTV gap reuse logic [BZ #27135] and try to rebuild glibc, but for our testcase tls_mini, Performance has not improved. For below patch, [11/15] x86_64: Avoid lazy relocation of tlsdesc [BZ #27137] [12/15] i386: Avoid lazy relocation of tlsdesc [BZ #27137] [13/15] x86_64: Remove lazy tlsdesc relocation related code [14/15] i386: Remove lazy tlsdesc relocation related code [15/15] elf: Remove lazy tlsdesc relocation related code I backported them,but compile failed. In file included from dynamic-link.h:92:0, from dl-load.c:60: ../sysdeps/x86_64/dl-machine.h: In function ‘elf_machine_runtime_setup’: ../sysdeps/x86_64/dl-machine.h:132:23: error: ‘_dl_tlsdesc_resolve_rela’ undeclared (first use in this function); did you mean ‘_dl_tlsdesc_return’? = (ElfW(Addr)) &_dl_tlsdesc_resolve_rela; ^~~~~~~~~~~~~~~~~~~~~~~~ _dl_tlsdesc_return ../sysdeps/x86_64/dl-machine.h:132:23: note: each undeclared identifier is reported only once for each function it appears in
hi,@all For the issue, any update? Thanks!
(In reply to ksong from comment #17) > hi,@all > For the issue, any update? > Thanks! None made yet. The bigger issue right now is probably reviewing Szabolc Nagy's (Arm) work to remote data races from the early TLS access in the loader. I feel like that work is a pre-requisite for even looking at this. e.g. https://patchwork.sourceware.org/project/glibc/list/?series=1673
Refer https://sourceware.org/bugzilla/show_bug.cgi?id=19329 I have seen some related modifications have been merged. But based on the latest version, performance problems still have not improved, some related code is not merged. Please refer test result: LANG=C /home/ksong/work/package/glibc/build/elf/ld-linux-x86-64.so.2 --library-path /home/ksong/work/test/tls:/home/ksong/work/package/glibc/build:/home/ksong/work/package/glibc/build/math:/home/ksong/work/package/glibc/build/elf:/home/ksong/work/package/glibc/build/dlfcn:/home/ksong/work/package/glibc/build/nss:/home/ksong/work/package/glibc/build/nis:/home/ksong/work/package/glibc/build/rt:/home/ksong/work/package/glibc/build/resolv:/home/ksong/work/package/glibc/build/crypt:/home/ksong/work/package/glibc/build/mathvec:/home/ksong/work/package/glibc/build/nptl:/lib64 ./tls_mini time: 0.509420 time: 1.406026 time: 0.493889 I want to know what is next. Thanks!
Created attachment 13457 [details] Fix-TLS-performance-degradation-after-dlopen
Based on the latest version, I made a new patch And did the same test, and its performance is obviously better. LANG=C /home/ksong/work/package/glibc/build/elf/ld-linux-x86-64.so./ksong/work/test/tls:/home/ksong/work/package/glibc/build:/home/ksong/work/package/glibc/build/math:/home/ksong/wd/elf:/home/ksong/work/package/glibc/build/dlfcn:/home/ksong/work/package/glibc/build/nss:/home/ksong/work/package/ksong/work/package/glibc/build/rt:/home/ksong/work/package/glibc/build/resolv:/home/ksong/work/package/glibc/buwork/package/glibc/build/mathvec:/home/ksong/work/package/glibc/build/nptl:/lib64 ./tls_mini time: 0.583998 time: 0.560171 time: 0.553203
The recent dynamic TLS fixes pushed upstream did not try to fix this issue. Szabolcs send a RFC fix [1], and it seems to fix this issue. [1] https://patchwork.sourceware.org/project/glibc/patch/b116855de71098ef7dd2875dd3237f8f3ecc12c2.1618301209.git.szabolcs.nagy@arm.com/
Thanks I made new patch base newest version and do same test. https://patchwork.sourceware.org/project/glibc/patch/b116855de71098ef7dd2875dd3237f8f3ecc12c2.1618301209.git.szabolcs.nagy@arm.com/. LANG=C /home/ksong/work/package/glibc/build/elf/ld-linux-x86-64.so.2 --library-path /home/ksong/work/test/tls:/home/ksong/work/package/glibc/build:/home/ksong/work/package/glibc/build/math:/home/ksong/work/package/glibc/build/elf:/home/ksong/work/package/glibc/build/dlfcn:/home/ksong/work/package/glibc/build/nss:/home/ksong/work/package/glibc/build/nis:/home/ksong/work/package/glibc/build/rt:/home/ksong/work/package/glibc/build/resolv:/home/ksong/work/package/glibc/build/crypt:/home/ksong/work/package/glibc/build/mathvec:/home/ksong/work/package/glibc/build/nptl:/lib64 ./tls_mini time: 0.551995 time: 0.545402 time: 0.544759 The performance has been improved.
For the patch. https://patchwork.sourceware.org/project/glibc/patch/b116855de71098ef7dd2875dd3237f8f3ecc12c2.1618301209.git.szabolcs.nagy@arm.com/ When can it be merged into the main branch? Thanks!
> The recent dynamic TLS fixes pushed upstream did not try to fix this issue. > Szabolcs send a RFC fix [1], and it seems to fix this issue. > https://patchwork.sourceware.org/project/glibc/patch/ > b116855de71098ef7dd2875dd3237f8f3ecc12c2.1618301209.git.szabolcs.nagy@arm. > com/ Hi, Can anyone please update regarding the above mentioned patch? Will it be merged upstream as its fixing the issue or should it be further modified? Thanks, pgowda
my advice is to send the patch to the libc-alpha list. This is the usual way patches are proposed and reviewed: https://sourceware.org/mailman/listinfo/libc-alpha Please also include a small example demonstrating why the patch improves things.
affects sage math too https://trac.sagemath.org/ticket/34850
The master branch has been updated by Szabolcs Nagy <nsz@sourceware.org>: https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d2123d68275acc0f061e73d5f86ca504e0d5a344 commit d2123d68275acc0f061e73d5f86ca504e0d5a344 Author: Szabolcs Nagy <szabolcs.nagy@arm.com> Date: Tue Feb 16 12:55:13 2021 +0000 elf: Fix slow tls access after dlopen [BZ #19924] In short: __tls_get_addr checks the global generation counter and if the current dtv is older then _dl_update_slotinfo updates dtv up to the generation of the accessed module. So if the global generation is newer than generation of the module then __tls_get_addr keeps hitting the slow dtv update path. The dtv update path includes a number of checks to see if any update is needed and this already causes measurable tls access slow down after dlopen. It may be possible to detect up-to-date dtv faster. But if there are many modules loaded (> TLS_SLOTINFO_SURPLUS) then this requires at least walking the slotinfo list. This patch tries to update the dtv to the global generation instead, so after a dlopen the tls access slow path is only hit once. The modules with larger generation than the accessed one were not necessarily synchronized before, so additional synchronization is needed. This patch uses acquire/release synchronization when accessing the generation counter. Note: in the x86_64 version of dl-tls.c the generation is only loaded once, since relaxed mo is not faster than acquire mo load. I have not benchmarked this. Tested by Adhemerval Zanella on aarch64, powerpc, sparc, x86 who reported that it fixes the performance issue of bug 19924. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Fixed for glibc 2.39.