Bug 19329 - dl-tls.c assert failure at concurrent pthread_create and dlopen
Summary: dl-tls.c assert failure at concurrent pthread_create and dlopen
Status: NEW
Alias: None
Product: glibc
Classification: Unclassified
Component: dynamic-link (show other bugs)
Version: 2.22
: P2 normal
Target Milestone: ---
Assignee: Szabolcs Nagy
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-12-04 12:37 UTC by Szabolcs Nagy
Modified: 2019-08-21 07:04 UTC (History)
15 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
test case (main module) (237 bytes, text/x-csrc)
2016-01-08 18:19 UTC, Szabolcs Nagy
Details
test case (build script) (321 bytes, application/x-shellscript)
2016-01-08 18:20 UTC, Szabolcs Nagy
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Szabolcs Nagy 2015-12-04 12:37:03 UTC
(this is a continuation of bug 17918, but it turns out to be a different
issue that was originally reported there.)

failure:

Inconsistency detected by ld.so: dl-tls.c: 493: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= _rtld_local._dl_tls_generation' failed!

caused by dlopen (in _dl_add_to_slotinfo and in dl_open_worker) doing

  listp->slotinfo[idx].gen = GL(dl_tls_generation) + 1;
  //...
  if (any_tls && __builtin_expect (++GL(dl_tls_generation) == 0, 0))

while pthread_create (in _dl_allocate_tls_init) concurrently doing

  assert (listp->slotinfo[cnt].gen <= GL(dl_tls_generation));

so

T1:
  y = x + 1;
  ++x;

T2:
  assert(y <= x);

this is hard to trigger as the race window is short compared to the time
dlopen and pthread_create takes, however if i add a usleep(1000) between
the two operations in T1, it is triggered all the time.

the slotinfo and tls generation update lack any sort of synchronization or atomics in _dl_allocate_tls_init (dlopen holds GL(dl_load_lock)).

on x86_64 with added usleep:

(gdb) p _rtld_local._dl_tls_dtv_slotinfo_list->slotinfo[0]@64
$11 = {{gen = 0, map = 0x7ffff7ff94e8}, {gen = 1, map = 0x7ffff7ff94e8}, {gen = 2, map = 0x7ffff0000910}, {gen = 0, map = 0x0} <repeats 61 times>}
(gdb) p _rtld_local._dl_tls_generation
$12 = 1

T1:
#0  0x00007ffff7df2097 in nanosleep () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007ffff7df1f74 in usleep (useconds=<optimised out>) at ../sysdeps/posix/usleep.c:32
#2  0x00007ffff7decc6b in dl_open_worker (a=a@entry=0x7ffff7611c80) at dl-open.c:527
#3  0x00007ffff7de8314 in _dl_catch_error (objname=objname@entry=0x7ffff7611c70, errstring=errstring@entry=0x7ffff7611c78, mallocedp=mallocedp@entry=0x7ffff7611c6f, 
    operate=operate@entry=0x7ffff7dec720 <dl_open_worker>, args=args@entry=0x7ffff7611c80) at dl-error.c:187
#4  0x00007ffff7dec2a9 in _dl_open (file=0x7ffff7611ee0 "mod-0.so", mode=-2147483646, caller_dlopen=0x4007e2 <start+34>, nsid=-2, argc=<optimised out>, 
    argv=<optimised out>, env=0x7fffffffe378) at dl-open.c:652
#5  0x00007ffff7bd5ee9 in dlopen_doit (a=a@entry=0x7ffff7611eb0) at dlopen.c:66
#6  0x00007ffff7de8314 in _dl_catch_error (objname=0x7ffff00008d0, errstring=0x7ffff00008d8, mallocedp=0x7ffff00008c8, operate=0x7ffff7bd5e90 <dlopen_doit>, 
    args=0x7ffff7611eb0) at dl-error.c:187
#7  0x00007ffff7bd6521 in _dlerror_run (operate=operate@entry=0x7ffff7bd5e90 <dlopen_doit>, args=args@entry=0x7ffff7611eb0) at dlerror.c:163
#8  0x00007ffff7bd5f82 in __dlopen (file=file@entry=0x7ffff7611ee0 "mod-0.so", mode=mode@entry=2) at dlopen.c:87
#9  0x00000000004007e2 in start (a=<optimised out>) at a.c:19
#10 0x00007ffff79bf3d4 in start_thread (arg=0x7ffff7612700) at pthread_create.c:333
#11 0x00007ffff76feedd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

T2:
#0  __GI___assert_fail (assertion=0x7ffff7df8840 "listp->slotinfo[cnt].gen <= GL(dl_tls_generation)", file=0x7ffff7df68e6 "dl-tls.c", line=493, 
    function=0x7ffff7df9020 <__PRETTY_FUNCTION__.9528> "_dl_allocate_tls_init") at dl-minimal.c:220
#1  0x00007ffff7deb492 in __GI__dl_allocate_tls_init (result=0x7fffb7fff700) at dl-tls.c:493
#2  0x00007ffff79bff67 in allocate_stack (stack=<synthetic pointer>, pdp=<synthetic pointer>, attr=0x7fffffffdf90) at allocatestack.c:579
#3  __pthread_create_2_1 (newthread=newthread@entry=0x7fffffffe078, attr=attr@entry=0x0, start_routine=start_routine@entry=0x4007c0 <start>, arg=arg@entry=0xd)
    at pthread_create.c:526
#4  0x000000000040062a in main () at a.c:34


i think
  GL(dl_tls_generation)
  GL(dl_tls_dtv_slotinfo_list)
  listp->slotinfo[i].map
  listp->slotinfo[i].gen
  listp->next
  
may all be accessed concurrently by pthread_create and dlopen without
any synchronization.

this can also cause wrong maxgen computation into dtv[0].counter
Comment 1 Ilya Palachev 2015-12-29 10:51:48 UTC
Hi, I've suggested a patch for this bug:
https://sourceware.org/ml/libc-alpha/2015-12/msg00570.html
Comment 2 Szabolcs Nagy 2016-01-08 18:19:09 UTC
Created attachment 8893 [details]
test case (main module)
Comment 3 Szabolcs Nagy 2016-01-08 18:20:10 UTC
Created attachment 8894 [details]
test case (build script)
Comment 4 Szabolcs Nagy 2016-02-02 11:14:58 UTC
assigned this to myself, will work on it for 2.24, the current latest patch is
https://sourceware.org/ml/libc-alpha/2016-01/msg00480.html
Comment 5 Matt Avant 2016-08-05 20:40:10 UTC
Is this patch still being reviewed? The last update I see is https://sourceware.org/ml/libc-alpha/2016-01/msg00620.html, but I'm not familiar with how issue tracking works for this project so I could easily have missed something...
Comment 6 Markus Trippelsdorf 2017-06-15 05:45:43 UTC
I sometimes see the same failure during make check:

env GCONV_PATH=/var/tmp/glibc-build/iconvdata LOCPATH=/var/tmp/glibc-build/localedata LC_ALL=C   /var/tmp/glibc-build/elf/ld-linux-x86-64.so.2 --library-path /var/tmp/glibc-build:/var/tmp/glibc-build/math:/var/tmp/glibc-build/elf:/var/tmp/glibc-build/dlfcn:/var/tmp/glibc-build/nss:/var/tmp/glibc-build/nis:/var/tmp/glibc-build/rt:/var/tmp/glibc-build/resolv:/var/tmp/glibc-build/crypt:/var/tmp/glibc-build/mathvec:/var/tmp/glibc-build/support:/var/tmp/glibc-build/nptl /var/tmp/glibc-build/nptl/tst-stack4  > /var/tmp/glibc-build/nptl/tst-stack4.out; \                                                                                                                                                                   
../scripts/evaluate-test.sh nptl/tst-stack4 $? false false > /var/tmp/glibc-build/nptl/tst-stack4.test-result                                                                      
Inconsistency detected by ld.so: dl-tls.c: 488: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed! 

This is unfortunately not reproducible.
Comment 7 Pádraig Brady 2017-08-28 21:24:52 UTC
We were often hitting this issue with some multithreaded binaries with many shared libs. These patches referenced here, address the issue. Specifically:
  https://patches.linaro.org/patch/85007/
  https://patches.linaro.org/patch/85008/

These have been _extensively_ tested here with glibc-2.23 with many binaries
Comment 8 Carlos O'Donell 2017-08-29 20:30:16 UTC
(In reply to Pádraig Brady from comment #7)
> We were often hitting this issue with some multithreaded binaries with many
> shared libs. These patches referenced here, address the issue. Specifically:
>   https://patches.linaro.org/patch/85007/
>   https://patches.linaro.org/patch/85008/
> 
> These have been _extensively_ tested here with glibc-2.23 with many binaries

Please repost those to libc-alpha so we can review.
Comment 9 Pádraig Brady 2017-09-29 23:54:54 UTC
We found an off by one issue with this (with ASAN + certain number of shared libs). When the last vector in the _dl_allocate_tls_init list of vectors was of size one it would have been skipped. The fix is:

diff --git a/elf/dl-tls.c b/elf/dl-tls.c
index 073321c..2c9ad2a 100644
--- a/elf/dl-tls.c
+++ b/elf/dl-tls.c
@@ -571,7 +571,7 @@ _dl_allocate_tls_init (void *result)
        }

       total += cnt;
-      if (total >= dtv_slots)
+      if (total > dtv_slots)
        break;

       /* Synchronize with dl_add_to_slotinfo.  */
Comment 10 Mike Mezeul 2018-01-16 18:20:03 UTC
Has there been any activity on this one lately? Does anyone know if a fix will be coming anytime soon?
Comment 11 Pádraig Brady 2018-01-17 15:41:50 UTC
This has been _very_ well tested at facebook
Note the additional fix in comment #9
It would be great to merge this. thanks!
Comment 12 Szabolcs Nagy 2018-01-17 16:40:58 UTC
(In reply to Pádraig Brady from comment #11)
> This has been _very_ well tested at facebook
> Note the additional fix in comment #9
> It would be great to merge this. thanks!

sorry i didnt have time to work on this in this release cycle, i'll try to look at it in the next one if others don't beat me to it (the comments can be improved, dtv_slots should be fixed so it has consistent meaning and one should reason about the consequences of removing the asserts, they might catch valid corruption that is still present via dlclose races that are not fixed).
Comment 13 Lukasz Koniecki 2019-02-27 18:43:08 UTC
(In reply to Szabolcs Nagy from comment #12)

> sorry i didnt have time to work on this in this release cycle, i'll try to
> look at it in the next one if others don't beat me to it (the comments can
> be improved, dtv_slots should be fixed so it has consistent meaning and one
> should reason about the consequences of removing the asserts, they might
> catch valid corruption that is still present via dlclose races that are not
> fixed).

Any update on this? It has been over a year since the last comment.
Comment 14 Mike Gulick 2019-05-24 19:14:35 UTC
Just want to add that the two patches posted here (and the off-by-one fix in the comments) have been running by my employer (MathWorks) on at least 1000 Debian 9 systems for the past 6 months without issue.  It would be great if these patches could be accepted into glibc itself.
Comment 15 Carlos O'Donell 2019-05-24 19:46:49 UTC
(In reply to Mike Gulick from comment #14)
> Just want to add that the two patches posted here (and the off-by-one fix in
> the comments) have been running by my employer (MathWorks) on at least 1000
> Debian 9 systems for the past 6 months without issue.  It would be great if
> these patches could be accepted into glibc itself.

None of these changes are easy to integrate because they fail to explain in detailed notes why they are correct. We take the conservative approach not to apply complete solutions. Someone seeing this problem has to take the position to champion the broader solution as positioned by Szabolcs from Arm. Alternatively someone needs to explain why the partial solution is better than no solution and champion that on libc-alpha.
Comment 16 Robert O'Callahan 2019-06-26 01:04:45 UTC
This testcase seems to reproduce the bug pretty reliably.

https://github.com/jrmuizel/dl-open-race
Comment 17 Robert O'Callahan 2019-06-26 03:03:44 UTC
Sorry, it actually does not. And I see there was already a testcase posted here. https://sourceware.org/ml/libc-alpha/2016-11/msg00917.html