This is the mail archive of the
mailing list for the glibc project.
[RFC][BZ #19329] high level description of pthread_create vs dlopen races
- From: Szabolcs Nagy <szabolcs dot nagy at arm dot com>
- To: GNU C Library <libc-alpha at sourceware dot org>
- Cc: <nd at arm dot com>, Torvald Riegel <triegel at redhat dot com>
- Date: Mon, 28 Nov 2016 19:03:18 +0000
- Subject: [RFC][BZ #19329] high level description of pthread_create vs dlopen races
- Authentication-results: sourceware.org; auth=none
- Authentication-results: spf=none (sender IP is ) smtp.mailfrom=Szabolcs dot Nagy at arm dot com;
- Nodisclaimer: True
- Spamdiagnosticmetadata: NSPM
- Spamdiagnosticoutput: 1:99
Before sending an updated version of
i try to provide high level description of the issues (that can go
into the concurrency notes of the patch) with the hope that agreeing
on the design will allow less patch review iterations.
Dynamic linker internal data structures for thread local storage may
be accessed concurrently causing assertion failures in dl-tls.c when
dlopen loads modules that have tls concurrently with pthread_create.
Assertions that may trigger:
Inconsistency detected by ld.so: dl-tls.c: 520: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <=
Inconsistency detected by ld.so: dl-tls.c: 556: _dl_allocate_tls_init: Assertion `listp != NULL' failed!
Related recent discussions:
TLS performance degradation after dlopen:
Test case patch:
Current behaviour (minor details omitted):
dlopen (operations in program order):
Load a module and its dependencies, assign a modid to each
while incrementing GL(dl_tls_max_dtv_idx) (this is the dtv
and slotinfo index of the module).
Add each module to GL(dl_tls_dtv_slotinfo_list) using
_dl_add_to_slotinfo. Assign the next generation count to
these slotinfo entries.
Increment the GL(dl_tls_generation) counter.
Update the dtv of the current thread following the slotinfo
changes by calling _dl_update_slotinfo (only resizes dtv and
fills it with unallocated slots, actual tls allocation and
initialization is deferred).
Call _dl_allocate_tls_init to allocate dtv and tls for the
new thread to be created, covering all currently loaded modules.
("currently loaded" means up to a generation count, the dtv and
tls for modules with larger generation count are allocated lazily
on first access).
dlclose can write the relevant globals too, tls access and
dl_iterate_phdr read them (i focus on dlopen and pthread_create
first, similar concerns affect these apis). dlmain can write
these globals too, but i think that it is single threaded
(early startup, no pthreads yet, can audit modules change that?).
In short: relevant globals (dl_tls_max_dtv_idx, dl_tls_generation,
dl_tls_dtv_slotinfo_list entries) are written by dlopen and
pthread_create reads them when initializing the dtv of a new thread.
Current concurrency design:
dlopen holds GL(dl_load_lock) while writing the relevant globals (using
volatile loads and stores). The lock serializes dlopen calls, but there
is no synchronization with pthread_create (other than mmap and mprotect
calls which are likely synchronized but not part of the c11 memory model).
pthread_create accesses the globals without locks and uses volatile loads:
dtv size and valid entries in the slotinfo list are determined based on
GL(dl_tls_max_dtv_idx) with the following asserted assumptions:
(a) slotinfo list have at least GL(dl_tls_max_dtv_idx) entries set up,
(b) slotinfo entries have generation count <= GL(dl_tls_generation).
These would be wrong even if all memory accesses were mo_seq_cst: slotinfo
entries are added after modid is assigned and the generation counter is
incremented even later.
Current design bugs in glibc:
(0) There are data races on the relevant globals. And there are asserts
that are not guaranteed (a,b above). This is what i was trying to fix.
(1) Any access to link_map struct is invalid if neither GL(dl_load_lock)
nor GL(dl_load_write_lock) are held, because a concurrent dlclose may
free the map, this affects:
pthread_create (_dl_allocate_tls_init reads map.l_tls_offset)
__tls_get_addr (_dl_update_slotinfo reads map.l_tls_offset)
(2) dlopen can reuse modids of dlclosed modules (and dlclose can decrement
GL(dl_tls_max_dtv_idx)), so multiple reads of a slotinfo entry may be
inconsistent if neither GL(dl_load_lock) nor GL(dl_load_write_lock) are
held (e.g. concurrent dlclose and dlopen between reading the .gen and
.map members of the slotinfo entry), this affects:
pthread_create (_dl_allocate_tls_init, reads .gen and .map)
__tls_get_addr (_dl_update_slotinfo, reads .gen and .map)
(3) tls access must be as-safe, but currently tls is allocated at first
access if the containing module was not loaded at the time the accessing
thread was created. This means __tls_get_addr may malloc or lock
GL(dl_load_lock) or modify the thread local dtv. This can cause deadlocks,
oom crashes or memory corruption if it is in a signal handler (malloc in
signal handler is also problematic for malloc interposition).
(4) allocation failures can oom crash in pthread_create and dlopen
(_dl_resize_dtv) and at tls access (various mallocs) or may leave
inconsistent state behind in dlopen (_dl_add_to_slotinfo or any
later call in dlopen that fails: tls changes are not rolled back).
(5) GL(dl_tls_generation) is not checked for overflow consistently and
slotinfo entries can get assigned an overflowed generation count, its type
is size_t and both dlopen and dlclose increment it, so overflow may be a
concern on 32bit targets.
(6) dlclose holds GL(dl_load_lock) while running ctors so if those ctors
wait for anything that may take the same lock there is a deadlock.
(0) can be fixed by adding atomics and removing invalid asserts:
It can be guaranteed that whatever GL(dl_tls_generation) is read in
pthread_create, all slotinfo entries with that generation will be read
too and the read GL(dl_tls_max_dtv_idx) is no less than the modids of
those entries. Then walking the slotinfo list until the end and only
considering ones up to the observed generation count would set up the
dtv consistently without races. Assuming there is no concurrent dlclose.
(I think my previous patch did this.)
OTOH (1) and (2) clearly point to the direction that pthread_create
should take the lock, in which case thread creation performance may go
down a bit and the influence of (6) increases (but that is broken anyway)
and we didn't do anything about __tls_get_addr. (I don't see other reason
to avoid taking the lock).
Fixing (3) or (6) would need significant redesign.
Fixing (4) and (5) may be possible independently.
I think these are the immediately available options:
- lock free pthread_create that is broken with concurrent dlclose,
- pthread_create that locks and may deadlock when synchronized with a ctor,
- or current brokenness (concurrent pthread_create and dlopen asserts).