fixing dynamic tls races (BZ 19329 and friends)

Szabolcs Nagy
Thu Dec 24 16:55:27 GMT 2020

i'm trying to respin my patches for bug 19329 but
there may be disagreement about the approach and i may
not have time to finish this i wrote down a summary:

test case:
(nptl/tst-stack4 sometimes shows the issue too)

in short, the crux of the problem is that dtv updates
at thread creation and tls access time use dynlinker
internal state that is normally protected by locks.
this seems to be by design (the state is versioned by
a generation counter that can allow consistent view
without locks), but the synchronization is missing.

the deeper i dig into this the more issues pop up so i
mapped out some of them that can be relevant to a fix:

1) data races: tls related global dynamic linker state
that is modified under locks in dlopen and dlclose are
accessed without locks or atomics in pthread_create
and at tls access. (early startup does not take locks
either but that is fine: it is single-threaded). one
particular issue is that link_map structs that are
accessed without locks may be concurrently freed by
dlclose, this cannot be fixed just by using atomics.
  GL(dl_tls_generation) - size_t gen counter
  GL(dl_tls_max_dtv_idx) - size_t max modid
  GL(dl_tls_dtv_slotinfo_list) - (gen,map) list
    link_maps are accessed via this list.
slotinfo changes under lock:
  _dl_add_to_slotinfo - dlopen
  remove_slotinfo - dlclose
slotinfo and link_map access without locks:
  _dl_update_slotinfo - tls access
  tls_get_addr_tail - tls access
  _dl_allocate_tls_init - pthread_create
(similar story for generation counter and modid)
  BZ 19329 - pthread_create vs dlopen
  BZ 27111 - pthread_create vs dlclose

2) fail safety: oom failures can abort pthread_create,
dlopen and tls access. tls access cannot poropagate
failures so it must not call failing apis, but lazy
tls and dtv allocation is a core part of the current
design. there is another abort issue: the generation
counter has overflow checks, which can realistically
trigger on 32bit targets. (64bit counter can fix this)
oom abort:
  _dl_resize_dtv - tls access, pthread_create, dlopen
  allocate_and_init - tls access
  _dl_add_to_slotinfo - dlopen
gen count abort:
  update_tls_slotinfo - dlopen
  dl_close_worker - dlclose
  BZ 16134 - oom abort
  BZ 27112 - gen overflow

3) as-safety: tls access should be as-safe, but now it
may call malloc/realloc/free or take the dlopen lock.
  BZ 16133

i tried to fix BZ 19329 4 years ago without changing
the design, just introducing atomics where necessary.
the outcome is a bit hairy and does not solve the
dlclose problem (BZ 27111). (one issue with this is
that all accesses to the shared state has to be atomic
even the ones that are under locks, but sometimes
those can be relaxed atomic since there is no further
ordering requirement other than what is enforced by
the locks. unfortunately these accesses are scattered
around many places. but i still think this is a viable

another approach is to take the dlopen lock at thread
creation and tls access time. this increases the
contention on that lock and may cause new deadlocks
because user code (interposed malloc and ctors) may
run while that lock is held. however this solution is
easier to reason about.

with some design changes the solution may get simpler:

is it necessary to free a link_map with static tls at
dlclose time? accessing such link_maps are the main
trouble at thread creation time (because static tls
has to be initialized early (?)), if dlclosing such
libs are rare then simply not freeing them avoids the
issues at a small cost. (however in glibc there is
dlmopen and static tls optimizations which make it
less obvious how bad this is.)

tlsdesc allows not only the allocation of tls/dtv, but
the symbol lookup and tls descriptor setup to be done
lazily too. this means allocating and accessing
map->l_mach.tlsdesc_table under locks, slotinfo list
walking and various other shared memory accesses at
tls access time. on arm and aarch64 this got removed
because it does not allow fast tls access with the
weak memory model, but on x86 lazy tlsdesc reloc is
still the default. (removing this does not fully fix
any of the bugs: the dtv update at tls access time
still has all the issues, but changing this would
reduce the complexity).

what is the verdict on as-safety of (dynamic) tls
access? currently this is not supported but it is not
clear if i can introduce new locks there or not.

can we avoid the generation count overflow checks if
it is 64bit? (it only counts up on dlopen/dlclose
time, i think it cannot realistically overflow.)

if i get time to dust down my old patches i will post
them, meanwhile feedback is welcome.

More information about the Libc-alpha mailing list