This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH][BZ #19329] Fix race between tls allocation at thread creation and dlopen
- From: Torvald Riegel <triegel at redhat dot com>
- To: Szabolcs Nagy <szabolcs dot nagy at arm dot com>
- Cc: "Carlos O'Donell" <carlos at redhat dot com>, GNU C Library <libc-alpha at sourceware dot org>, i dot palachev at samsung dot com, nd at arm dot com
- Date: Mon, 25 Jan 2016 14:03:30 +0100
- Subject: Re: [PATCH][BZ #19329] Fix race between tls allocation at thread creation and dlopen
- Authentication-results: sourceware.org; auth=none
- References: <568D5E11 dot 3010301 at arm dot com> <5693D908 dot 8090104 at arm dot com> <5695B7BF dot 7050000 at redhat dot com> <569E8050 dot 20606 at arm dot com> <1453424564 dot 4592 dot 57 dot camel at localhost dot localdomain> <56A28381 dot 1070707 at arm dot com>
On Fri, 2016-01-22 at 19:31 +0000, Szabolcs Nagy wrote:
> On 22/01/16 01:02, Torvald Riegel wrote:
> > important because it's our way to deal with not having atomic types
> > right now (and so the compiler is aware, too), and because it's
> > important for readers of the code because it alerts them that this is
> > concurrent code.
> >> + if (__builtin_expect (newgen == 0, 0))
> >> + _dl_fatal_printf (N_("\
> >> TLS generation counter wrapped! Please report this."));
> >> + /* Synchronize with the load acquire in _dl_allocate_tls_init.
> >> + See the CONCURRENCY NOTES there in dl-tls.c. */
> >> + atomic_store_release (&GL(dl_tls_generation), newgen);
> > Is there any reason you wouldn't want a release MO atomic increment
> > here?
> i wanted to avoid storing an overflowing counter.
> (didn't want to analyze that case)
If there are concurrent increments, not using an atomic increment will
loose updates. If that can happen, the code should point out why that's
Possible countermeasures are using a CAS loop (less scalable if there is
contention (is there?) but simple) or setting the overflow threshold low
enough and implementing an anti-overflow mechanism so that the maximum
number of threads that can run concurrently can never actually cause an
overflow (not simple, but also not too hard; see the new barrier, for
> >> + During dynamic TLS and DTV allocation and setup various objects may be
> >> + accessed concurrently:
> >> +
> >> + GL(dl_tls_max_dtv_idx)
> >> + GL(dl_tls_generation)
> >> + listp->slotinfo[i].map
> >> + listp->slotinfo[i].gen
> >> + listp->next
> >> +
> >> + where listp is a node in the GL(dl_tls_dtv_slotinfo_list) list. The public
> >> + APIs that may access them are
> >> +
> >> + Writers: dlopen, dlclose and dynamic linker start up code.
> >> + Readers only: pthread_create and __tls_get_addr (TLS access).
> > I would first describe what these functions / features intend to do and
> > what synchronization problem they have to solve before going into
> > details of how that's implemented. This should be sufficiently detailed
> > to answer all questions such as "why do I need this critical section
> > described below?".
> i probably don't know all the details of that.
That's okay given that this is a complex matter and we're still
investigating all its parts. Nonetheless, if you can't describe the
solution at a high level because you don't know all the details, it's
(1) a good indication that this needs more work (IOW, "if you can't
explain it, it may not be ready") and (2) worth pointing out in a
comment in the code so that everyone else is aware that this piece of
code is still WIP.
> >> + The writers hold the GL(dl_load_lock), but the readers don't, so atomics
> >> + should be used when accessing these globals.
> > This covers the low-level data races, but it doesn't yet explain why
> > you'd want the critical sections. Think about which atomicity and
> > ordering guarantees you need to make your high-level synchronization
> > scheme work.
> i can write more about the synchronizations i added,
> but not much about what other state dl_load_lock
Either should be an improvement. What I wanted to ask for is not just
more information, but also how it's presented and structured. For
example, knowing which state a lock "protects" is just a bit of the
information necessary; it often makes more sense to think about the
high-level atomicity and ordering constraints one needs first, and then
see how to solve that with mutual exclusion and locks, and then it falls
out which data (or rather logical state and state transactions!) are
"protected" by a lock.
> >> + dl_open_worker (called from dlopen) for each loaded module increases
> >> + GL(dl_tls_max_dtv_idx), sets the link_map of the module up, adds a new
> >> + slotinfo entry to GL(dl_tls_dtv_slotinfo_list) with the new link_map and
> >> + the next generation number GL(dl_tls_generation)+1. Then it increases
> >> + GL(dl_tls_generation) which sinals that the new slotinfo entries are ready.
> > You describe this as if it were sequential code. Is it, or do you need
> > more detail?
> yes this is sequential code in dlopen.
> (but scattered around in many functions so it
> is not immediately obvious)
> >> + This last write is release mo so previous writes can be synchronized.
> > Terminology: the matching acquire MO load synchronized with the release
> > MO write. The writes sequenced before the release MO store than happen
> > before stuff sequenced after the acquire MO load.
> > Also, uppercase MO so it's clear it's an abbreviation.
> >> -
> >> - /* Check if the current dtv is big enough. */
> >> - if (dtv[-1].counter < GL(dl_tls_max_dtv_idx))
> >> + size_t gen_count;
> >> + size_t dtv_slots;
> >> +
> >> + /* Synchronize with the release mo store in dl_open_worker, modules with
> >> + larger generation number are ignored. */
> >> + gen_count = atomic_load_acquire (&GL(dl_tls_generation));
> >> + /* Check if the current dtv is big enough. GL(dl_tls_max_dtv_idx) is
> >> + concurrently modified, but after the release mo store to
> > "after" in terms of which relation? This is a good example of a place
> > where you could probably refer to a part of the high-level
> > synchronization scheme.
> if the load of *_idx "happens after" the store of *_generation
> then it is an upper bound to modids at that generation.
The "happens after" is either enforced by something, and then you should
refer to that piece. Or if the loads reads from (as in the reads-from
relation in the memory model) the store, then it either may itself
enforce a happens-before, or use something else.
IOW, I still don't understand what you really mean :)
> >> + GL(dl_tls_generation) it always remains a modid upper bound for
> >> + previously loaded modules so relaxed access is enough. */
> >> + dtv_slots = atomic_load_relaxed (&GL(dl_tls_max_dtv_idx));
> >> - map = listp->slotinfo[cnt].map;
> >> + /* Synchronize with the release mo store in _dl_add_to_slotinfo in
> >> + dlopen, so the generation number read below is for a valid entry.
> > Isn't it in dl-tls.c?
> yes, but it is called from dlopen.
So please say that it is called from dlopen, not "in dlopen".
> >> + TODO: remove_slotinfo in dlclose is not synchronized. */
> >> + map = atomic_load_acquire (&listp->slotinfo[cnt].map);
> >> if (map == NULL)
> >> /* Unused entry. */
> >> continue;
> > It might be worthwhile to point out why missing out on a concurrent
> > update to map would be okay here, or why it can't actually happen.
> concurrent update of the relevant maps don't happen
> (except in dlclose.. but that bug is not fixed)
> only slotinfo entries that should be ignored here may
> change concurrently (and the dtv/tls of those are
> initialized lazily at first access).
> >> + size_t gen = listp->slotinfo[cnt].gen;
> >> + if (gen > gen_count)
> >> + /* New, concurrently loaded entry. */
> >> + continue;
> >> +
> >> /* Keep track of the maximum generation number. This might
> >> not be the generation counter. */
> >> - assert (listp->slotinfo[cnt].gen <= GL(dl_tls_generation));
> > See my comments elsewhere in the thread.
> i can keep the assert but the gen > gen_count cases
> must be skipped here so the only way the assert can
> fail is if the generation number overflows.
> (otherwise gen <= gen_count <= GL(dl_tls_generation) holds)
> >> - maxgen = MAX (maxgen, listp->slotinfo[cnt].gen);
> >> + maxgen = MAX (maxgen, gen);
> >> dtv[map->l_tls_modid].pointer.val = TLS_DTV_UNALLOCATED;
> >> dtv[map->l_tls_modid].pointer.is_static = false;
> >> @@ -518,11 +576,14 @@ _dl_allocate_tls_init (void *result)
> >> }
> >> total += cnt;
> >> - if (total >= GL(dl_tls_max_dtv_idx))
> >> + if (total >= dtv_slots)
> > Could this have changed in the meantime, and thus be large enough? You
> > change here what is potentially more than one memory access (depending
> > on what the compiler does to this, strictly speaking, sequential code)
> > to a single one in the beginning.
> dtv_slots at the beginning is already an upper bound to
> the relevant modids (and that's how much dtv we allocated),
> so we can finish the iteration if total visited modules
> grows beyond that.
> rereading it does not cause a problem, but it does not
> help either (can be smaller or larger as well, depending
> on what dlopen/dlclose happened concurrently, the
> slotinfo list does not shrink though)
While we can discuss the analysis here, eventually the outcome needs to
be included as comments in the code, as part of the patch. If the
intent / high-level scheme isn't obvious to see (which is the case given
that we have to discuss it here), then it should be documented, IMO.
> >> @@ -916,7 +977,7 @@ _dl_add_to_slotinfo (struct link_map *l)
> >> the first slot. */
> >> assert (idx == 0);
> >> - listp = prevp->next = (struct dtv_slotinfo_list *)
> >> + listp = (struct dtv_slotinfo_list *)
> >> malloc (sizeof (struct dtv_slotinfo_list)
> >> + TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
> >> if (listp == NULL)
> > There's another store to dl_tls_generation around here that you're not
> > accessing atomically.
> yes, the oom code path
> that looked broken logic: it does not check for overflow (bug)
> it leaks memory (another bug) and the comment roughly says
> "the application will crash anyway" so i wanted to avoid messing
> with it.
> but yes that should be atomic too.
> >> @@ -939,9 +1000,15 @@ cannot create TLS data structures"));
> >> listp->next = NULL;
> >> memset (listp->slotinfo, '\0',
> >> TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
> >> + /* _dl_allocate_tls_init concurrently walks this list at thread
> >> + creation, it must only observe initialized nodes in the list.
> > Another bit for the high-level scheme.
> >> + See the CONCURRENCY NOTES there. */
> >> + atomic_store_release (&prevp->next, listp);
> >> }
> >> /* Add the information into the slotinfo data structure. */
> >> - listp->slotinfo[idx].map = l;
> >> listp->slotinfo[idx].gen = GL(dl_tls_generation) + 1;
> >> + /* Synchronize with the acquire load in _dl_allocate_tls_init.
> > Say why you need to enforce this, or point to part in the higher-level
> > scheme where this is explained.
> _dl_allocate_tls_init walks the slotinfo list beyond
> the entries it actually cares about, the extra entries
> are either unused (zero) or initialized here with
> large gen number.
> so those extra entries must be possible to recognize.
> i used sync so either map==0 or gen>gen_count is
> visible to _dl_allocate_tls_init for new entries.
> however now i see yet another bug here: the gen can
> overflow and become 0 that way too, so the right fix
> is to check for (map==0 || gen==0 || gen>gen_count)
> and no sync is needed only relaxed atomic access.
> but i have to verify gen==0 cannot happen otherwise
> so it's ok to skip it in _dl_allocate_tls_init.
You'll probably have noticed that I'm asking several small corner-case
questions. I my experience, doing that is helpful to figure out whether
the synchronization really works correctly all the time. And a good way
to do this by yourself, before you send a patch, is to try to write a
thorough explanation of the concurrent code you're working on. I've
always found this helpful. The additional benefit is that you'll have
comments ready so others can understand your code easier. If you have a
thorough explanation, you also already have answers for all the little
questions, which may or may not(!) have obvious answers. Therefore, I'd
strongly suggest to work on explanations too, even if that doesn't yield
a larger LOC count. But the complexity of concurrent code cannot be
measured in LOC anyway.