This is the mail archive of the libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH v6] Also use l_tls_dtor_count to decide on object unload (BZ #18657)
- From: Torvald Riegel <triegel at redhat dot com>
- To: Siddhesh Poyarekar <siddhesh at redhat dot com>
- Cc: libc-alpha at sourceware dot org, roland at hack dot frob dot com, carlos at redhat dot com
- Date: Tue, 21 Jul 2015 22:29:27 +0200
- Subject: Re: [PATCH v6] Also use l_tls_dtor_count to decide on object unload (BZ #18657)
- Authentication-results: sourceware.org; auth=none
- References: <55A903FD dot 3060503 at redhat dot com> <1437413472-21674-1-git-send-email-siddhesh at redhat dot com>
On Mon, 2015-07-20 at 23:01 +0530, Siddhesh Poyarekar wrote:
> diff --git a/stdlib/cxa_thread_atexit_impl.c b/stdlib/cxa_thread_atexit_impl.c
> index 7a7d3d6..dc51ab5 100644
> --- a/stdlib/cxa_thread_atexit_impl.c
> +++ b/stdlib/cxa_thread_atexit_impl.c
> @@ -16,6 +16,50 @@
> License along with the GNU C Library; if not, see
> <http://www.gnu.org/licenses/>. */
> +/* CONCURRENCY NOTES:
> + The functions __cxa_thread_atexit_impl, _dl_close_worker and
> + __call_tls_dtors are the three main routines that may run concurrently and
> + access shared data. The shared data in all possible combinations of all
> + three functions are the link map list, a link map for a DSO and the link map
> + member l_tls_dtor_count.
OK. Perhaps just say a bit more specifically that this documents
concurrency for the thread-local atexit implementation (or whatever the
proper name for this component is).
> + __cxa_thread_atexit_impl takes the load_lock before modifying any shared
> + state and hence can concurrently execute multiple of its instances in
> + different threads safely.
I have a slight preference for "acquires" over "takes", but I know that
"take a lock" is used often.
If you're not specifically talking about only modifications, I think it's
better to use "accessing" instead of "modifying". The ++ has a load,
too, and it tells the reader that both loads and stores are meant, not
just stores.
Also, maybe say "multiple of its instances can safely execute
concurrently" to make it passive (ie, unspecified who/what executes
them).
> + _dl_close_worker takes the load_lock before modifying any shared state as
> + well and hence can concurrently execute multiple of its own instances as
> + well as those of __cxa_thread_atexit_impl safely.
> + __call_tls_dtors does not take the load lock, but reads only the link map
> + of the current DSO
Perhaps briefly say why this is okay?
> and its member l_tls_dtor_count. It has to ensure that
> + l_tls_dtor_count is decremented atomically
If it decrements l_tls_dtor_count, it doesn't only read this and the
link map.
Perhaps it's easier to understand if you start by saying that not all
accesses to l_tls_dtor_count are protected by a lock and that we thus
need to synchronize using atomics. Then perhaps briefly list the
respective accesses / functions.
> + Correspondingly, _dl_close_worker loads l_tls_dtor_count and if it is zero,
> + unloads the DSO, thus dereferencing the current link map.
Good. Perhaps say that this is what we want to do. This is the
abstract goal that we have / what we want to implement.
> Hence, for
> + concurrent execution of _dl_close_worker and __call_tls_dtors, it is
> + necessary that:
> + 1. The DSO unload in _dl_close_worker happens after l_tls_dtor_count
You mention just the variable, but not the effect, state transition, or
piece of program logic. (In contrast, "DSO unload" is an effect.)
> , i.e.
> + the l_tls_dtor_count load does not get reordered to after the DSO is
> + unloaded
I know what you are trying to say, but I think it's better if we don't
start talking about "reordering" because that's at somewhat the wrong
level. Reordering is what a compiler or a CPU might do, and we don't
want to reason about that. Instead, we want to start from the
high-level goal (eg, DSO unload doesn't happen if the DSO is still
used), and reason from there what we need in terms of happens-before
relations, and then implementation things like MOs.
> + 2. The link map dereference in __call_tls_dtors happens before the
> + l_tls_dtor_count dereference.
> + to ensure that we eliminate the inconsistent state where the DSO is unloaded
> + before it is dereferenced in __call_tls_dtors,
That's the piece of information to start with, I believe.
Then you can say that l_tls_dtor_count acts as something like a
reference count.
Then you can follow from that that DSO use happens before the decrement
of the count, which happens before we unload. We unload only if the
count is zero, so we end up with a typical release-store / acquire-load
pattern.
> which could happen if any of
> + the above two pairs of sequences were reordered.
> + To ensure this, the l_tls_dtor_count decrement in __call_tls_dtors should
> + have release semantics and the load in _dl_close_worker should have acquire
> + semantics.
I'm not sure which level of detail we want to use here. We could say
that we need the release/acquire pair so that the reads-from ensures a
happens-before. We could also assume that this pattern is well-known to
readers.
(IOW, that's a question for the community to decide; I don't know enough
about what others know or aren't too familiar with.)
> + Concurrent executions of __call_tls_dtors should only ensure that the value
> + is modified atomically; no reordering constraints need to be considered.
If you are talking about the decrement operation, then perhaps say that.
Saying that the order in which they decrement/increment doesn't matter
for us is useful information though.
> + Likewise for the increment of l_tls_dtor_count in
> + __cxa_thread_atexit_impl. */
> #include <stdlib.h>
> #include <ldsodefs.h>
> @@ -49,9 +93,11 @@ __cxa_thread_atexit_impl (dtor_func func, void *obj, void *dso_symbol)
> new->next = tls_dtor_list;
> tls_dtor_list = new;
> - /* See if we already encountered the DSO. */
> + /* We have to take the big lock to prevent a racing dlclose from pulling our
> + DSO from underneath us while we're setting up our destructor. */
> __rtld_lock_lock_recursive (GL(dl_load_lock));
> + /* See if we already encountered the DSO. */
> if (__glibc_unlikely (dso_symbol_cache != dso_symbol))
> ElfW(Addr) caller = (ElfW(Addr)) dso_symbol;
> @@ -62,16 +108,14 @@ __cxa_thread_atexit_impl (dtor_func func, void *obj, void *dso_symbol)
> program (we hope). */
> lm_cache = l ? l : GL(dl_ns)[LM_ID_BASE]._ns_loaded;
> - /* A destructor could result in a thread_local construction and the former
> - could have cleared the flag. */
> - if (lm_cache->l_type == lt_loaded && lm_cache->l_tls_dtor_count == 0)
> - lm_cache->l_flags_1 |= DF_1_NODELETE;
> - new->map = lm_cache;
> - new->map->l_tls_dtor_count++;
> + /* Relaxed since the only shared state here is l_tls_dtor_count and hence
"Relaxed MO", just to be a bit more clear.
> + there are no reordering issues. */
I believe relaxed MO is fine, but I don't really understand the reason
as you have stated it. I think the question to ask here is who observes
the effect (ie, the increment; IOW, which loads observe this store)?
And for those observers, are there any that need a relation between this
effect and what they do? Or if not, then say that.
> + atomic_fetch_add_relaxed (&lm_cache->l_tls_dtor_count, 1);
> __rtld_lock_unlock_recursive (GL(dl_load_lock));
> + new->map = lm_cache;
> return 0;
> @@ -83,19 +127,15 @@ __call_tls_dtors (void)
> while (tls_dtor_list)
> struct dtor_list *cur = tls_dtor_list;
> - tls_dtor_list = tls_dtor_list->next;
> + tls_dtor_list = tls_dtor_list->next;
> cur->func (cur->obj);
> - __rtld_lock_lock_recursive (GL(dl_load_lock));
> - /* Allow DSO unload if count drops to zero. */
> - cur->map->l_tls_dtor_count--;
> - if (cur->map->l_tls_dtor_count == 0 && cur->map->l_type == lt_loaded)
> - cur->map->l_flags_1 &= ~DF_1_NODELETE;
> - __rtld_lock_unlock_recursive (GL(dl_load_lock));
> + /* Ensure that the MAP dereference is not reordered below the
> + l_tls_dtor_count decrement. This ensures that it synchronizes with
> + the load in _dl_close_worker and keeps this dereference before the DSO
> + unload. See CONCURRENCY NOTES for more detail. */
> + atomic_fetch_add_release (&cur->map->l_tls_dtor_count, -1);
Good, but I'd not talk about reordering. Instead I'd say that we want
the MAP dereference to happen before we give it up and thus before
potential DSO unload. That states the high-level goal. Then cover the
low-level implementation using the second sentence above that you
already have.
Overall, with perhaps the exception of the reasoning about the relaxed
MO on the increment, and based on the conversation that I had with
Siddhesh, I think the synchronization in this patch is sound. Thus, I
think that we can also do more word-smithing on how we explain it after
the release (so that this doesn't hold up the release then).
I do think the word-smithing is useful, because good concurrency
documentation ensures that we all are more likely to not misunderstand
each other and understand what's going on in the code. But I guess that
we'll need some more examples and iterations before we have figured out
all the details of how to document this so that it's clear for everyone.