This is the mail archive of the libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH v6] Also use l_tls_dtor_count to decide on object unload (BZ #18657)
- From: Torvald Riegel <triegel at redhat dot com>
- To: "Carlos O'Donell" <carlos at redhat dot com>
- Cc: Siddhesh Poyarekar <siddhesh at redhat dot com>, libc-alpha at sourceware dot org, roland at hack dot frob dot com
- Date: Mon, 03 Aug 2015 17:36:29 +0200
- Subject: Re: [PATCH v6] Also use l_tls_dtor_count to decide on object unload (BZ #18657)
- Authentication-results: sourceware.org; auth=none
- References: <55A903FD dot 3060503 at redhat dot com> <1437413472-21674-1-git-send-email-siddhesh at redhat dot com> <55B14684 dot 7080601 at redhat dot com> <1437727964 dot 20974 dot 21 dot camel at localhost dot localdomain> <55B28CF8 dot 7030005 at redhat dot com>
On Fri, 2015-07-24 at 15:07 -0400, Carlos O'Donell wrote:
> On 07/24/2015 04:52 AM, Torvald Riegel wrote:
> > On Thu, 2015-07-23 at 15:54 -0400, Carlos O'Donell wrote:
> >> On 07/20/2015 01:31 PM, Siddhesh Poyarekar wrote:
> >>> v6: Juggled around the test case code a bit so that it uses tst-tls-atexit.c
> >>> with a couple of macros.
> >>> When a TLS destructor is registered, we set the DF_1_NODELETE flag to
> >>> signal that the object should not be destroyed. We then clear the
> >>> DF_1_NODELETE flag when all destructors are called, which is wrong -
> >>> the flag could have been set by other means too.
> >>> This patch replaces this use of the flag by using l_tls_dtor_count
> >>> directly to determine whether it is safe to unload the object. This
> >>> change has the added advantage of eliminating the lock taking when
> >>> calling the destructors, which could result in a deadlock. The patch
> >>> also fixes the test case tst-tls-atexit - it was making an invalid
> >>> dlclose call, which would just return an error silently.
> >>> I have also added a detailed note on concurrency which also aims to
> >>> justify why I chose the semantics I chose for accesses to
> >>> l_tls_dtor_count. Thanks to Torvald for his help in getting me
> >>> started on this and (literally) teaching me how to approach the
> >>> problem.
> >>> Change verified on x86_64; the test suite does not show any
> >>> regressions due to the patch.
> >> This version looks awesome. Thanks for the awesome work!
> >> OK for 2.22 with suggestions (see below).
> >> I had lots of fun coming up with orderings between the release and acquire
> >> pairs and the mutex locks which I treated as release+acquire barriers
> >> for the purpose of the analysis.
> > Don't simply treat lock/unlock as fences (if that's what you meant by
> > "barriers"). Technically, they behave differently. Whereas a fence can
> > work in combination with other atomics accesses on other memory
> > locations, unlock calls only synchronize with later lock calls on the
> > same mutex (there is a total order of all lock/unlock calls per mutex).
> > Thus, you can model lock/unlock as acquire/release operations on the
> > mutex (ie, how one would implement them), but they are not fences.
> But aren't acquire/release operations simply made up of groups of fences
> themselves? Like LoadLoad+LoadStore == Acquire? Such things are still
> abstract and not machine specific and still express the same semantics?
The specific implementation (ie, mostly HW in this case, but the compiler
is still involved) will likely use fence(-like) instructions on most(?)
architectures that need them. But there's nothing that prohibits a
system from distinguishing between fences that work with all future/past
memory accesses and acquire/release operations that just work in
conjunction with the specific memory access they are attached to.
Thus, to stay on the portable side, don't assume that fences and
acquire/release/... operations are necessarily the same.
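To make the distinction concrete, here is a minimal C11 sketch (names and the publish/consume framing are illustrative, not glibc code): a release *operation* attaches the ordering constraint to one specific store, while a release *fence* orders all prior accesses with respect to any subsequent atomic store.

```c
#include <stdatomic.h>

atomic_int data = 0;
atomic_int flag = 0;

/* Release *operation*: the constraint is attached to this one store
   to `flag`; it only pairs with an acquire load of `flag`. */
void publish_with_operation(void) {
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    atomic_store_explicit(&flag, 1, memory_order_release);
}

/* Release *fence*: orders all prior memory accesses with respect to
   any later atomic store, on any location, not just one access. */
void publish_with_fence(void) {
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

/* Acquire load of `flag`; if it observes the release store (or a
   store after the release fence), `data` is guaranteed visible. */
int consume(void) {
    if (atomic_load_explicit(&flag, memory_order_acquire))
        return atomic_load_explicit(&data, memory_order_relaxed);
    return -1;
}
```

Both publishers give the same guarantee to `consume` here, but only the fence version would also order `data` against stores to *other* flags; that extra reach is what makes a fence different from an operation.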
> >> My own rough review notes for reference:
> >> __cxa_thread_atexit_impl
> >> atomic_fetch_add_relaxed l_tls_dtor_count +1
> >> - dl_load_lock held.
> >> - Excludes any unloads so forbids __dl_close_worker
> >> - Does not forbid a thread being killed.
> > OK.
> >> - Deferred cancel happens after lock release
> >> so not concurrent with anything else.
> >> And thread eventually calls __call_tls_dtors
> >> with release barrier.
> >> - dl_load_lock is release+acquire barrier, which allows relaxed
> >> ordering atomic_fetch_add to l_tls_dtor_count, because after
> >> unlock you will either see consistent MAP or you won't see
> >> MAP at all since load/stores can't leak past the unlock.
> > This isn't correct. lock/unlock will synchronize with other lock/unlock
> > and thus may affect ordering for l_tls_dtor_count accesses, but they
> > aren't fences.
> In practice though aren't the lock/unlock operations fences?
They will most likely be atomic memory accesses to some part of the lock
data structure that use acquire/release memory orders. But not quite
fences as they are defined by the C11 memory model.
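A minimal spinlock sketch (illustrative, not glibc's lock implementation) shows what "atomic memory accesses to some part of the lock data structure" means: the acquire is on the exchange that takes the lock, the release on the store that drops it, both on the lock word itself.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } spinlock_t;

/* Acquire order on the successful exchange: everything the previous
   holder wrote before unlocking is visible after we take the lock. */
static void spin_lock(spinlock_t *l) {
    while (atomic_exchange_explicit(&l->locked, true,
                                    memory_order_acquire))
        ; /* spin until the previous holder releases */
}

/* Release order: makes this critical section's writes visible to the
   next thread that acquires the same lock word - and only to it. */
static void spin_unlock(spinlock_t *l) {
    atomic_store_explicit(&l->locked, false, memory_order_release);
}

static int counter = 0;

/* Plain (non-atomic) access is safe because the lock serializes it. */
void add_one(spinlock_t *l) {
    spin_lock(l);
    counter++;
    spin_unlock(l);
}
```

The synchronization is tied to the lock word: an unlock here says nothing about unrelated atomics elsewhere, which is exactly why treating it as a C11 fence overstates the guarantee.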
> How else does a locking thread get to see the effects of what has
> happened before the last unlock?
Maybe we have a terminology misunderstanding here. In the C11 model,
fences are what it defines fences to be (see above). "Memory orders"
are what you use to require specific ordering constraints to take
effect. On the level of the arch-specific code, both may be implemented
using HW fence instructions.
> > My understanding of how this works based on the discussions with
> > Siddhesh is that due to the critical sections using dl_load_lock,
> > __cxa_thread_atexit_impl and __dl_close_worker are mutually exclusive
> > and totally ordered. Calling __cxa_thread_atexit_impl after
> > __dl_close_worker isn't legal, so __dl_close_worker will observe any
> > increment because of synchronization via dl_load_lock (not because of
> > the unlock/lock being fences).
> Your description in the paragraph above is correct, but you could break
> down the lock into its behaviour at the fence level and reason about
> it there also, which is what I did. If that's not a good thing to do
> then I'm happy to learn.
I don't think reasoning about the implementation-level equivalent of
locks is really helpful. I'd say it's easier to just be aware that
critical sections using the same lock instance are mutually exclusive
and totally ordered wrt themselves, and that this order is reflected in
happens-before.
However, if you think it's easier to break this down to the level of
acquire/release pairs that synchronize with each other, then that's fine
too. I wouldn't do it, personally, but if it works for you that's fine.
Just don't treat locks as equal to what C11 fences are :)
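A simplified model of the scheme under discussion (names are illustrative, not glibc's actual internals): because both the registration and the unload check run inside critical sections on the same lock, the relaxed increment of the destructor count is ordered with respect to the unload decision by the lock alone.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <pthread.h>

/* Stand-in for GL(dl_load_lock). */
static pthread_mutex_t load_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the relevant field of struct link_map. */
struct link_map_model { atomic_size_t l_tls_dtor_count; };

/* __cxa_thread_atexit_impl analogue: relaxed suffices because the
   increment happens inside the critical section; the lock, not the
   atomic's memory order, provides the ordering wrt unloads. */
void register_dtor(struct link_map_model *map) {
    pthread_mutex_lock(&load_lock);
    atomic_fetch_add_explicit(&map->l_tls_dtor_count, 1,
                              memory_order_relaxed);
    pthread_mutex_unlock(&load_lock);
}

/* __dl_close_worker analogue: the object may be unloaded only if no
   pending TLS destructor still references it. */
bool may_unload(struct link_map_model *map) {
    pthread_mutex_lock(&load_lock);
    size_t count = atomic_load_explicit(&map->l_tls_dtor_count,
                                        memory_order_relaxed);
    pthread_mutex_unlock(&load_lock);
    return count == 0;
}
```

This captures the point above: the lock's mutual exclusion and total order make the count observable to the close worker, with no need to treat lock/unlock as fences.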
> Are you suggesting simply that I consider locks at a different level
> of abstraction from both acquire/release atomic operations, and from
> fences?
That would be my suggestion, yes (see above). But, YMMV.