This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PR18457] Don't require rtld lock to compute DTV addr for static TLS


On Wed, 2015-06-03 at 10:20 -0300, Alexandre Oliva wrote:
> On Jun  3, 2015, Torvald Riegel <triegel@redhat.com> wrote:
> >> -  /* Make sure that, if a dlopen running in parallel forces the
> >> -     variable into static storage, we'll wait until the address in the
> >> -     static TLS block is set up, and use that.  If we're undecided
> >> -     yet, make sure we make the decision holding the lock as well.  */
> >> -  if (__glibc_unlikely (the_map->l_tls_offset
> >> -			!= FORCED_DYNAMIC_TLS_OFFSET))
> >> +  /* If the TLS block for the map is already assigned to dynamic or to
> >> +     static TLS, avoid the lock.  Be careful to use the same value for
> >> +     both tests; if we reloaded it, the second test might mistake
> >> +     forced dynamic for an offset.  Now, if the decision hasn't been
> >> +     made, take the rtld lock, so that an ongoing dlopen gets a chance
> >> +     to complete, and then retest; if the decision is still pending,
> >> +     force the module to dynamic TLS.  */
> >> +  ptrdiff_t offset = atomic_load_relaxed (&the_map->l_tls_offset);
> 
> > This should document why relaxed MO is sufficient
> 
> The comments I added attempted to do so.  Can you please try to indicate
> what appears to be missing in them?
> 
> As for why we need relaxed MO, the first question is why we need atomics
> to begin with.  We don't really need anything special there, we just
> need to make sure we perform the newly-added test using the same value
> used in the preexisting unguarded test.

So you think a reload by the compiler would be bad.  This can only be
bad if there is concurrent modification, potentially; either in a signal
handler by the same thread (ie, reentrant), or by another thread.  Thus,
therefore we need the atomic access; using a plain load would be a data
race, and we don't do that in new code or when fixing code.

> I was tempted to load the value
> into a var and use the asm("":"+X"(var)) idiom, to make sure the var is
> not reloaded, but I figured a relaxed atomic load would express that
> more nicely.

Yes.  This is concurrency, and we have atomic accesses for it :)

> Now, if you were to ask me for formal proof that relaxed is enough,

I'm not asking for a formal proof, I'm looking for an explanation that
clearly tells others looking at this code in the future why this is
supposed to work.  Relaxed accesses do work in some cases, but in many
others they aren't sufficient; therefore, we want to document why they
are sufficient in the particular case.

> I
> would have to tell you it probably isn't.  Although the memory used for
> l_tls_offset in a link map will necessarily be initialized by a thread
> holding the load lock, if that memory had been used for anything else
> before, another thread might still access a previous value without an
> acquire, and if what it finds there happens to be
> FORCED_DYNAMIC_TLS_OFFSET, we'd get unexpected behavior.

AFAIU, you seem to speak about memory reuse unrelated to this
specifically this particular load, right?  If that's a concern, let's
open a bug or such for this.  But that sounds like an issue
independently of whether the specific load is an acquire or relaxed
load.  Or do I misunderstand you?

We established before that you want to prevent reload because there are
concurrent stores.  Are these by other threads?

If so, are there any cases of the following pattern:

storing thread;
  A;
  atomic_store_relaxed(&l_tls_offset, ...);

observing thread:
  offset = atomic_load_relaxed(&l_tls_offset);
  B(offset);

where something in B (which uses or has a dependency on offset) relies
on happening after A?

If there are such cases, we need a release store and acquire load,
unless we can get around that for other reasons.  If there are none,
this is a reason why a relaxed MO load can be sufficient.
Ideally, there would be documentation of what the major happens-before
orderings are that the TLS code relies on, and you could just point to
those and say that those don't apply here.  In absence of that, I
suggest just briefly describing why such dependencies on other things
happening before do not exist.  For example, if the memory location
pretty much exists on its own (logically) and there are no dependencies
to other stuff (e.g., a statistics counter), then relaxed is fine.

> That said, we
> seem to have got along fine with a non-atomic load, so making it a
> relaxed atomic load doesn't make it any worse.

It won't -- but if you're fixing it anyway, it won't hurt to do it right
or at least leave a code comment that something may need to be fixed or
reviewed in the future (e.g., "/* XXX Don't know wehther relaxed MO is
sufficient.  */").  You must have been thinking about it to decide you
don't want the reload -- it would be good if could finish that.

> This is just one out of many examples of how the current TLS
> implementation makes assumptions about the underlying memory model that
> are not documented and that are hardly guaranteed by the language or by
> the broad variation of target architectures.  It's like we're driving,
> blind, at high speed, on a busy freeway, facing the wrong way.  Yet
> somehow it at least appears to work, and I guess most of us are too
> scared to touch it in any significant way ;-)
> 
> I've already mentioned other fragile aspects in the way we deal with
> slotinfo, in how DTV growth is AS-Unsafe, in how dynamic TLS Descriptors
> are using volatile where atomics would have been a better choice (they
> didn't exist back then :-), and the list of suspicious behavior related
> with TLS just keeps growing.  I like your goal of getting it documented
> and adjusted to current standards, but given our differences in
> terminology and background, I very much doubt any such docs coming from
> me will meet your expectations.
> 
> Perhaps the best way to go about this project is to make it Q&As: you
> ask about the properties of something you're trying to document or clean
> up, I (or someone else) answer them, then, after any further
> clarifications, you turn that into documentation and code changes.
> Would you like to do that?

I'd need to find the time to do that first :)  I also believe that many
people in the project should be familiar with how we do synchronization.
With that in mind, it would seem efficient if people already familiar in
a certain area within glibc would update the synchronization in this
area.

Much of it is knowing how synchronization is *meant* to work --
implementing it properly is to a large extent not more than following
some rules how to do it and document it.

In a way, we're doing the Q&A already:  :)
I'm trying to find out what you know about the intent behind the TLS
synchronization, and I'm trying to work with you on putting this into
words that are precise and on top of the memory model we want to use.

> 
> >> __rtld_lock_lock_recursive (GL(dl_load_lock));
> 
> > Why do we need the lock at all?
> 
> It's right there in the comments, both before and after the change,
> though I tried to make the reason more immediately apparent.  Hint:
> concurrent dlopen.
> 
> > Are we protecting just access to
> > l_tls_offset or to other memory locations as well?
> 
> We must wait for dlopen to complete.  It might assign the module to
> static TLS if some IE relocation requires it.

(I'm asking because I'm trying to find out whether there are those B
depends on A dependencies I mentioned above, so that we know what to
document...)

Does dlopen just have to decide about this value (and is this value
pretty much "free-standing"), or does it have to do other initialization
or other effects logically related to this value (ie, is there an A that
matters for B)?  If not, we can say that in our documentation.

If it's really just about waiting for dlopen because dlopen absolutely
has make the decision, than the lock seems right (and we have our
documentation).

If only one of dlopen or we here have to decide about the value, then
using a CAS both here and dlopen could work as well.

> >> -      if (__glibc_likely (the_map->l_tls_offset == NO_TLS_OFFSET))
> >> +      offset = the_map->l_tls_offset;
> 
> > l_tls_offset is effectively concurrently accessed data,
> 
> > Likewise for the store below.
> 
> Yeah, but these are guarded by the lock.

But there are concurrent (atomic) loads to it, right?  Therefore, this
would be a data race, and we don't do those in new code or in code that
we change.

> Sure, to be absolutely safe, we want all loads and stores to be atomic
> (current code assumes aligned word accesses are relaxed atomic, with
> acquire and release implied by the locks), but I'm not inclined to
> revisit all that just to fix this regression.  A minimal change is
> better for that.

I disagree.  You added an atomic load on the consumer side (rightly
so!), so you should not ignore the producer side either.  This is in the
same function, and you touch most of the lines around it, and it's
confusing if you make a partial change.  Doing The Right Thing here is
not a lot of overhead.  Splitting this into a separate patch seems to be
more work than just making the change in your patch.

> 
> >> +static int running = 1;
> >> +  while (running)
> 
> > That's a data race with the modification do_end, or isn't it?  Use
> > atomics, a semaphore (with trywait), or trylock.
> 
> Andreas, any preferences as to how to adjust your (?) testcase to meet
> Torvald's standards?

Let me point out that we do have consensus in the project that new code
must be free of data races.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]