This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PR18457] Don't require rtld lock to compute DTV addr for static TLS


On Tue, 2015-06-09 at 19:19 -0300, Alexandre Oliva wrote:
> On Jun  9, 2015, Torvald Riegel <triegel@redhat.com> wrote:
> > You said there are no other dependencies, and we just want to reach
> > consensus on the final value of l_tls_offset.  So, from the
> > perspective of just this synchronization state machine you gave,
> > there's no other data that's relevant.  Now you say there is other
> > stuff.  What's true?
> 
> The latter.  When I wrote that, I was convinced we'd covered all the
> cases of initialization performed by dlopen, because I had failed to
> take into account the cross-thread static TLS area initialization
> performed by a subsequent dlopen that assigns a previously-loaded TLS
> variable to static TLS.
> 
> 
> It looks like we could still avoid the lock in tls_get_addr by
> reorganizing dlopen to choose the offset, initialize and release the
> static blocks, and only then CAS the offset, failing if l_tls_offset was
> changed along the way.

Seems so, based on what you wrote.  And then we'd need the CAS to use
release MO and the loads that may read-from that CAS to use acquire MO.
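To make that concrete, here is a minimal sketch of the ordering I mean,
using C11 atomics rather than the glibc-internal atomic macros, and with
placeholder names (publish_static_tls, read_tls_offset, NO_TLS_OFFSET),
so this is not the actual dlopen / tls_get_addr code:

  #include <stdatomic.h>
  #include <stdbool.h>

  /* Placeholder stand-in for the link map's l_tls_offset field;
     NO_TLS_OFFSET is a placeholder for "offset not decided yet".  */
  #define NO_TLS_OFFSET 0
  static _Atomic long l_tls_offset = NO_TLS_OFFSET;

  /* dlopen side: initialize the static TLS blocks of all threads first,
     then publish the chosen offset with a release-MO CAS.  The CAS fails
     if another thread decided the offset in the meantime.  */
  bool
  publish_static_tls (long chosen_offset)
  {
    long expected = NO_TLS_OFFSET;
    /* ... initialize and release the static TLS areas here ... */
    return atomic_compare_exchange_strong_explicit (&l_tls_offset,
                                                    &expected, chosen_offset,
                                                    memory_order_release,
                                                    memory_order_relaxed);
  }

  /* tls_get_addr side: the acquire-MO load pairs with the release-MO
     CAS, so a thread that observes the published offset also observes
     the initialized static TLS blocks.  */
  long
  read_tls_offset (void)
  {
    return atomic_load_explicit (&l_tls_offset, memory_order_acquire);
  }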

> > Remember the loads on tile that got mentioned in a previous discussion
> > we had?
> 
> 'fraid I don't.  Reference?

I didn't find the email thread after looking for a while.  What I
remember is that tile had operations that are not truly atomic (IIRC,
don't necessarily reload a whole cache line from memory, or some such).

> >> - reloads would not be a problem for the pattern used in the second
> >> version of the patch
> 
> > Yeah, one could speculate about what a compiler may or may not do in
> > this *specific* piece of code.  But that's not the level of abstraction
> > we want to use as base for our reasoning.
> 
> What is clear to me is that our reasonings and thought frameworks are so
> different that your personal preferences don't apply to my way of
> thinking, and vice versa.  Which is why I'm ok with answering questions
> you may have about synchronization patterns I'm familiar with in GNU
> libc, but not at all ok with writing the documentation.  From our
> discussions so far, I am sure any such documentation I write will be
> regarded as useless for you.

I won't say it's useless if you try to convey knowledge you have to
others.  I may disagree with how it's done or expressed, or not
understand what you're trying to say -- but even then I'd consider it
input and the start of something that just needs further iteration.

> Because you and I think in different
> terms, different primitives, different abstraction layers.  I'd build
> the reasoning from my perspective, and it wouldn't match yours.  And
> vice-versa.

I don't think this is a question of my perspective vs. your perspective.
Eventually, we want to have consensus in the project about how we,
collectively, document and reason about concurrency.  I've been arguing
for what I think are the right layers of abstraction, terminology,
memory model, etc., based on my experience in this field.  If that turns
out to be not what works best for the project, that's fine, but we need
to discuss this.  Everyone may have a favourite way of thinking, but the
project still needs a common "concurrency speak".

> > For example, assume the compiler is aware of the code for
> > "__rtld_lock_lock_recursive (GL(dl_load_lock))" and knows that it won't
> > access l_tls_offset in any way (which it doesn't).  You access
> > l_tls_offset both inside and outside of a critical section, so by
> > data-race-freedom, there must not be concurrent write accesses.  So it
> > does not actually have to reload l_tls_offset inside of the critical
> > section, but use the value of your initial load.  This correct and
> > reasonable-for-C11 compiler optimization breaks your synchronization.
> 
> Correct and reasonable-for-C11 except for the bold and bald assumption
> that a lock operation is not a global acquire.  The compiler is
> forbidden from removing the second load because of the existence of the
> lock.  Now, that requirement comes from POSIX, not from the C standard.

POSIX doesn't guarantee anything about plain memory accesses that do
have data races, right?  This isn't about the lock itself and whether it
"synchronizes memory", but about the data race and what the compiler can
do based on assuming DRF programs.
If the accesses to l_tls_offset were done through sem_getvalue and
sem_post (or were relaxed-MO atomics), the compiler would have to reload
because of the lock acquisition and the acquire-MO load that is part of
it.
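As a hedged illustration (with placeholder names, not the actual
l_tls_offset code), the double-checked pattern with relaxed-MO atomics
would look roughly like this:

  #include <pthread.h>
  #include <stdatomic.h>

  enum { UNKNOWN, POSITIVE, NEGATIVE };
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static _Atomic int state = UNKNOWN;

  void
  maybe_mark_negative (void)
  {
    /* A relaxed-MO atomic load may observe concurrent stores without
       that being a data race, so the compiler cannot assume the value
       is still the same after acquiring the lock.  */
    if (atomic_load_explicit (&state, memory_order_relaxed) != NEGATIVE)
      {
        pthread_mutex_lock (&lock);
        /* Re-read under the lock; another thread may have changed it.  */
        if (atomic_load_explicit (&state, memory_order_relaxed) == UNKNOWN)
          atomic_store_explicit (&state, NEGATIVE, memory_order_relaxed);
        pthread_mutex_unlock (&lock);
      }
  }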

> >> - speculative stores that could affect this pattern are not permitted by
> >> the relevant standards AFAICT
> 
> > The standard permits to speculatively store a value, if the target is
> > not volatile and the speculative store does not create a data race.
> 
> *if* the abstract machine would have stored something in the variable to
> begin with.

No, not generally.  The program must behave as if executed by the
abstract machine.  But observable behavior is defined as the accesses to
volatile-qualified variables and I/O, so the program is allowed to
speculatively store if it doesn't affect the volatile / I/O behavior of
the program.

That's why I said the target must not be volatile.  If adding a
speculative store does not create a data race (and there are different
ways for a compiler to detect that), then there's no other thread that
can observe the speculative store.

Read-only mapped memory isn't considered by the standard, but compilers
may consider it.  Even then, this works at page granularity, and
compilers can detect whether something else in the page might be stored
to.

> In the case at hand, it wouldn't, so, no speculative
> writes.

Just to be clear, this is the code snippet we're talking about:
  if (*p != NEGATIVE) {
    take lock;
    if (*p == UNKNOWN)
      *p = NEGATIVE;
    release lock;
  }

Your program told the compiler that there are no concurrent
modifications of the variable by accessing *p inside and outside of the
critical section.  That is, only we modify it while running this piece
of code.  That means the compiler can ignore the lock, and we end up
with this, conceptually:
// *p must be UNKNOWN or POSITIVE, due to the first if.
if (*p == UNKNOWN)
  *p = NEGATIVE;

(And assume the compiler doesn't try to support read-only mapped memory,
or it observed that elsewhere, the program writes to the page p is part
of.)
Then, the compiler can speculatively store NEGATIVE to *p, and fix up
afterwards if it was wrong.
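Conceptually, and just as an illustration of what the standard permits
(not something a particular compiler is claimed to emit), the
transformed code could be:

  /* *p is known to be UNKNOWN or POSITIVE at this point.  */
  int old = *p;
  *p = NEGATIVE;        /* speculative store */
  if (old != UNKNOWN)
    *p = old;           /* undo the speculation if it was wrong */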

> > No, this is simply not true in general.  You can argue about likelihood
> > in this *particular* case, but then you're doing just that.
> 
> Indeed, that's just what I'm doing.

But that's not useful.  First, we want rules that apply and are safe in
general, not rules that require case-by-case reasoning.  Second, we
don't want to depend on something being unlikely -- we want the thing to
always work.

> > If you think it's unlikely compilers will try to optimize, have a look
> > at this:
> > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html
> 
> >> Since Carlos and Siddhesh took this over, I'll leave it for them too.
> 
> > So you will stop working on this?
> 
> Once Carlos wrote he and Siddhesh would take care of the issue, I did.
> 
> It doesn't mean I'm not willing to share the knowledge I got from
> studying the intricacies of TLS synchronization any more.

And I appreciate that.

> Now that the patch is withdrawn because it is broken, we can even focus
> on more productive questions and answers about it.  But this only makes
> sense (as opposed to being a waste of time) if there's a commitment to
> turn the conversation into proper documentation.  Do we have that from
> anyone?

I suppose Carlos and Siddhesh will do that then.  I have no plans to
work on TLS specifically, but will help with checking that the
concurrency bits are sound (including how they get documented).

> > If you intend to work on patches affecting synchronization in the
> > future, please remember the feedback you got in this patch.
> 
> I'd rather try to avoid it.  It's just too hard for you and me to
> communicate in this field, and I assume it's as frustrating for you as
> it is for me :-(
> 
> >> We'd already determined the release fence was needed and taken care of
> >> by other means.
> 
> > Huh?
> 
> rtld, which loads and defines the variable and then releases the rtld
> lock (and thus the memory), happens-before any well-defined use of the
> variable (without some happens-before, how could it safely get hold of
> the relocations used in the TLS access model?), even if that use doesn't
> take the rtld lock, as covered in other messages upthread.

To avoid confusion, I assume you are referring to a pattern like this:
Thread 1:
  data = 1;
  flag = 1;
  unlock;

Thread 2:
  if (flag) foo = data;

POSIX doesn't guarantee anything for non-data-race-free plain memory
accesses, and this has data races.
Even if we assumed this weren't the case, and the unlock magically made
all prior modifications visible globally, there's still no guarantee
that the load of flag happens after the unlock.

You could make Thread 1 do this (and assume that unlock is a release
fence, and all plain memory accesses are indeed atomic):
  data = 1;
  unlock;
  flag = 1;

But then you'd still need something at the observer side, to enforce
that the load of data won't be performed before the load of flag.
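With explicit atomics, a version of the pattern that does provide the
required ordering would be (again just a sketch, with the same
placeholder variable names):

Thread 1:
  data = 1;
  atomic_store_explicit (&flag, 1, memory_order_release);

Thread 2:
  if (atomic_load_explicit (&flag, memory_order_acquire))
    foo = data;   /* guaranteed to read the value 1 */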

If the necessary happens-before is ensured through other sync, then
we're good though.

> (That's a case
> not to be confused with the subsequent dlopen that processes relocations
> referencing the variable and assigns it to static TLS, described in this
> message.)
> 
> >> > Also note that an unlock operation on a lock is *not* guaranteed to have
> >> > the same effects as an atomic_thread_fence_release.
> >> 
> >> *nod*
> >> 
> >> > The hardware may or may not treat unlocks (ie,
> >> > release stores on a *particular* memory location)
> >> 
> >> It is my understanding that, per POSIX, unlocks ought to release all
> >> prior stores made by the thread, and locks must acquire all
> >> previously-released stores by any thread.  I.e., they don't apply to
> >> single memory locations.  Locks don't even know what particular memory
> >> locations they guard, if any.  So I can't make sense of what distinction
> >> you intend to point out above.
> 
> > We implement POSIX for users of glibc, but we do not implement on top of
> > POSIX inside of glibc
> 
> Is this documented anywhere?

https://sourceware.org/glibc/wiki/Concurrency
says that we're using the C11 memory model.

> If we're breaking well-set expectations
> WRT locks, we'd better document that.

First of all, the POSIX semantics are not sufficient for us, because
they don't define atomics and also don't try to give programs with data
races some well-defined meaning (if they tried, they'd fail because it
would break plenty of compiler optimizations that are in common use in
practice).

Second, we don't implement the stronger "synchronizes memory" semantics
POSIX seems to want to have.  That's not something new.

> > In C11, there's a distinction between a release-MO fence and a mutex
> > unlock operation (i.e., a release-MO store).
> > Does that answer your question?
> 
> I don't think I asked a question, but no, it doesn't help me understand
> what you meant by "unlocks (ie release stores on a particular memory
> location)".  In my understanding, unlocks are not about particular
> memory locations, so it comes across as a contradiction and your attempt
> to answer some unstated underlying question does nothing to clarify it.

See C11, 5.1.2.4p6:
For example, a call that acquires a mutex will perform an acquire
operation on the locations composing the mutex. Correspondingly, a call
that releases the same mutex will perform a release operation on those
same locations.

Thus, there will be synchronizes-with (and thus happens-before) edges
between a release of a particular mutex instance and a subsequent
acquisition of the *same* mutex instance.  This creates a total order
on each particular mutex instance, but no happens-before is enforced
with unrelated release or acquire operations (e.g., on different mutex
instances, atomics, ...).
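For example (a sketch), releasing one mutex and then acquiring a
different one creates no synchronizes-with edge, so the accesses to x
below still form a data race:

Thread 1:
  pthread_mutex_lock (&A);
  x = 1;
  pthread_mutex_unlock (&A);   /* release operation on A only */

Thread 2:
  pthread_mutex_lock (&B);     /* acquire operation on B; no edge to A */
  r = x;                       /* data race with the store in Thread 1 */
  pthread_mutex_unlock (&B);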

This makes sense because it allows hardware and compilers to not have to
enforce global ordering.  And it aligns well with the use cases mutexes
are built for, namely critical sections and mutual exclusion.  Mutexes
do not interact with atomics sequenced before or after the mutex
release/acquire in the same way that a fence would interact with such
atomics.



