This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix



On 24/04/15 18:37, Torvald Riegel wrote:
> On Thu, 2015-04-23 at 13:08 +0100, Szabolcs Nagy wrote: 
>> On 22/04/15 18:14, Szabolcs Nagy wrote:
> 
>> the td->entry can be in 3 states during lazy resolution:
>>
>> * init: retries the call of entry until caller==entry
>>   (using a double-checked locking mechanism, then it
>>   - grabs GL(dl_load_lock)
>>   - changes the entry to the hold state
>>   - does the resolver work
>>   - changes arg and entry to the final value
>>   - releases the lock to wake the waiters
>>   - calls the new entry)
>> * hold: waits on the GL(dl_load_lock)
>>   (then retries calling the entry when it's woken up)
>> * final state: (static, dynamic or undefweak callback)
>>   calculates the tls offset based on arg
>>   (currently without any locks which is bad on weak
>>   memory models)
> 
> Could you point me to (or confirm that) a new load of td->entry happens
> if caller != td->entry?  In the asm bits you posted so far, it seems
> that if the call to *td->entry returns, actual TLS data is loaded.  For
> example, you posted this:

if td->entry is in the init state it will do the retry
(it ends up in _dl_tlsdesc_resolve_early_return_p)

> 
>   ldr x1, [x0]    // load td->entry
>   blr [x0]        // call it
> 

this is supposed to be 'blr x1', but you got the meaning right

> entryfunc:
>   ldr x0, [x0,#8] // load td->arg
> 
> 
> I'm asking because in tlsdesctab.h we have:
> 
> static int
> _dl_tlsdesc_resolve_early_return_p (struct tlsdesc volatile *td, void *caller)
> {
>   if (caller != td->entry)
>     return 1;
> 
>   __rtld_lock_lock_recursive (GL(dl_load_lock));
>   if (caller != td->entry)
>     {
>       __rtld_lock_unlock_recursive (GL(dl_load_lock));
>       return 1;
> ....
> 
> If _dl_tlsdesc_resolve_early_return_p returns 1,
> _dl_tlsdesc_resolve_rela_fixup just returns.  If the next thing we do is
> load td->arg, we need another acquire-MO load for td->entry in

no, if _dl_tlsdesc_resolve_rela_fixup returns, control ends up
in _dl_tlsdesc_resolve_rela (asm code), which loads td->entry again
and calls the entry again (this is the retry loop)

(so when the resolver has updated the entry, the new entry is
actually called, and in the early-return case there is a retry)

> _dl_tlsdesc_resolve_early_return_p.  A relaxed-MO load (or the current
> non-atomic access, ignoring DRF) would only be sufficient if all that
> caller != atomic_load_relaxed (&td->entry) leads to is further busy
> waiting (eg, calling *td->entry again).  But if we really expect
> caller != td->entry to have meaning and tell us something about other
> state, we need to have another barrier there to not have a typical
> double-checked-locking bug.
> 

this is why i said you need atomics for accessing td->entry

but this is generic code and not related to the aarch64 bug,
so i'd prefer that any such cleanup be done independently

> (This shows another benefit of using atomics for all concurrent
> accesses: It forces us to make a conscious decision about the memory
> orders, and document why.  If a relaxed-MO access is sufficient, it
> should better have a clear explanation...)
> 
>>
>> the code for tls access is generated by the compiler so
>> there is no atomics there: the loaded entry can be in
>> init, hold or final state, the guarantee about the state
>> of arg must come from the target arch memory model.
> 
> I understood there are no C11 atomics used in the asm implementation
> bits.  I'm thinking about abstract the synchronization scheme we want to
> use in the implementation, represented through C11 atomics
> synchronization; the asm would just do something equivalent (see my
> comments about the atomics implementation being effectively part of the
> ABI).
> 

ok

>> and to answer my earlier question about the hold state:
>> it is needed to avoid the spinning in the double-checked
>> locking in init.
>>
>>>> It also looks as if the x86_64 variant of tlsdesc.c is, before your
>>>> changes and ignoring some additional comments, very similar to the
>>>> aarch64 variant.  Can we get one proper tlsdesc.c (data-race-free and
>>>> using the C11 model) that can be used on several archs?  This would
>>>> certainly decrease maintenance overhead.
>>>>
>>>
>>> should be possible i think: these functions are called
>>> from asm to do the lazy resolution
>>>
>>> but i have to check the arch specific details.
>>
>> looking at this, but it seems to be significant work:
>>
>> * i386 and arm determine the caller differently
>> * i386 uses special calling convention from asm
>> * arm handles local symbols specially
> 
> Okay.  But do they still use the same abstract synchronization scheme,
> and just differ in a few aspects?  Or are the synchronization schemes
> significantly different?
> 

the synchronization scheme is the same

but if the calling convention is non-C (regparm) then you
cannot use the same generic C code (without significant
changes somewhere).

>> i think the way to move forward is
> 
> I think step 1 should be to document the synchronization scheme on an
> abstract, algorithmic level.  Using C11 atomics in this pseudo-code.  We
> want to document the abstract level because it's typically much easier
> to reason about the abstraction than the concrete implementation in case
> of concurrent code, and because confirming that an implementation
> matches the abstraction is typically also much easier than inferring the
> abstraction from a concrete implementation.
> 

i think the documentation is the tlsdesc abi spec
(the synchronization requirements can be derived from that)

i think we need to fix the bug instead of writing more documents

(you can build the theories later; it will inevitably cause a lot
of changes because the lazy dl code is far from perfect)

> So, for example, if we use a kind of double-checked locking here,
> document that, show the pseudo code with C11 atomics, and then refer
> back to this when documenting why certain memory orders on atomic
> operations (or asm instructions) are sufficient and equivalent to the
> abstract algorithm.  In our case, this would mean saying that, for
> example, we need load Y to use acquire MO so that it synchronizes with
> release-MO store X, making sure that a load B that is sequenced after X
> reads from store A (which is a core part of double-checked locking, so
> could also be done at the abstract algorithm docs and then just referred
> back to from the concrete implementation code).
> 
>> * fix the correctness bug now on aarch64
> 
> OK.  When doing that, check and document equivalence to the abstract
> synchronization scheme.  When you do that, please use atomic_* for all
> synchronization that you add or change.
> It might be easiest to also change all atomic accesses that you checked
> and documented to using atomic_*; this isn't much work compared to the
> checking, and it becomes obvious what has been checked.
> 
>> * decide if lazy static tls resolver is worth it
>> * do the refactoring so tlsdesc is common code
>> * use c11 atomics
> 
> I'd prefer C11 atomics to be used right away when fixing things;
> however, if you mean transforming and checking all other TLS code
> unrelated to the bug you're looking at, then I agree that this can be
> done incrementally and after you fixed this bug.
> 

ok, i'll use an atomic_store on td->entry instead of __sync_synchronize

>> the fix for arm is a lot harder (because atomics are
>> different on different versions of arm, an ifunc based
>> dispatch could in principle solve the dmb vs no-dmb
>> issue for the lazy resolvers).
> 
> Is the synchronization used in the asm bits on the different versions of
> arm different from how C11 atomics would look like on those versions of
> arm?  Or are we just dealing with different ways to implement acquire
> loads, for example?
> 

i don't know the details of c11 atomics on arm

but i assume the emitted code depends on which cpu
you compile for (so lots of ifdefs in the asm, or
you can detect the current cpu at runtime with ifunc,
which is also ugly)

