This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix


On 22/04/15 17:08, Torvald Riegel wrote:
> On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
>> (2) address dependency (if the address of the second load depends on the
>> result of the first one the ordering is guaranteed):
>>
>> entryfunc:
>>   ldr x1,[x0]
>>   and x1,x1,#8
>>   orr x1,x1,#8
>>   ldr x0,[x0,x1]
> 
> The address dependencies could be useful in terms of performance, but
> the disadvantage that I see are that there's currently no useful way to
> use them in C code (C11's memory_order_consume is broken).
> 

this must be in asm so i don't think it matters if consume
is broken in c11.

(tlsdesc resolver cannot clobber registers).

>> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
>> loads and stores)
>>
>> entryfunc:
>>   ldar xzr,[x0]
>>   ldr x0,[x0,#8]
>>
>> Option (1) is the simplest but slowest (note: this runs at every TLS
>> access), options (2) and (3) do one extra load from [x0] (same address
>> loads are ordered so it happens-after the load on the call site),
>> option (2) clobbers x1, so I think (3) is the best solution (a load
>> into the zero register has the same effect as with a normal register,
>> but the result is discarded so nothing is clobbered. Note that the
>> TLSDESC abi allows the descriptor function to clobber x1, but current
>> gcc does not implement this correctly, gcc will be fixed independently,
>> the dynamic and undefweak descriptors currently save/restore x1 so only
>> static TLS would be affected by this clobbering issue).
>>
>> On the write side I used a full barrier (__sync_synchronize) although
>>
>>   dmb ishst
>>
>> would be enough, but write side is not performance critical.
> 
> Please get familiar with https://sourceware.org/glibc/wiki/Concurrency
> Fixes to existing code should use the C11 memory model for concurrent
> code, especially if the fix is about a concurrency issue.
> 
> I agree with Adhemerval that a release MO store seems to be sufficient
> in this case.
> 

yes, the write side could use c11 atomics and i agree that
store-release is enough, but i'm not sure how much one can
rely on c11 memory model when interacting with asm code.
(so i thought a full barrier is the easiest to reason about)

i can update the code to use glibc atomics, but i will
have to look into how those work (are there type issues
or are they generic?)

> I haven't scanned through the TLS code to assess how much work it would
> be to make it data-race-free (as defined by C11), but it would certainly
> be helpful in showing similar issues (e.g., when documenting it) and
> preventing compiler optimizations that are legal for non-concurrent
> accesses but not what we need for TLS (e.g., speculative loads
> introduced by the compiler on x86 in the case you mention above).
> 
> Given that you have looked at the code, could you give a rough estimate
> of how much churn it would be to make the TLS code data-race-free?

i think td->entry should only be accessed with atomics once
the loader finished loading the dso with it.
(i assume during dso loading concurrent access is not possible)

(the particular case i'm trying to fix is hard because
the access to td->entry is generated by the compiler, so
it has to be worked around inside the entry function.
i think x86 does not have this issue)

i think these function might need some attention:

elf/tlsdeschtab.h:
_dl_tlsdesc_resolve_early_return_p
sysdep/{x86_64,i386,arm,aarch64}/tlsdesc.c:
_dl_tlsdesc_resolve_rela_fixup
_dl_tlsdesc_resolve_hold_fixup

but i haven't done an detailed analysis

> It also looks as if the x86_64 variant of tlsdesc.c is, before your
> changes and ignoring some additional comments, very similar to the
> aarch64 variant.  Can we get one proper tlsdesc.c (data-race-free and
> using the C11 model) that can be used on several archs?  This would
> certainly decrease maintenance overhead.
> 

should be possible i think: these functions are called
from asm to do the lazy resolution

but i have to check the arch specific details.

> I'm aware this might look like I'm requesting extra work that is not at
> the core of what you are trying to fix.  However, moving to portable
> concurrent code that is shared across several archs is something we've
> been doing elsewhere too, and in those cases ARM was on the benefiting
> side (e.g., pthread_once, ongoing work in nscd, ...).  I think overall,
> some extra work and clean-up will be a good thing for ARM archs too.
> 
> Thanks.
> 


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]