This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
- From: Szabolcs Nagy <szabolcs dot nagy at arm dot com>
- To: Torvald Riegel <triegel at redhat dot com>
- Cc: "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>, Marcus Shawcroft <Marcus dot Shawcroft at arm dot com>, Ramana Radhakrishnan <Ramana dot Radhakrishnan at arm dot com>
- Date: Wed, 22 Apr 2015 18:14:32 +0100
- Subject: Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
- Authentication-results: sourceware.org; auth=none
- References: <553793A3 dot 7030206 at arm dot com> <1429718899 dot 6557 dot 17 dot camel at triegel dot csb>
On 22/04/15 17:08, Torvald Riegel wrote:
> On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
>> (2) address dependency (if the address of the second load depends on the
>> result of the first one the ordering is guaranteed):
>>
>> entryfunc:
>> 	ldr	x1,[x0]
>> 	and	x1,x1,#8
>> 	orr	x1,x1,#8
>> 	ldr	x0,[x0,x1]
>
> The address dependencies could be useful in terms of performance, but
> the disadvantage that I see are that there's currently no useful way to
> use them in C code (C11's memory_order_consume is broken).
>
this must be in asm so i don't think it matters if consume
is broken in c11.
(tlsdesc resolver cannot clobber registers).
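for reference, option (3) above can be expressed in c11 terms (since consume is broken, compilers effectively promote it to acquire anyway). this is only an illustrative sketch -- the layout and names below are made up for the example, not glibc's actual declarations, and the real fix has to stay in asm:

```c
#include <stdatomic.h>

/* Hypothetical, simplified descriptor layout: entry at offset 0 (what the
   call site loads through x0), arg at offset 8.  */
struct tlsdesc {
    _Atomic(void *) entry;
    void *arg;
};

/* Read side, option (3): re-load entry with acquire semantics and discard
   the result -- the analogue of "ldar xzr,[x0]" -- so the subsequent plain
   load of arg cannot be speculated or reordered before it.  */
static void *entryfunc(struct tlsdesc *td)
{
    (void) atomic_load_explicit(&td->entry, memory_order_acquire);
    return td->arg;
}
```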
>> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
>> loads and stores)
>>
>> entryfunc:
>> 	ldar	xzr,[x0]
>> 	ldr	x0,[x0,#8]
>>
>> Option (1) is the simplest but slowest (note: this runs at every TLS
>> access), options (2) and (3) do one extra load from [x0] (same address
>> loads are ordered so it happens-after the load on the call site),
>> option (2) clobbers x1, so I think (3) is the best solution (a load
>> into the zero register has the same effect as with a normal register,
>> but the result is discarded so nothing is clobbered. Note that the
>> TLSDESC abi allows the descriptor function to clobber x1, but current
>> gcc does not implement this correctly, gcc will be fixed independently,
>> the dynamic and undefweak descriptors currently save/restore x1 so only
>> static TLS would be affected by this clobbering issue).
>>
>> On the write side I used a full barrier (__sync_synchronize) although
>>
>> 	dmb	ishst
>>
>> would be enough, but write side is not performance critical.
>
> Please get familiar with https://sourceware.org/glibc/wiki/Concurrency
> Fixes to existing code should use the C11 memory model for concurrent
> code, especially if the fix is about a concurrency issue.
>
> I agree with Adhemerval that a release MO store seems to be sufficient
> in this case.
>
yes, the write side could use c11 atomics and i agree that
store-release is enough, but i'm not sure how much one can
rely on the c11 memory model when interacting with asm code
(so i thought a full barrier was the easiest to reason about).
i can update the code to use glibc atomics, but i will
have to look into how those work (are there type issues
or are they generic?)
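for example, the write side rewritten with a c11 store-release might look like the sketch below (names and layout are illustrative, not glibc's actual atomics or types):

```c
#include <stdatomic.h>

/* Illustrative descriptor layout; not glibc's actual declarations.  */
struct tlsdesc {
    _Atomic(void *) entry;
    void *arg;
};

/* Write side of the lazy relocation: publish arg first, then release-store
   the final entry pointer.  The release store orders the plain store of arg
   before it, so a reader that observes the new entry (via acquire or an
   address dependency) also observes the matching arg.  This would replace
   the full __sync_synchronize barrier; on AArch64 the release store
   typically compiles to stlr.  */
static void publish_tlsdesc(struct tlsdesc *td, void *final_entry, void *arg)
{
    td->arg = arg;
    atomic_store_explicit(&td->entry, final_entry, memory_order_release);
}
```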
> I haven't scanned through the TLS code to assess how much work it would
> be to make it data-race-free (as defined by C11), but it would certainly
> be helpful in showing similar issues (e.g., when documenting it) and
> preventing compiler optimizations that are legal for non-concurrent
> accesses but not what we need for TLS (e.g., speculative loads
> introduced by the compiler on x86 in the case you mention above).
>
> Given that you have looked at the code, could you give a rough estimate
> of how much churn it would be to make the TLS code data-race-free?
i think td->entry should only be accessed with atomics once
the loader has finished loading the dso that contains it.
(i assume during dso loading concurrent access is not possible)
(the particular case i'm trying to fix is hard because
the access to td->entry is generated by the compiler, so
it has to be worked around inside the entry function.
i think x86 does not have this issue)
i think these functions might need some attention:
elf/tlsdeschtab.h:
_dl_tlsdesc_resolve_early_return_p
sysdeps/{x86_64,i386,arm,aarch64}/tlsdesc.c:
_dl_tlsdesc_resolve_rela_fixup
_dl_tlsdesc_resolve_hold_fixup
but i haven't done a detailed analysis
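as a sketch of what "accessed with atomics" might mean for the early-return check (hypothetical names and layout; the real code is in elf/tlsdeschtab.h):

```c
#include <stdatomic.h>

/* Illustrative layout; glibc's real type lives in the tlsdesc headers.  */
struct tlsdesc {
    _Atomic(void *) entry;
    void *arg;
};

/* Hypothetical data-race-free variant of the check done by
   _dl_tlsdesc_resolve_early_return_p: read the current entry with an
   acquire load instead of a plain (racy) load, and report whether some
   other thread already installed a different (final) resolver.  */
static int resolve_early_return_p(struct tlsdesc *td, void *caller)
{
    return atomic_load_explicit(&td->entry, memory_order_acquire) != caller;
}
```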
> It also looks as if the x86_64 variant of tlsdesc.c is, before your
> changes and ignoring some additional comments, very similar to the
> aarch64 variant. Can we get one proper tlsdesc.c (data-race-free and
> using the C11 model) that can be used on several archs? This would
> certainly decrease maintenance overhead.
>
should be possible i think: these functions are called
from asm to do the lazy resolution,
but i have to check the arch-specific details.
> I'm aware this might look like I'm requesting extra work that is not at
> the core of what you are trying to fix. However, moving to portable
> concurrent code that is shared across several archs is something we've
> been doing elsewhere too, and in those cases ARM was on the benefiting
> side (e.g., pthread_once, ongoing work in nscd, ...). I think overall,
> some extra work and clean-up will be a good thing for ARM archs too.
>
> Thanks.
>