This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix

From: Torvald Riegel <triegel at redhat dot com>
To: Szabolcs Nagy <szabolcs dot nagy at arm dot com>
Cc: "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>, Marcus Shawcroft <Marcus dot Shawcroft at arm dot com>, Ramana Radhakrishnan <Ramana dot Radhakrishnan at arm dot com>
Date: Thu, 23 Apr 2015 23:46:56 +0200
Subject: Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
Authentication-results: sourceware.org; auth=none
References: <553793A3 dot 7030206 at arm dot com> <1429718899 dot 6557 dot 17 dot camel at triegel dot csb> <5537D6F8 dot 1030509 at arm dot com>

On Wed, 2015-04-22 at 18:14 +0100, Szabolcs Nagy wrote:
> On 22/04/15 17:08, Torvald Riegel wrote:
> > On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
> >> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
> >> loads and stores)
> >>
> >> entryfunc:
> >>   ldar xzr,[x0]
> >>   ldr x0,[x0,#8]
> >>
> >> Option (1) is the simplest but slowest (note: this runs at every TLS
> >> access), options (2) and (3) do one extra load from [x0] (same address
> >> loads are ordered so it happens-after the load on the call site),
> >> option (2) clobbers x1, so I think (3) is the best solution (a load
> >> into the zero register has the same effect as with a normal register,
> >> but the result is discarded so nothing is clobbered. Note that the
> >> TLSDESC abi allows the descriptor function to clobber x1, but current
> >> gcc does not implement this correctly, gcc will be fixed independently,
> >> the dynamic and undefweak descriptors currently save/restore x1 so only
> >> static TLS would be affected by this clobbering issue).
> >>
> >> On the write side I used a full barrier (__sync_synchronize) although
> >>
> >>   dmb ishst
> >>
> >> would be enough, but write side is not performance critical.
> > 
> > Please get familiar with https://sourceware.org/glibc/wiki/Concurrency
> > Fixes to existing code should use the C11 memory model for concurrent
> > code, especially if the fix is about a concurrency issue.
> > 
> > I agree with Adhemerval that a release MO store seems to be sufficient
> > in this case.
> > 
> 
> yes, the write side could use c11 atomics and i agree that
> store-release is enough, but i'm not sure how much one can
> rely on c11 memory model when interacting with asm code.
> (so i thought a full barrier is the easiest to reason about)

IIUC, the asm that you need to interact with is generated by the
compiler.  The C11 atomics implementation (that is, the asm / HW
features used in it) are effectively part of the ABI; C11 translation
units need to be able to synchronize with each other even if compiled by
different compiler versions.

Therefore, making sure that the asm follows what the compiler would
generate for the intended C11 atomics equivalent is a solution.  That
also means that we can say clearly, in C11 memory model terms, what the
asm implementation on an arch should do.

> i can update the code to use glibc atomics, but i will
> have to look into how those work (are there type issues
> or are they generic?)

There are basically generic, and internally use either the __atomic
builtins for newer compilers and archs that provide them, or (fall back
to) the previous implementation based on __sync-builtins or custom asm.
One constraint is that we don't support sizes of atomic access that
aren't actually provided on the particular arch (either through HW
instructions, or with kernel helpers).  So, for example, atomic access
to a naturally aligned pointer-sized type can be expected to work.

> > I haven't scanned through the TLS code to assess how much work it would
> > be to make it data-race-free (as defined by C11), but it would certainly
> > be helpful in showing similar issues (e.g., when documenting it) and
> > preventing compiler optimizations that are legal for non-concurrent
> > accesses but not what we need for TLS (e.g., speculative loads
> > introduced by the compiler on x86 in the case you mention above).
> > 
> > Given that you have looked at the code, could you give a rough estimate
> > of how much churn it would be to make the TLS code data-race-free?
> 
> i think td->entry should only be accessed with atomics once
> the loader finished loading the dso with it.
> (i assume during dso loading concurrent access is not possible)

If that's some kind of initialization you have in mind, using nonatomics
accesses may be fine.  We don't have strict rules around that currently.
But if it's just a few lines of code, then simply using relaxed-MO
atomic accesses could be fine as well.

> (the particular case i'm trying to fix is hard because
> the access to td->entry is generated by the compiler, so
> it has to be worked around inside the entry function.
> i think x86 does not have this issue)

Do you mean that you want to affect the compiler-generated code after it
has been generated?

Follow-Ups:
- Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
  - From: Szabolcs Nagy

References:
- [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
  - From: Szabolcs Nagy
- Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
  - From: Torvald Riegel
- Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
  - From: Szabolcs Nagy

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]