
[PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix


Lazy TLSDESC initialization needs to be synchronized with concurrent TLS
accesses.

The original ARM TLSDESC design did not fully cover the race that
arises on systems with weak memory ordering guarantees:

http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt

"Storing the final value of the TLS descriptor also needs care: the
resolver field must be set to its final value after the argument gets
its final value, such that if any thread attempts to use the
descriptor before it gets its final value, it still goes to the hold
function."

The current glibc code (i386, x86_64, arm, aarch64) is

  td->arg = ...;
  td->entry = ...;

The ARM memory model allows store-store reordering, so a barrier is
needed between these two stores. (In principle the code is broken on
x86 as well: the x86 memory model does not help at the C level, since
the compiler is free to reorder non-aliasing stores.)

What is missing from the TLSDESC design spec is a description of the
ordering requirements on the read side: the compiler-generated TLS
access sequence loads the descriptor function pointer (td->entry) and
then the argument (td->arg) is loaded inside that function. These two
loads must observe the write side's stores in order: if the first load
sees the final entry, the second must see the final arg. The code in
asm:

  ldr x1, [x0]    // load td->entry
  blr x1          // call it

entryfunc:
  ldr x0, [x0,#8] // load td->arg

The branch (blr) does not provide a load-load ordering barrier, so the
second load may observe the old arg even though the first load observed
the new entry. This happens on existing AArch64 hardware and causes
invalid memory accesses due to the incorrect TLS offset.
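In C11-atomics terms, the entry load needs acquire semantics (or an
equivalent barrier or dependency) so that the arg load in the called
function cannot observe a stale value. A minimal sketch, using the same
hypothetical simplified descriptor layout as above:

  #include <stdatomic.h>
  #include <stdint.h>

  struct tlsdesc { atomic_uintptr_t entry; atomic_uintptr_t arg; };
  typedef uintptr_t (*tlsdesc_fn) (struct tlsdesc *);

  static uintptr_t
  tls_access (struct tlsdesc *td)
  {
    /* Acquire: the td->arg load inside the called function cannot be
       satisfied before this entry load.  */
    tlsdesc_fn fn = (tlsdesc_fn) atomic_load_explicit
      (&td->entry, memory_order_acquire);
    return fn (td);
  }

On AArch64 an acquire load maps to the ldar instruction, which is where
the alternatives below come from.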

Various alternatives were considered to force the load ordering in the
descriptor function:

(1) barrier

entryfunc:
  dmb ishld
  ldr x0, [x0,#8]

(2) address dependency (if the address of the second load depends on
the result of the first one, the ordering is guaranteed):

entryfunc:
  ldr x1,[x0]
  and x1,x1,#8
  orr x1,x1,#8
  ldr x0,[x0,x1]

(3) load-acquire (an ARMv8 instruction that is ordered before
subsequent loads and stores):

entryfunc:
  ldar xzr,[x0]
  ldr x0,[x0,#8]

Option (1) is the simplest but slowest (note: this code runs at every
TLS access). Options (2) and (3) do one extra load from [x0]
(same-address loads are ordered, so it happens-after the load at the
call site). Option (2) clobbers x1, so I think (3) is the best
solution: a load into the zero register has the same ordering effect as
a load into a normal register, but the result is discarded, so nothing
is clobbered. (Note that the TLSDESC ABI allows the descriptor function
to clobber x1, but current gcc does not implement this correctly; gcc
will be fixed independently. The dynamic and undefweak descriptors
currently save/restore x1, so only static TLS would be affected by this
clobbering issue.)
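As an aside, the address dependency in option (2) is what C11 models as
memory_order_consume. A hypothetical sketch, illustrative only: current
compilers promote consume to acquire, and the manufactured offset
assumes the 64-bit layout of the simplified struct above, where arg
sits at offset 8:

  #include <stdatomic.h>
  #include <stdint.h>

  struct tlsdesc { atomic_uintptr_t entry; atomic_uintptr_t arg; };

  static uintptr_t
  load_arg_dependent (struct tlsdesc *td)
  {
    uintptr_t e = atomic_load_explicit (&td->entry,
                                        memory_order_consume);
    /* Always 8 (the offset of arg on LP64), but data-dependent on e,
       mirroring the and/orr trick above; hardware that honours
       address dependencies then orders the two loads.  */
    uintptr_t off = (e & 8) | 8;
    atomic_uintptr_t *argp = (atomic_uintptr_t *) ((char *) td + off);
    return atomic_load_explicit (argp, memory_order_relaxed);
  }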

On the write side I used a full barrier (__sync_synchronize), although
a store-store barrier

  dmb ishst

would be enough; the write side is not performance critical.

I introduced a new _dl_tlsdesc_return_lazy descriptor function for
lazily relocated static TLS, so the non-lazy code path is not affected.
The dynamic and undefweak descriptors do enough work that the
additional ldar should not have a performance impact.

Other thoughts:

- Lazy binding for static TLS may be unnecessary complexity: it seems
that gcc generates at most two static TLSDESC relocation entries per
translation unit (one for zero-initialized and one for non-zero
initialized TLS), so a lot of object files with static TLS would have
to be linked into a shared object before the load-time relocation
overhead became significant. Unless there is evidence that lazy static
TLS relocation makes sense, I would suggest removing that logic (from
all archs). (The dynamic case is probably a similar micro-optimization,
but there the lazy and non-lazy semantics differ for undefined symbols,
and some applications may depend on that.)

- 32-bit ARM with -mtls-dialect=gnu2 is affected too; it will have to
use option (1) or (2), with an additional twist: some existing ARMv7
CPUs don't implement same-address load ordering reliably, see:
http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf

- I don't understand the role of the hold function in the general
TLSDESC design: why is it not enough to let other threads wait on the
global dl lock in the initial resolver function? Is the global dl lock
implemented as a spin lock? Is it there for some liveness/fairness
property?

- There are some incorrect uses of "cfi_adjust_cfa_offset" in
dl-tlsdesc.S; that will be addressed in a separate patch.

ChangeLog:

2015-04-22  Szabolcs Nagy  <szabolcs.nagy@arm.com>

	[BZ #18034]
	* sysdeps/aarch64/dl-tlsdesc.h (_dl_tlsdesc_return_lazy): Declare.
	* sysdeps/aarch64/dl-tlsdesc.S (_dl_tlsdesc_return_lazy): Define.
	(_dl_tlsdesc_undefweak): Guarantee load-load ordering.
	(_dl_tlsdesc_dynamic): Likewise.
	* sysdeps/aarch64/tlsdesc.c (_dl_tlsdesc_resolve_rela_fixup): Add
	barrier between stores.

diff --git a/sysdeps/aarch64/dl-tlsdesc.S b/sysdeps/aarch64/dl-tlsdesc.S
index be9b9b3..ff74b52 100644
--- a/sysdeps/aarch64/dl-tlsdesc.S
+++ b/sysdeps/aarch64/dl-tlsdesc.S
@@ -79,6 +79,25 @@ _dl_tlsdesc_return:
 	cfi_endproc
 	.size	_dl_tlsdesc_return, .-_dl_tlsdesc_return
 
+	/* Same as _dl_tlsdesc_return but with synchronization for
+	   lazy relocation.
+	   Prototype:
+	   _dl_tlsdesc_return_lazy (tlsdesc *) ;
+	 */
+	.hidden _dl_tlsdesc_return_lazy
+	.global	_dl_tlsdesc_return_lazy
+	.type	_dl_tlsdesc_return_lazy,%function
+	cfi_startproc
+	.align 2
+_dl_tlsdesc_return_lazy:
+	/* The ldar guarantees ordering between the load from [x0] at the
+	   call site and the load from [x0,#8] here for lazy relocation. */
+	ldar	xzr, [x0]
+	ldr	x0, [x0, #8]
+	RET
+	cfi_endproc
+	.size	_dl_tlsdesc_return_lazy, .-_dl_tlsdesc_return_lazy
+
 	/* Handler for undefined weak TLS symbols.
 	   Prototype:
 	   _dl_tlsdesc_undefweak (tlsdesc *);
@@ -96,6 +115,9 @@ _dl_tlsdesc_return:
 _dl_tlsdesc_undefweak:
 	str	x1, [sp, #-16]!
 	cfi_adjust_cfa_offset(16)
+	/* The ldar guarantees ordering between the load from [x0] at the
+	   call site and the load from [x0,#8] here for lazy relocation. */
+	ldar	xzr, [x0]
 	ldr	x0, [x0, #8]
 	mrs	x1, tpidr_el0
 	sub	x0, x0, x1
@@ -152,6 +174,9 @@ _dl_tlsdesc_dynamic:
 	stp	x3,  x4, [sp, #32+16*1]
 
 	mrs	x4, tpidr_el0
+	/* The ldar guarantees ordering between the load from [x0] at the
+	   call site and the load from [x0,#8] here for lazy relocation. */
+	ldar	xzr, [x0]
 	ldr	x1, [x0,#8]
 	ldr	x0, [x4]
 	ldr	x3, [x1,#16]
diff --git a/sysdeps/aarch64/dl-tlsdesc.h b/sysdeps/aarch64/dl-tlsdesc.h
index 7a1285e..e6c0078 100644
--- a/sysdeps/aarch64/dl-tlsdesc.h
+++ b/sysdeps/aarch64/dl-tlsdesc.h
@@ -46,6 +46,9 @@ extern ptrdiff_t attribute_hidden
 _dl_tlsdesc_return (struct tlsdesc *);
 
 extern ptrdiff_t attribute_hidden
+_dl_tlsdesc_return_lazy (struct tlsdesc *);
+
+extern ptrdiff_t attribute_hidden
 _dl_tlsdesc_undefweak (struct tlsdesc *);
 
 extern ptrdiff_t attribute_hidden
diff --git a/sysdeps/aarch64/tlsdesc.c b/sysdeps/aarch64/tlsdesc.c
index 4821f8c..f738cc6 100644
--- a/sysdeps/aarch64/tlsdesc.c
+++ b/sysdeps/aarch64/tlsdesc.c
@@ -87,6 +87,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
   if (!sym)
     {
       td->arg = (void*) reloc->r_addend;
+      /* Barrier so readers see the write above before the one below. */
+      __sync_synchronize ();
       td->entry = _dl_tlsdesc_undefweak;
     }
   else
@@ -98,6 +100,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
 	{
 	  td->arg = _dl_make_tlsdesc_dynamic (result, sym->st_value
 					      + reloc->r_addend);
+	  /* Barrier so readers see the write above before the one below. */
+	  __sync_synchronize ();
 	  td->entry = _dl_tlsdesc_dynamic;
 	}
       else
@@ -105,7 +109,9 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
 	{
 	  td->arg = (void*) (sym->st_value + result->l_tls_offset
 			     + reloc->r_addend);
-	  td->entry = _dl_tlsdesc_return;
+	  /* Barrier so readers see the write above before the one below. */
+	  __sync_synchronize ();
+	  td->entry = _dl_tlsdesc_return_lazy;
 	}
     }
 



