This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH][BZ #19329] Fix race between tls allocation at thread creation and dlopen


Changes in patch v2:

* The listp->next race is fixed.

* GL(dl_tls_generation) is loaded before GL(dl_tls_max_dtv_idx).
  (This matters if several dlopens happen during _dl_allocate_tls_init
  between those two loads.)

* More detailed analysis below.

I can trigger the bug on x86_64 by loading a library with >50 deps
with tls while concurrently creating >1000 threads.  This was not
turned into a glibc test case, but a reproducer is attached to the bug
report; a sketch of the idea is shown below.
(On aarch64 the failure is easier to trigger.)
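
A minimal sketch of the reproducer idea (the real test case is attached
to the bug report; the library name below is a placeholder for a library
with many tls-using dependencies):

/* Build with -lpthread -ldl.  One thread dlopens the library while the
   main thread keeps creating threads, so _dl_allocate_tls_init runs
   concurrently with the dlopen.  */
#include <dlfcn.h>
#include <pthread.h>
#include <stdio.h>

static void *
dummy_thread (void *arg)
{
  return arg;
}

static void *
do_dlopen (void *arg)
{
  void *h = dlopen ("libmanytlsdeps.so", RTLD_NOW);	/* Placeholder name.  */
  if (h == NULL)
    fprintf (stderr, "dlopen: %s\n", dlerror ());
  return h;
}

int
main (void)
{
  pthread_t opener;
  pthread_create (&opener, NULL, do_dlopen, NULL);

  /* Each pthread_create sets up the new thread's dtv and tls without
     holding any lock.  */
  for (int i = 0; i < 1000; i++)
    {
      pthread_t t;
      if (pthread_create (&t, NULL, dummy_thread, NULL) == 0)
	pthread_detach (t);
    }

  pthread_join (opener, NULL);
  return 0;
}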


The following failures can be triggered:

Inconsistency detected by ld.so: dl-tls.c: 493: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= _rtld_local._dl_tls_generation' failed!

and

Inconsistency detected by ld.so: dl-tls.c: 525: _dl_allocate_tls_init: Assertion `listp != NULL' failed!

dlopen modifies tls-related dynamic linker data structures while holding
the GL(dl_load_lock) if the loaded library has tls.

Meanwhile, at thread creation the same globals are accessed when the dtv
and tls are set up for the new thread in _dl_allocate_tls_init without
holding any locks.

At least the following global objects may have conflicting access:

  GL(dl_tls_max_dtv_idx)
  GL(dl_tls_generation)
  listp->slotinfo[i].map
  listp->slotinfo[i].gen
  listp->next

where listp points to a node in GL(dl_tls_dtv_slotinfo_list).
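
For reference, the slotinfo list nodes have roughly the following layout
(a simplified sketch of struct dtv_slotinfo_list from
sysdeps/generic/ldsodefs.h):

#include <stddef.h>

struct link_map;		/* Defined elsewhere in the dynamic linker.  */

/* Each node holds a fixed-size array of (generation, link map) pairs
   indexed by module id; nodes are chained through the next pointer.  */
struct dtv_slotinfo
{
  size_t gen;			/* Generation in which the module was added.  */
  struct link_map *map;		/* Link map of the module, NULL if unused.  */
};

struct dtv_slotinfo_list
{
  size_t len;			/* Number of slotinfo entries in this node.  */
  struct dtv_slotinfo_list *next;
  struct dtv_slotinfo slotinfo[];
};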

The race window for the first assert failure above is short compared to
dlopen and thread creation, so it rarely happens, but the probability
can be increased by loading a library with many tls-using dependencies.
If those dependencies don't fit into a single listp->slotinfo array,
the second assert failure starts to happen as well.

The current sequence of events is:

dlopen:

  The modids of the loaded library and its dependencies are set one by
  one to ++GL(dl_tls_max_dtv_idx) (assuming they have tls).

  Then these modules are added to GL(dl_tls_dtv_slotinfo_list) one by
  one with the same generation number: GL(dl_tls_generation)+1.

  Then GL(dl_tls_generation) is incremented.

pthread_create:

  The dtv for the thread is allocated assuming GL(dl_tls_max_dtv_idx) is
  the max modid that will be added to the dtv here.

  The GL(dl_tls_dtv_slotinfo_list) is walked to initialize dtv[modid]
  for all modules in the list and initialize the related tls.

  At the end dtv[0].counter is set to the max generation number seen;
  all modules with a less-than-or-equal generation number must have
  their dtv and tls initialized in this thread.
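
For reference, a dtv entry has roughly the following layout (simplified
sketch matching the dtv[-1].counter, dtv[0].counter and
dtv[modid].pointer uses above and in the patch):

#include <stdbool.h>
#include <stddef.h>

/* dtv[-1].counter holds the number of allocated entries, dtv[0].counter
   holds the generation the dtv is up to date with, and dtv[modid].pointer
   describes the tls block of the module with that modid.  */
typedef union dtv
{
  size_t counter;
  struct
  {
    void *val;			/* TLS_DTV_UNALLOCATED or the tls block.  */
    bool is_static;		/* True if the block is in static tls.  */
  } pointer;
} dtv_t;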

The patch uses atomics to add modules to the slotinfo list and to
increase GL(dl_tls_generation) during dlopen, and similarly uses atomics
to safely walk the slotinfo list in pthread_create.  The dtv and tls are
initialized for all modules up to the observed GL(dl_tls_generation);
modules with a larger generation number (concurrently loaded modules)
are ignored.


Further issues:

dlclose also modifies the slotinfo list in unsafe ways and I don't
immediately see a lock-free way to synchronize that with thread
creation.  (Using the GL(dl_load_lock) at thread creation does not work
because it can deadlock if pthread_create is called from a ctor while
dlopen holds the lock.)  I.e. this patch assumes dlclose is not called
concurrently with pthread_create.

Two (incorrect) assertions were removed from the code and I don't see
an easy way to do similar runtime checks to guard against such races or
memory corruption.

I did not review all accesses to the problematic globals, so there are
most likely other problems (e.g. GL(dl_tls_max_dtv_idx) is modified
without atomics, tls access (__tls_get_addr) does not use atomics to
access the same globals, etc.).

Glibc does not do worst-case tls allocation for existing threads
whenever a library is loaded, so allocation might be needed at the
first access to a tls object, which makes that access non-AS-safe and
can crash on oom.  The dtv and tls of the modules that are ignored in
this patch will be lazily initialized.


ChangeLog:

2016-01-11  Szabolcs Nagy  <szabolcs.nagy@arm.com>

	[BZ #19329]
	* elf/dl-open.c (dl_open_worker): Write GL(dl_tls_generation)
	atomically.
	* elf/dl-tls.c (_dl_allocate_tls_init): Read GL(dl_tls_generation),
	GL(dl_tls_max_dtv_idx), slotinfo entries and listp->next atomically.
	(_dl_add_to_slotinfo): Write the slotinfo entries and listp->next
	atomically.
diff --git a/elf/dl-open.c b/elf/dl-open.c
index 6f178b3..7315c3a 100644
--- a/elf/dl-open.c
+++ b/elf/dl-open.c
@@ -524,9 +524,15 @@ dl_open_worker (void *a)
     }
 
   /* Bump the generation number if necessary.  */
-  if (any_tls && __builtin_expect (++GL(dl_tls_generation) == 0, 0))
-    _dl_fatal_printf (N_("\
+  if (any_tls)
+    {
+      size_t newgen = GL(dl_tls_generation) + 1;
+      if (__builtin_expect (newgen == 0, 0))
+	_dl_fatal_printf (N_("\
 TLS generation counter wrapped!  Please report this."));
+      /* Synchronize with the load acquire in _dl_allocate_tls_init.  */
+      atomic_store_release (&GL(dl_tls_generation), newgen);
+    }
 
   /* We need a second pass for static tls data, because _dl_update_slotinfo
      must not be run while calls to _dl_add_to_slotinfo are still pending.  */
diff --git a/elf/dl-tls.c b/elf/dl-tls.c
index ed13fd9..0afbcab 100644
--- a/elf/dl-tls.c
+++ b/elf/dl-tls.c
@@ -455,9 +455,15 @@ _dl_allocate_tls_init (void *result)
   struct dtv_slotinfo_list *listp;
   size_t total = 0;
   size_t maxgen = 0;
+  size_t gen_count;
+  size_t dtv_slots;
 
+  /* Synchronize with dl_open_worker, concurrently loaded modules with
+     larger generation number should be ignored here.  */
+  gen_count = atomic_load_acquire (&GL(dl_tls_generation));
   /* Check if the current dtv is big enough.   */
-  if (dtv[-1].counter < GL(dl_tls_max_dtv_idx))
+  dtv_slots = atomic_load_relaxed (&GL(dl_tls_max_dtv_idx));
+  if (dtv[-1].counter < dtv_slots)
     {
       /* Resize the dtv.  */
       dtv = _dl_resize_dtv (dtv);
@@ -480,18 +486,23 @@ _dl_allocate_tls_init (void *result)
 	  void *dest;
 
 	  /* Check for the total number of used slots.  */
-	  if (total + cnt > GL(dl_tls_max_dtv_idx))
+	  if (total + cnt > dtv_slots)
 	    break;
 
-	  map = listp->slotinfo[cnt].map;
+	  /* Synchronize with _dl_add_to_slotinfo.  */
+	  map = atomic_load_acquire (&listp->slotinfo[cnt].map);
 	  if (map == NULL)
 	    /* Unused entry.  */
 	    continue;
 
+	  size_t gen = listp->slotinfo[cnt].gen;
+	  if (gen > gen_count)
+	    /* New, concurrently loaded entry.  */
+	    continue;
+
 	  /* Keep track of the maximum generation number.  This might
 	     not be the generation counter.  */
-	  assert (listp->slotinfo[cnt].gen <= GL(dl_tls_generation));
-	  maxgen = MAX (maxgen, listp->slotinfo[cnt].gen);
+	  maxgen = MAX (maxgen, gen);
 
 	  dtv[map->l_tls_modid].pointer.val = TLS_DTV_UNALLOCATED;
 	  dtv[map->l_tls_modid].pointer.is_static = false;
@@ -518,11 +529,13 @@ _dl_allocate_tls_init (void *result)
 	}
 
       total += cnt;
-      if (total >= GL(dl_tls_max_dtv_idx))
+      if (total >= dtv_slots)
 	break;
 
-      listp = listp->next;
-      assert (listp != NULL);
+      /* Synchronize with _dl_add_to_slotinfo.  */
+      listp = atomic_load_acquire (&listp->next);
+      if (listp == NULL)
+	break;
     }
 
   /* The DTV version is up-to-date now.  */
@@ -916,7 +929,7 @@ _dl_add_to_slotinfo (struct link_map *l)
 	 the first slot.  */
       assert (idx == 0);
 
-      listp = prevp->next = (struct dtv_slotinfo_list *)
+      listp = (struct dtv_slotinfo_list *)
 	malloc (sizeof (struct dtv_slotinfo_list)
 		+ TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
       if (listp == NULL)
@@ -939,9 +952,13 @@ cannot create TLS data structures"));
       listp->next = NULL;
       memset (listp->slotinfo, '\0',
 	      TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
+      /* _dl_allocate_tls_init concurrently walks this list at thread
+	 creation, it must only observe initialized nodes in the list.  */
+      atomic_store_release (&prevp->next, listp);
     }
 
   /* Add the information into the slotinfo data structure.  */
-  listp->slotinfo[idx].map = l;
   listp->slotinfo[idx].gen = GL(dl_tls_generation) + 1;
+  /* Synchronize with the acquire load in _dl_allocate_tls_init.  */
+  atomic_store_release (&listp->slotinfo[idx].map, l);
 }
