Re: [PATCH][BZ #19329] Fix race between tls allocation at thread creation and dlopen


On 13/01/16 02:34, Carlos O'Donell wrote:
> On 01/11/2016 11:32 AM, Szabolcs Nagy wrote:
>> 2016-01-11  Szabolcs Nagy  <szabolcs.nagy@arm.com>
>>
>>      [BZ #19329]
>>      * elf/dl-open.c (dl_open_worker): Write GL(dl_tls_generation)
>>      atomically.
>>      * elf/dl-tls.c (_dl_allocate_tls_init): Read GL(dl_tls_generation),
>>      GL(dl_tls_max_dtv_idx), slotinfo entries and listp->next atomically.
>>      (_dl_add_to_slotinfo): Write the slotinfo entries and listp->next
>>      atomically.
>
> You are headed in the right direction. I like where this patch is going,
> but don't have enough time to review this in detail yet.
>
> At first glance your patch lacks sufficient concurrency documentation to
> be acceptable. You need to document which acquires the releases
> synchronize with. Please grep for "CONCURRENCY NOTES" to see the level
> of detail we need to maintain these kinds of changes.

Patch v3:

I added some concurrency notes.  There are some TODO items left in
the comments.

I decided not to change the __tls_get_addr and dlclose code paths, so
compared to v2 this patch only changes comments.

Overall the patch only turns some loads/stores into atomics and
changes their order (but not in a way that can be observed without
concurrently running pthread_create and dlopen).  Two branches
are changed: if slotinfo[i].gen > GL(dl_tls_generation) then that
entry is skipped instead of aborting, and if listp->next == NULL then
the iteration is finished instead of aborting.
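
To make the new reader-side control flow easier to see, here is a
simplified, self-contained model of the walk (hypothetical struct and
function names, not the actual glibc code; the real loop is in
_dl_allocate_tls_init in the diff below):

#include <stdatomic.h>
#include <stddef.h>

struct link_map_model;

struct slotinfo_model
{
  _Atomic (struct link_map_model *) map;
  size_t gen;
};

struct slotinfo_list_model
{
  size_t len;
  _Atomic (struct slotinfo_list_model *) next;
  struct slotinfo_model slotinfo[64];
};

/* Model of the reader-side walk after the patch: entries newer than
   the generation observed at entry (gen_count) are skipped, and an
   unpublished next node ends the walk, instead of aborting as the
   pre-patch code does.  */
static size_t
count_ready_entries (struct slotinfo_list_model *listp, size_t gen_count)
{
  size_t ready = 0;
  while (listp != NULL)
    {
      for (size_t cnt = 0; cnt < listp->len; ++cnt)
	{
	  /* Pairs with the release store of .map on the dlopen side,
	     so the .gen read below sees an initialized value.  */
	  struct link_map_model *map
	    = atomic_load_explicit (&listp->slotinfo[cnt].map,
				    memory_order_acquire);
	  if (map == NULL)
	    continue;		/* Unused entry.  */
	  if (listp->slotinfo[cnt].gen > gen_count)
	    continue;		/* Concurrently added entry: skip.  */
	  ++ready;
	}
      /* Stop when the next node is not published yet, instead of
	 asserting that it exists.  */
      listp = atomic_load_explicit (&listp->next, memory_order_acquire);
    }
  return ready;
}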

These changes might not be entirely safe: dlclose is still not
synchronized with pthread_create, and there might be other bugs that
can cause a large slotinfo[i].gen or an early listp->next == NULL.
In those cases the current glibc code would abort, while the new code
will continue without noticing.

Note that the current asserts do not cover the concurrency bugs
completely: concurrent dlclose can (even with the patch) lead to
use-after-free, and concurrently added modules can lead to a read of
an uninitialized slotinfo entry where map is already non-NULL but gen
is 0 (and thus smaller than the current generation count), which can
lead to an out-of-bounds write to the DTV or other memory corruption.
(If the asserts were more helpful at protecting against memory
corruption, then I'd be reluctant to remove them before a complete
fix for the concurrency issues.)

The acquire load of every slotinfo[i].map in _dl_allocate_tls_init
might be an excessive use of atomics, but I think it is necessary
with the current design.
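
As a minimal sketch of why (hypothetical types, not the glibc
structures): the writer initializes gen before the release store of
map, so it is the acquire load of map that orders the later gen read;
a relaxed load of map would let the reader see an uninitialized gen.

#include <stdatomic.h>
#include <stddef.h>

struct link_map_model;

struct slotinfo_model
{
  _Atomic (struct link_map_model *) map;	/* NULL until published.  */
  size_t gen;
};

/* Writer side (models _dl_add_to_slotinfo): initialize .gen first,
   then publish the entry with a release store of .map.  */
static void
publish_entry (struct slotinfo_model *e, struct link_map_model *l,
	       size_t newgen)
{
  e->gen = newgen;
  atomic_store_explicit (&e->map, l, memory_order_release);
}

/* Reader side (models _dl_allocate_tls_init): the acquire load of
   .map pairs with the release store above, so if map is non-NULL the
   plain .gen read is guaranteed to see newgen.  */
static int
entry_is_ready (struct slotinfo_model *e, size_t gen_count)
{
  struct link_map_model *map
    = atomic_load_explicit (&e->map, memory_order_acquire);
  return map != NULL && e->gen <= gen_count;
}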

I'm OK if this gets rescheduled after 2.23.

2016-01-19  Szabolcs Nagy  <szabolcs.nagy@arm.com>

	[BZ #19329]
	* elf/dl-open.c (dl_open_worker): Write GL(dl_tls_generation)
	atomically.
	* elf/dl-tls.c (_dl_allocate_tls_init): Read GL(dl_tls_generation),
	GL(dl_tls_max_dtv_idx), slotinfo entries and listp->next atomically.
	(_dl_add_to_slotinfo): Write the slotinfo entries and listp->next
	atomically.
diff --git a/elf/dl-open.c b/elf/dl-open.c
index 6f178b3..2b97605 100644
--- a/elf/dl-open.c
+++ b/elf/dl-open.c
@@ -524,9 +524,16 @@ dl_open_worker (void *a)
     }
 
   /* Bump the generation number if necessary.  */
-  if (any_tls && __builtin_expect (++GL(dl_tls_generation) == 0, 0))
-    _dl_fatal_printf (N_("\
+  if (any_tls)
+    {
+      size_t newgen = GL(dl_tls_generation) + 1;
+      if (__builtin_expect (newgen == 0, 0))
+	_dl_fatal_printf (N_("\
 TLS generation counter wrapped!  Please report this."));
+      /* Synchronize with the load acquire in _dl_allocate_tls_init.
+	 See the CONCURRENCY NOTES there in dl-tls.c.  */
+      atomic_store_release (&GL(dl_tls_generation), newgen);
+    }
 
   /* We need a second pass for static tls data, because _dl_update_slotinfo
      must not be run while calls to _dl_add_to_slotinfo are still pending.  */
diff --git a/elf/dl-tls.c b/elf/dl-tls.c
index ed13fd9..7184a54 100644
--- a/elf/dl-tls.c
+++ b/elf/dl-tls.c
@@ -443,6 +443,48 @@ _dl_resize_dtv (dtv_t *dtv)
 }
 
 
+/* CONCURRENCY NOTES:
+
+   During dynamic TLS and DTV allocation and setup various objects may be
+   accessed concurrently:
+
+     GL(dl_tls_max_dtv_idx)
+     GL(dl_tls_generation)
+     listp->slotinfo[i].map
+     listp->slotinfo[i].gen
+     listp->next
+
+   where listp is a node in the GL(dl_tls_dtv_slotinfo_list) list.  The public
+   APIs that may access them are
+
+     Writers: dlopen, dlclose and dynamic linker start up code.
+     Readers only: pthread_create and __tls_get_addr (TLS access).
+
+   The writers hold the GL(dl_load_lock), but the readers don't, so atomics
+   should be used when accessing these globals.
+
+   dl_open_worker (called from dlopen) for each loaded module increases
+   GL(dl_tls_max_dtv_idx), sets the link_map of the module up, adds a new
+   slotinfo entry to GL(dl_tls_dtv_slotinfo_list) with the new link_map and
+   the next generation number GL(dl_tls_generation)+1.  Then it increases
+   GL(dl_tls_generation) which signals that the new slotinfo entries are ready.
+   This last write is release mo so previous writes can be synchronized.
+
+   GL(dl_tls_max_dtv_idx) is always an upper bound of the modids of the ready
+   entries.  The slotinfo list might be shorter than that during dlopen.
+   Entries in the slotinfo list might have gen > GL(dl_tls_generation) and
+   map == NULL.
+
+   _dl_allocate_tls_init is called from pthread_create and it looks through
+   the slotinfo list to do the dynamic TLS and DTV setup for the new thread.
+   It first loads the current GL(dl_tls_generation) with acquire mo and only
+   considers modules up to that generation ignoring any later change to the
+   slotinfo list.
+
+   TODO: Entries might get changed and freed in dlclose without sync.
+   TODO: __tls_get_addr is not yet synchronized with dlopen and dlclose.
+*/
+
 void *
 internal_function
 _dl_allocate_tls_init (void *result)
@@ -455,9 +497,18 @@ _dl_allocate_tls_init (void *result)
   struct dtv_slotinfo_list *listp;
   size_t total = 0;
   size_t maxgen = 0;
-
-  /* Check if the current dtv is big enough.   */
-  if (dtv[-1].counter < GL(dl_tls_max_dtv_idx))
+  size_t gen_count;
+  size_t dtv_slots;
+
+  /* Synchronize with the release mo store in dl_open_worker, modules with
+     larger generation number are ignored.  */
+  gen_count = atomic_load_acquire (&GL(dl_tls_generation));
+  /* Check if the current dtv is big enough.  GL(dl_tls_max_dtv_idx) is
+     concurrently modified, but after the release mo store to
+     GL(dl_tls_generation) it always remains a modid upper bound for
+     previously loaded modules so relaxed access is enough.  */
+  dtv_slots = atomic_load_relaxed (&GL(dl_tls_max_dtv_idx));
+  if (dtv[-1].counter < dtv_slots)
     {
       /* Resize the dtv.  */
       dtv = _dl_resize_dtv (dtv);
@@ -480,18 +531,25 @@ _dl_allocate_tls_init (void *result)
 	  void *dest;
 
 	  /* Check for the total number of used slots.  */
-	  if (total + cnt > GL(dl_tls_max_dtv_idx))
+	  if (total + cnt > dtv_slots)
 	    break;
 
-	  map = listp->slotinfo[cnt].map;
+	  /* Synchronize with the release mo store in _dl_add_to_slotinfo in
+	     dlopen, so the generation number read below is for a valid entry.
+	     TODO: remove_slotinfo in dlclose is not synchronized.  */
+	  map = atomic_load_acquire (&listp->slotinfo[cnt].map);
 	  if (map == NULL)
 	    /* Unused entry.  */
 	    continue;
 
+	  size_t gen = listp->slotinfo[cnt].gen;
+	  if (gen > gen_count)
+	    /* New, concurrently loaded entry.  */
+	    continue;
+
 	  /* Keep track of the maximum generation number.  This might
 	     not be the generation counter.  */
-	  assert (listp->slotinfo[cnt].gen <= GL(dl_tls_generation));
-	  maxgen = MAX (maxgen, listp->slotinfo[cnt].gen);
+	  maxgen = MAX (maxgen, gen);
 
 	  dtv[map->l_tls_modid].pointer.val = TLS_DTV_UNALLOCATED;
 	  dtv[map->l_tls_modid].pointer.is_static = false;
@@ -518,11 +576,14 @@ _dl_allocate_tls_init (void *result)
 	}
 
       total += cnt;
-      if (total >= GL(dl_tls_max_dtv_idx))
+      if (total >= dtv_slots)
 	break;
 
-      listp = listp->next;
-      assert (listp != NULL);
+      /* Synchronize with the release mo store in _dl_add_to_slotinfo
+	 so only initialized slotinfo nodes are looked at.  */
+      listp = atomic_load_acquire (&listp->next);
+      if (listp == NULL)
+	break;
     }
 
   /* The DTV version is up-to-date now.  */
@@ -916,7 +977,7 @@ _dl_add_to_slotinfo (struct link_map *l)
 	 the first slot.  */
       assert (idx == 0);
 
-      listp = prevp->next = (struct dtv_slotinfo_list *)
+      listp = (struct dtv_slotinfo_list *)
 	malloc (sizeof (struct dtv_slotinfo_list)
 		+ TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
       if (listp == NULL)
@@ -939,9 +1000,15 @@ cannot create TLS data structures"));
       listp->next = NULL;
       memset (listp->slotinfo, '\0',
 	      TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
+      /* _dl_allocate_tls_init concurrently walks this list at thread
+	 creation, it must only observe initialized nodes in the list.
+	 See the CONCURRENCY NOTES there.  */
+      atomic_store_release (&prevp->next, listp);
     }
 
   /* Add the information into the slotinfo data structure.  */
-  listp->slotinfo[idx].map = l;
   listp->slotinfo[idx].gen = GL(dl_tls_generation) + 1;
+  /* Synchronize with the acquire load in _dl_allocate_tls_init.
+     See the CONCURRENCY NOTES there.  */
+  atomic_store_release (&listp->slotinfo[idx].map, l);
 }

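For completeness, a standalone model of the protocol described in the
CONCURRENCY NOTES above: one thread plays the dlopen side (publish the
entries, then bump the generation with a release store) and another
plays the pthread_create side (acquire-load the generation, then
consume only entries up to it).  The struct, names and sizes are made
up for illustration and it is not glibc code; it should build with
e.g. cc -std=c11 -pthread.

#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

#define NSLOTS 64

struct slot
{
  _Atomic (const char *) map;	/* Non-NULL once the entry is published.  */
  size_t gen;			/* Generation that introduced the entry.  */
};

static struct slot slots[NSLOTS];
static _Atomic size_t generation;

static void *
dlopen_model (void *arg)
{
  for (size_t i = 0; i < NSLOTS; i++)
    {
      size_t newgen
	= atomic_load_explicit (&generation, memory_order_relaxed) + 1;
      /* Initialize the entry, then publish it: the release store of
	 .map pairs with the reader's acquire load, as in
	 _dl_add_to_slotinfo.  */
      slots[i].gen = newgen;
      atomic_store_explicit (&slots[i].map, "module", memory_order_release);
      /* Make the new generation visible, as in dl_open_worker.  */
      atomic_store_explicit (&generation, newgen, memory_order_release);
    }
  return arg;
}

static void *
pthread_create_model (void *arg)
{
  /* Pairs with the release store of the generation in the writer, as
     in _dl_allocate_tls_init.  */
  size_t gen_count
    = atomic_load_explicit (&generation, memory_order_acquire);
  size_t seen = 0;
  for (size_t i = 0; i < NSLOTS; i++)
    {
      const char *map
	= atomic_load_explicit (&slots[i].map, memory_order_acquire);
      if (map == NULL)
	continue;		/* Not published yet.  */
      if (slots[i].gen > gen_count)
	continue;		/* Newer than the observed generation.  */
      seen++;
    }
  printf ("observed generation %zu, consumed %zu entries\n",
	  gen_count, seen);
  return arg;
}

int
main (void)
{
  pthread_t w, r;
  pthread_create (&w, NULL, dlopen_model, NULL);
  pthread_create (&r, NULL, pthread_create_model, NULL);
  pthread_join (w, NULL);
  pthread_join (r, NULL);
  return 0;
}

The point of the model is only the ordering: a published entry's gen
is never seen uninitialized, and entries newer than the observed
generation are ignored rather than causing an abort.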