This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH][BZ #19329] Fix race between tls allocation at thread creation and dlopen

From: Szabolcs Nagy <szabolcs dot nagy at arm dot com>
To: Torvald Riegel <triegel at redhat dot com>
Cc: Carlos O'Donell <carlos at redhat dot com>, GNU C Library <libc-alpha at sourceware dot org>, <i dot palachev at samsung dot com>, <nd at arm dot com>
Date: Fri, 22 Jan 2016 19:31:13 +0000
Subject: Re: [PATCH][BZ #19329] Fix race between tls allocation at thread creation and dlopen
Authentication-results: sourceware.org; auth=none
Nodisclaimer: True
References: <568D5E11 dot 3010301 at arm dot com> <5693D908 dot 8090104 at arm dot com> <5695B7BF dot 7050000 at redhat dot com> <569E8050 dot 20606 at arm dot com> <1453424564 dot 4592 dot 57 dot camel at localhost dot localdomain>
Spamdiagnosticmetadata: NSPM
Spamdiagnosticoutput: 1:23

On 22/01/16 01:02, Torvald Riegel wrote:

I've just looked at your patch so far and haven't analyzed all the
affected code.  Nonetheless, a few comments below;  so far, I would


thanks

rather postpone this to 2.23 because you do change a few things and I

ok

-  if (any_tls && __builtin_expect (++GL(dl_tls_generation) == 0, 0))
-    _dl_fatal_printf (N_("\
+  if (any_tls)
+    {
+      size_t newgen = GL(dl_tls_generation) + 1;


If you access a memory location with atomics somewhere, access it with
atomics everywhere (initialization is typically an exception).  This is

ok

important because it's our way to deal with not having atomic types
right now (and so the compiler is aware, too), and because it's
important for readers of the code because it alerts them that this is
concurrent code.

+      if (__builtin_expect (newgen == 0, 0))
+	_dl_fatal_printf (N_("\
  TLS generation counter wrapped!  Please report this."));
+      /* Synchronize with the load acquire in _dl_allocate_tls_init.
+	 See the CONCURRENCY NOTES there in dl-tls.c.  */
+      atomic_store_release (&GL(dl_tls_generation), newgen);


Is there any reason you wouldn't want a release MO atomic increment
here?


i wanted to avoid storing an overflowing counter.
(didn't want to analyze that case)

+   During dynamic TLS and DTV allocation and setup various objects may be
+   accessed concurrently:
+
+     GL(dl_tls_max_dtv_idx)
+     GL(dl_tls_generation)
+     listp->slotinfo[i].map
+     listp->slotinfo[i].gen
+     listp->next
+
+   where listp is a node in the GL(dl_tls_dtv_slotinfo_list) list.  The public
+   APIs that may access them are
+
+     Writers: dlopen, dlclose and dynamic linker start up code.
+     Readers only: pthread_create and __tls_get_addr (TLS access).


I would first describe what these functions / features intend to do and
what synchronization problem they have to solve before going into
details of how that's implemented.  This should be sufficiently detailed
to answer all questions such as "why do I need this critical section
described below?".


i probably don't know all the details of that.

+   The writers hold the GL(dl_load_lock), but the readers don't, so atomics
+   should be used when accessing these globals.


This covers the low-level data races, but it doesn't yet explain why
you'd want the critical sections.  Think about which atomicity and
ordering guarantees you need to make your high-level synchronization
scheme work.


i can write more about the synchronizations i added,
but not much about what other state dl_load_lock
protects.

+   dl_open_worker (called from dlopen) for each loaded module increases
+   GL(dl_tls_max_dtv_idx), sets the link_map of the module up, adds a new
+   slotinfo entry to GL(dl_tls_dtv_slotinfo_list) with the new link_map and
+   the next generation number GL(dl_tls_generation)+1.  Then it increases
+   GL(dl_tls_generation) which sinals that the new slotinfo entries are ready.


You describe this as if it were sequential code.  Is it, or do you need
more detail?


yes this is sequential code in dlopen.
(but scattered around in many functions so it
is not immediately obvious)

+   This last write is release mo so previous writes can be synchronized.


Terminology: the matching acquire MO load synchronized with the release
MO write.  The writes sequenced before the release MO store than happen
before stuff sequenced after the acquire MO load.

Also, uppercase MO so it's clear it's an abbreviation.

ok

-
-  /* Check if the current dtv is big enough.   */
-  if (dtv[-1].counter < GL(dl_tls_max_dtv_idx))
+  size_t gen_count;
+  size_t dtv_slots;
+
+  /* Synchronize with the release mo store in dl_open_worker, modules with
+     larger generation number are ignored.  */
+  gen_count = atomic_load_acquire (&GL(dl_tls_generation));
+  /* Check if the current dtv is big enough.  GL(dl_tls_max_dtv_idx) is
+     concurrently modified, but after the release mo store to


"after" in terms of which relation?  This is a good example of a place
where you could probably refer to a part of the high-level
synchronization scheme.


if the load of *_idx "happens after" the store of *_generation
then it is an upper bound to modids at that generation.

+     GL(dl_tls_generation) it always remains a modid upper bound for
+     previously loaded modules so relaxed access is enough.  */
+  dtv_slots = atomic_load_relaxed (&GL(dl_tls_max_dtv_idx));

...

-	  map = listp->slotinfo[cnt].map;
+	  /* Synchronize with the release mo store in _dl_add_to_slotinfo in
+	     dlopen, so the generation number read below is for a valid entry.


Isn't it in dl-tls.c?


yes, but it is called from dlopen.

+	     TODO: remove_slotinfo in dlclose is not synchronized.  */
+	  map = atomic_load_acquire (&listp->slotinfo[cnt].map);
  	  if (map == NULL)
  	    /* Unused entry.  */
  	    continue;


It might be worthwhile to point out why missing out on a concurrent
update to map would be okay here, or why it can't actually happen.


ok

concurrent update of the relevant maps don't happen
(except in dlclose.. but that bug is not fixed)

only slotinfo entries that should be ignored here may
change concurrently (and the dtv/tls of those are
initialized lazily at first access).

+	  size_t gen = listp->slotinfo[cnt].gen;
+	  if (gen > gen_count)
+	    /* New, concurrently loaded entry.  */
+	    continue;
+
  	  /* Keep track of the maximum generation number.  This might
  	     not be the generation counter.  */
-	  assert (listp->slotinfo[cnt].gen <= GL(dl_tls_generation));


See my comments elsewhere in the thread.


i can keep the assert but the gen > gen_count cases
must be skipped here so the only way the assert can
fail is if the generation number overflows.
(otherwise gen <= gen_count <= GL(dl_tls_generation) holds)

-	  maxgen = MAX (maxgen, listp->slotinfo[cnt].gen);
+	  maxgen = MAX (maxgen, gen);

  	  dtv[map->l_tls_modid].pointer.val = TLS_DTV_UNALLOCATED;
  	  dtv[map->l_tls_modid].pointer.is_static = false;
@@ -518,11 +576,14 @@ _dl_allocate_tls_init (void *result)
  	}

        total += cnt;
-      if (total >= GL(dl_tls_max_dtv_idx))
+      if (total >= dtv_slots)


Could this have changed in the meantime, and thus be large enough?  You
change here what is potentially more than one memory access (depending
on what the compiler does to this, strictly speaking, sequential code)
to a single one in the beginning.


dtv_slots at the beginning is already an upper bound to
the relevant modids (and that's how much dtv we allocated),
so we can finish the iteration if total visited modules
grows beyond that.

rereading it does not cause a problem, but it does not
help either (can be smaller or larger as well, depending
on what dlopen/dlclose happened concurrently, the
slotinfo list does not shrink though)

@@ -916,7 +977,7 @@ _dl_add_to_slotinfo (struct link_map *l)
  	 the first slot.  */
        assert (idx == 0);

-      listp = prevp->next = (struct dtv_slotinfo_list *)
+      listp = (struct dtv_slotinfo_list *)
  	malloc (sizeof (struct dtv_slotinfo_list)
  		+ TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
        if (listp == NULL)


There's another store to dl_tls_generation around here that you're not
accessing atomically.


yes, the oom code path

that looked broken logic: it does not check for overflow (bug)
it leaks memory (another bug) and the comment roughly says
"the application will crash anyway" so i wanted to avoid messing
with it.

but yes that should be atomic too.

@@ -939,9 +1000,15 @@ cannot create TLS data structures"));
        listp->next = NULL;
        memset (listp->slotinfo, '\0',
  	      TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
+      /* _dl_allocate_tls_init concurrently walks this list at thread
+	 creation, it must only observe initialized nodes in the list.


Another bit for the high-level scheme.

+	 See the CONCURRENCY NOTES there.  */
+      atomic_store_release (&prevp->next, listp);
      }

    /* Add the information into the slotinfo data structure.  */
-  listp->slotinfo[idx].map = l;
    listp->slotinfo[idx].gen = GL(dl_tls_generation) + 1;
+  /* Synchronize with the acquire load in _dl_allocate_tls_init.


Say why you need to enforce this, or point to part in the higher-level
scheme where this is explained.


_dl_allocate_tls_init walks the slotinfo list beyond
the entries it actually cares about, the extra entries
are either unused (zero) or initialized here with
large gen number.

so those extra entries must be possible to recognize.

i used sync so either map==0 or gen>gen_count is
visible to _dl_allocate_tls_init for new entries.

however now i see yet another bug here: the gen can
overflow and become 0 that way too, so the right fix
is to check for (map==0 || gen==0 || gen>gen_count)
and no sync is needed only relaxed atomic access.

but i have to verify gen==0 cannot happen otherwise
so it's ok to skip it in _dl_allocate_tls_init.

+     See the CONCURRENCY NOTES there.  */
+  atomic_store_release (&listp->slotinfo[idx].map, l);
  }

Follow-Ups:
- Re: [PATCH][BZ #19329] Fix race between tls allocation at thread creation and dlopen
  - From: Torvald Riegel

References:
- [PATCH][BZ #19329] Fix race between tls allocation at thread creation and dlopen
  - From: Szabolcs Nagy
- Re: [PATCH][BZ #19329] Fix race between tls allocation at thread creation and dlopen
  - From: Szabolcs Nagy
- Re: [PATCH][BZ #19329] Fix race between tls allocation at thread creation and dlopen
  - From: Carlos O'Donell
- Re: [PATCH][BZ #19329] Fix race between tls allocation at thread creation and dlopen
  - From: Szabolcs Nagy
- Re: [PATCH][BZ #19329] Fix race between tls allocation at thread creation and dlopen
  - From: Torvald Riegel

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]