This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH][BZ #19329] Fix race between tls allocation at thread creation and dlopen


I've just looked at your patch so far and haven't analyzed all the
affected code.  Nonetheless, a few comments below.  So far, I would
rather postpone this to 2.23 because you do change a few things and I
don't see the level of detail in analysis and code comments that we'd
ideally like to see for concurrency-related changes.

On Tue, 2016-01-19 at 18:28 +0000, Szabolcs Nagy wrote:
> diff --git a/elf/dl-open.c b/elf/dl-open.c
> index 6f178b3..2b97605 100644
> --- a/elf/dl-open.c
> +++ b/elf/dl-open.c
> @@ -524,9 +524,16 @@ dl_open_worker (void *a)
>      }
>  
>    /* Bump the generation number if necessary.  */
> -  if (any_tls && __builtin_expect (++GL(dl_tls_generation) == 0, 0))
> -    _dl_fatal_printf (N_("\
> +  if (any_tls)
> +    {
> +      size_t newgen = GL(dl_tls_generation) + 1;

If you access a memory location with atomics somewhere, access it with
atomics everywhere (initialization is typically an exception).  This is
important because it's our way to deal with not having atomic types
right now (and so the compiler is aware, too), and because it's
important for readers of the code because it alerts them that this is
concurrent code.

> +      if (__builtin_expect (newgen == 0, 0))
> +	_dl_fatal_printf (N_("\
>  TLS generation counter wrapped!  Please report this."));
> +      /* Synchronize with the load acquire in _dl_allocate_tls_init.
> +	 See the CONCURRENCY NOTES there in dl-tls.c.  */
> +      atomic_store_release (&GL(dl_tls_generation), newgen);

Is there any reason you wouldn't want a release MO atomic increment
here?

> +    }
>  
>    /* We need a second pass for static tls data, because _dl_update_slotinfo
>       must not be run while calls to _dl_add_to_slotinfo are still pending.  */
> diff --git a/elf/dl-tls.c b/elf/dl-tls.c
> index ed13fd9..7184a54 100644
> --- a/elf/dl-tls.c
> +++ b/elf/dl-tls.c
> @@ -443,6 +443,48 @@ _dl_resize_dtv (dtv_t *dtv)
>  }
>  
> 
> +/* CONCURRENCY NOTES:

This is a good start, but I think we need more detail, and we need to
more carefully describe the high-level synchronization scheme (which we
are still working on, of course).

> +   During dynamic TLS and DTV allocation and setup various objects may be
> +   accessed concurrently:
> +
> +     GL(dl_tls_max_dtv_idx)
> +     GL(dl_tls_generation)
> +     listp->slotinfo[i].map
> +     listp->slotinfo[i].gen
> +     listp->next
> +
> +   where listp is a node in the GL(dl_tls_dtv_slotinfo_list) list.  The public
> +   APIs that may access them are
> +
> +     Writers: dlopen, dlclose and dynamic linker start up code.
> +     Readers only: pthread_create and __tls_get_addr (TLS access).

I would first describe what these functions / features intend to do and
what synchronization problem they have to solve before going into
details of how that's implemented.  This should be sufficiently detailed
to answer all questions such as "why do I need this critical section
described below?".

> +   The writers hold the GL(dl_load_lock), but the readers don't, so atomics
> +   should be used when accessing these globals.

This covers the low-level data races, but it doesn't yet explain why
you'd want the critical sections.  Think about which atomicity and
ordering guarantees you need to make your high-level synchronization
scheme work.

> +   dl_open_worker (called from dlopen) for each loaded module increases
> +   GL(dl_tls_max_dtv_idx), sets the link_map of the module up, adds a new
> +   slotinfo entry to GL(dl_tls_dtv_slotinfo_list) with the new link_map and
> +   the next generation number GL(dl_tls_generation)+1.  Then it increases
> +   GL(dl_tls_generation) which signals that the new slotinfo entries are ready.

You describe this as if it were sequential code.  Is it, or do you need
more detail?

> +   This last write is release mo so previous writes can be synchronized.

Terminology: the matching acquire MO load synchronizes with the release
MO store.  The writes sequenced before the release MO store then happen
before everything sequenced after the acquire MO load.

Also, uppercase MO so it's clear it's an abbreviation.

> +   GL(dl_tls_max_dtv_idx) is always an upper bound of the modids of the ready
> +   entries.  The slotinfo list might be shorter than that during dlopen.
> +   Entries in the slotinfo list might have gen > GL(dl_tls_generation) and
> +   map == NULL.

That bit sounds like a candidate for the high-level scheme.

> +   _dl_allocate_tls_init is called from pthread_create and it looks through
> +   the slotinfo list to do the dynamic TLS and DTV setup for the new thread.
> +   It first loads the current GL(dl_tls_generation) with acquire mo and only
> +   considers modules up to that generation ignoring any later change to the
> +   slotinfo list.
> +
> +   TODO: Entries might get changed and freed in dlclose without sync.
> +   TODO: __tls_get_addr is not yet synchronized with dlopen and dlclose.
> +*/
> +
>  void *
>  internal_function
>  _dl_allocate_tls_init (void *result)
> @@ -455,9 +497,18 @@ _dl_allocate_tls_init (void *result)
>    struct dtv_slotinfo_list *listp;
>    size_t total = 0;
>    size_t maxgen = 0;
> -
> -  /* Check if the current dtv is big enough.   */
> -  if (dtv[-1].counter < GL(dl_tls_max_dtv_idx))
> +  size_t gen_count;
> +  size_t dtv_slots;
> +
> +  /* Synchronize with the release mo store in dl_open_worker, modules with
> +     larger generation number are ignored.  */
> +  gen_count = atomic_load_acquire (&GL(dl_tls_generation));
> +  /* Check if the current dtv is big enough.  GL(dl_tls_max_dtv_idx) is
> +     concurrently modified, but after the release mo store to

"after" in terms of which relation?  This is a good example of a place
where you could probably refer to a part of the high-level
synchronization scheme.

> +     GL(dl_tls_generation) it always remains a modid upper bound for
> +     previously loaded modules so relaxed access is enough.  */
> +  dtv_slots = atomic_load_relaxed (&GL(dl_tls_max_dtv_idx));
> +  if (dtv[-1].counter < dtv_slots)
>      {
>        /* Resize the dtv.  */
>        dtv = _dl_resize_dtv (dtv);
> @@ -480,18 +531,25 @@ _dl_allocate_tls_init (void *result)
>  	  void *dest;
>  
>  	  /* Check for the total number of used slots.  */
> -	  if (total + cnt > GL(dl_tls_max_dtv_idx))
> +	  if (total + cnt > dtv_slots)
>  	    break;
>  
> -	  map = listp->slotinfo[cnt].map;
> +	  /* Synchronize with the release mo store in _dl_add_to_slotinfo in
> +	     dlopen, so the generation number read below is for a valid entry.

Isn't it in dl-tls.c?

> +	     TODO: remove_slotinfo in dlclose is not synchronized.  */
> +	  map = atomic_load_acquire (&listp->slotinfo[cnt].map);
>  	  if (map == NULL)
>  	    /* Unused entry.  */
>  	    continue;

It might be worthwhile to point out why missing out on a concurrent
update to map would be okay here, or why it can't actually happen.

> +	  size_t gen = listp->slotinfo[cnt].gen;
> +	  if (gen > gen_count)
> +	    /* New, concurrently loaded entry.  */
> +	    continue;
> +
>  	  /* Keep track of the maximum generation number.  This might
>  	     not be the generation counter.  */
> -	  assert (listp->slotinfo[cnt].gen <= GL(dl_tls_generation));

See my comments elsewhere in the thread.

> -	  maxgen = MAX (maxgen, listp->slotinfo[cnt].gen);
> +	  maxgen = MAX (maxgen, gen);
>  
>  	  dtv[map->l_tls_modid].pointer.val = TLS_DTV_UNALLOCATED;
>  	  dtv[map->l_tls_modid].pointer.is_static = false;
> @@ -518,11 +576,14 @@ _dl_allocate_tls_init (void *result)
>  	}
>  
>        total += cnt;
> -      if (total >= GL(dl_tls_max_dtv_idx))
> +      if (total >= dtv_slots)

Could this have changed in the meantime, and thus be large enough?  You
change what is potentially more than one memory access (depending on
what the compiler does to this, strictly speaking, sequential code)
into a single access at the beginning.

>  	break;
>  
> -      listp = listp->next;
> -      assert (listp != NULL);
> +      /* Synchronize with the release mo store in _dl_add_to_slotinfo
> +	 so only initialized slotinfo nodes are looked at.  */
> +      listp = atomic_load_acquire (&listp->next);

> +      if (listp == NULL)
> +	break;
>      }
>  
>    /* The DTV version is up-to-date now.  */
> @@ -916,7 +977,7 @@ _dl_add_to_slotinfo (struct link_map *l)
>  	 the first slot.  */
>        assert (idx == 0);
>  
> -      listp = prevp->next = (struct dtv_slotinfo_list *)
> +      listp = (struct dtv_slotinfo_list *)
>  	malloc (sizeof (struct dtv_slotinfo_list)
>  		+ TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
>        if (listp == NULL)

There's another store to dl_tls_generation around here that you're not
accessing atomically.

> @@ -939,9 +1000,15 @@ cannot create TLS data structures"));
>        listp->next = NULL;
>        memset (listp->slotinfo, '\0',
>  	      TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
> +      /* _dl_allocate_tls_init concurrently walks this list at thread
> +	 creation, it must only observe initialized nodes in the list.

Another bit for the high-level scheme.

> +	 See the CONCURRENCY NOTES there.  */
> +      atomic_store_release (&prevp->next, listp);
>      }
>  
>    /* Add the information into the slotinfo data structure.  */
> -  listp->slotinfo[idx].map = l;
>    listp->slotinfo[idx].gen = GL(dl_tls_generation) + 1;
> +  /* Synchronize with the acquire load in _dl_allocate_tls_init.

Say why you need to enforce this, or point to part in the higher-level
scheme where this is explained.

> +     See the CONCURRENCY NOTES there.  */
> +  atomic_store_release (&listp->slotinfo[idx].map, l);
>  }



