This is the mail archive of the libc-ports@sources.redhat.com mailing list for the libc-ports project.



Re: [PATCH] Unify pthread_once (bug 15215)


On 05/08/2013 10:43 AM, Torvald Riegel wrote:
> See http://sourceware.org/bugzilla/show_bug.cgi?id=15215 for background.

You've already hashed out the details of these changes with Rich,
and he has no objection to this first phase of the patch, which
unifies the two implementations.
 
> I1 and I2 follow essentially the same algorithm, and we can replace it
> with a unified variant, as the bug suggests.  See the attached patch for
> a modified version of the sparc instance.  The differences between both
> are either cosmetic, or are unnecessary changes (i.e., how the
> init-finished state is set (atomic_inc vs. store), or how the fork
> generations are compared).
> 
> Both I1 and I2 were missing a release memory order (MO) when marking
> once_control as finished initialization.  If the particular arch doesn't
> need a HW barrier for release, we at least need a compiler barrier; if
> it's needed, the original I1 and I2 are not guaranteed to work.
> 
> Both I1 and I2 were missing acquire MO on the very first load of
> once_control.  This needs to synchronize with the release MO on setting
> the state to init-finished, so without it it's not guaranteed to work
> either.
> Note that this will make a call to pthread_once that doesn't need to
> actually run the init routine slightly slower due to the additional
> acquire barrier.  If you're really concerned about this overhead, speak
> up.  There are ways to avoid it, but it comes with additional complexity
> and bookkeeping.

We want correctness. This is a place where correctness is infinitely
more important than speed. We should be correct first and then we
should argue about how to make it fast.

> I'm currently also using the existing atomic_{read/write}_barrier
> functions instead of not-yet-existing load_acq or store_rel functions.
> I'm not sure whether the latter can have somewhat more efficient
> implementations on Power and ARM; if so, and if you're concerned about
> the overhead, we can add load_acq and store_rel to atomic.h and start
> using it.  This would be in line with C11, where we should eventually be
> heading anyway, IMO.

Agreed.

> Both I1 and I2 have an ABA issue on __fork_generation, as explained in
> the comments that the patch adds.  How do you all feel about this?
> I can't present a simple fix right now, but I believe it could be fixed
> with additional bookkeeping.
> 
> If there's no objection to the essence of this patch, I'll post another
> patch that actually replaces I1 and I2 with the modified variant in the
> attached patch.

Please repost.

> Cleaning up the magic numbers, perhaps fixing the ABA issue, and
> comparing to the custom asm versions would be next.  I had a brief look
> at the latter, and at least x86 doesn't seem to do anything logically
> different.

Right, that can be another step.
 
> diff --git a/nptl/sysdeps/unix/sysv/linux/sparc/pthread_once.c b/nptl/sysdeps/unix/sysv/linux/sparc/pthread_once.c
> index 5879f44..f9b0953 100644
> --- a/nptl/sysdeps/unix/sysv/linux/sparc/pthread_once.c
> +++ b/nptl/sysdeps/unix/sysv/linux/sparc/pthread_once.c
> @@ -28,11 +28,31 @@ clear_once_control (void *arg)
>  {
>    pthread_once_t *once_control = (pthread_once_t *) arg;
>  
> +  /* Reset to the uninitialized state here (see __pthread_once).  Also, we
> +     don't need a stronger memory order because we do not need to make any
> +     other of our writes visible to other threads that see this value.  */
>    *once_control = 0;
>    lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
>  }
>  
>  
> +/* This is similar to a lock implementation, but we distinguish between three
> +   states: not yet initialized (0), initialization finished (2), and
> +   initialization in progress (__fork_generation | 1).  If in the first state,
> +   threads will try to run the initialization by moving to the second state;
> +   the first thread to do so via a CAS on once_control runs init_routine,
> +   other threads block.
> +   When forking the process, some threads can be interrupted during the second
> +   state; they won't be present in the forked child, so we need to restart
> +   initialization in the child.  To distinguish an in-progress initialization
> +   from an interrupted initialization (in which case we need to reclaim the
> +   lock), we look at the fork generation that's part of the second state: We
> +   can reclaim iff it differs from the current fork generation.
> +   XXX: This algorithm has an ABA issue on the fork generation: If an
> +   initialization is interrupted, we then fork 2^30 times (30 bits of
> +   once_control are used for the fork generation), and try to initialize
> +   deadlock because we can't distinguish the in-progress and interrupted cases
> +   anymore.  */

Good comment. Good note on the ABA issue, even if it's somewhat impractical to trigger today.

>  int
>  __pthread_once (once_control, init_routine)
>       pthread_once_t *once_control;
> @@ -42,15 +62,26 @@ __pthread_once (once_control, init_routine)
>      {
>        int oldval, val, newval;
>  
> +      /* We need acquire memory order for this load because if the value
> +         signals that initialization has finished, we need to see any
> +         data modifications done during initialization.  */
>        val = *once_control;
> +      atomic_read_barrier();
>        do
>  	{
> -	  /* Check if the initialized has already been done.  */
> -	  if ((val & 2) != 0)
> +	  /* Check if the initialization has already been done.  */
> +	  if (__builtin_expect ((val & 2) != 0, 1))
>  	    return 0;
>  
>  	  oldval = val;
> -	  newval = (oldval & 3) | __fork_generation | 1;
> +	  /* We try to set the state to in-progress and having the current
> +	     fork generation.  We don't need atomic accesses for the fork
> +	     generation because it's immutable in a particular process, and
> +	     forked child processes start with a single thread that modified
> +	     the generation.  */
> +	  newval = __fork_generation | 1;
> +	  /* We need acquire memory order here for the same reason as for the
> +	     load from once_control above.  */
>  	  val = atomic_compare_and_exchange_val_acq (once_control, newval,
>  						     oldval);
>  	}
> @@ -59,9 +90,10 @@ __pthread_once (once_control, init_routine)
>        /* Check if another thread already runs the initializer.	*/
>        if ((oldval & 1) != 0)
>  	{
> -	  /* Check whether the initializer execution was interrupted
> -	     by a fork.	 */
> -	  if (((oldval ^ newval) & -4) == 0)
> +	  /* Check whether the initializer execution was interrupted by a
> +	     fork. (We know that for both values, bit 0 is set and bit 1 is
> +	     not.)  */
> +	  if (oldval == newval)
>  	    {
>  	      /* Same generation, some other thread was faster. Wait.  */
>  	      lll_futex_wait (once_control, newval, LLL_PRIVATE);
> @@ -79,8 +111,11 @@ __pthread_once (once_control, init_routine)
>        pthread_cleanup_pop (0);
>  
>  
> -      /* Add one to *once_control.  */
> -      atomic_increment (once_control);
> +      /* Mark *once_control as having finished the initialization.  We need
> +         release memory order here because we need to synchronize with other
> +         threads that want to use the initialized data.  */
> +      atomic_write_barrier();
> +      *once_control = 2;
>  
>        /* Wake up all other threads.  */
>        lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);

Cheers,
Carlos.

