This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Unify pthread_once (bug 15215)


On 04/07/2014 08:37 AM, Torvald Riegel wrote:
> On Fri, 2014-03-28 at 19:29 -0400, Carlos O'Donell wrote:
>> David, Marcus, Joseph, Mike, Andreas, Steve, Chris,
>>
>> We would like to unify all C-based pthread_once implmentations
>> per the plan in bug 15215 for glibc 2.20.
>>
>> Your machines are on the list of C-based pthread_once implementations.
>>
>> See this for the intial discussions on the unified pthread_once:
>> https://sourceware.org/ml/libc-alpha/2013-05/msg00210.html
>>
>> The goal is to provide a single and correct C implementation of 
>> pthread_once. Architectures can then build on that if they need more 
>> optimal implementations, but I don't encourage that and I'd rather
>> see deep discussions on how to make one unified solution where
>> possible.
>>
>> I've also just reviewed Torvald's new pthread_once microbenchmark which
>> you can use to compare your previous C implementation with the new
>> standard C implementation (measures pthread_once latency). The primary
>> use of this test is to help provide objective proof for or against the
>> i386 and x86_64 assembly implementations.
>>
>> We are not presently converting any of the machines with custom
>> implementations, but that will be a next step after testing with the
>> help of the maintainers for sh, i386, x86_64, powerpc, s390 and alpha.
>>
>> If we don't hear any objections we will go forward with this change
>> in one week and unify ia64, hppa, mips, tile, sparc, m68k, arm
>> and aarch64 on a single pthread_once implementation based on sparc's C
>> implementation.

This version looks good to me.

Please check it in after fixing the one nit where you needed double 
space after a period.

Follow this if you're rusty:
https://sourceware.org/glibc/wiki/Committer%20checklist

> So far, I've seen an okay for tile, and a question about ARM.  Will, are
> you okay with the change for ARM?
> 
> Any other objections to the updated patch that's attached?

As I mentioned in my other email, this cleanup is only for targets using
generic C code implementations which we have shown to be missing barriers.
Converting these C-only targets is the right thing to do. Converting the
assembly implementations is going to be more work.

Next steps:
* Check this in.
* Send another notification to the maintainers about the change.
  - This gives them another chance to look at the benchmark numbers.
* Work with any arch maintainers to look at performance losses.

>>> +   When forking the process, some threads can be interrupted during the second
>>> +   state; they won't be present in the forked child, so we need to restart
>>> +   initialization in the child.  To distinguish an in-progress initialization
>>> +   from an interrupted initialization (in which case we need to reclaim the
>>> +   lock), we look at the fork generation that's part of the second state: We
>>> +   can reclaim iff it differs from the current fork generation.
>>> +   XXX: This algorithm has an ABA issue on the fork generation: If an
>>> +   initialization is interrupted, we then fork 2^30 times (30b of once_control
>>
>> What's "30b?" 30 bits? Please spell it out.
>>
>>> +   are used for the fork generation), and try to initialize again, we can
>>> +   deadlock because we can't distinguish the in-progress and interrupted cases
>>> +   anymore.  */
>>
>> Would you mind filing a bug for this in the upstream bugzilla?
> 
> https://sourceware.org/bugzilla/show_bug.cgi?id=16816
> 
>> It's a distinct bug from this unification work, but a valid problem.
>>
>> Can this be fixed by detecting generation counter overflow in fork
>> and failing the function call?
> 
> Yes, but this would prevent us from doing more than 2^30 fork calls.
> That may not be a problem in practice -- but if so, then we won't hit
> the ABA either :)

It's probably not a problem, because 2^30 forks of even a 1MB process
is going to need 1 Petabyte or more of memory/swap, but still...

A security issue is introduced here in that early corruption of the fork
generation counter could lead to deadlock. We close that window slightly
by doing a sanity check on the generation counter to detect overflow.
It doesn't fix all cases, but it means you can't easily corrupt the gen
counter early and then wait for the fork to deadlock. You now need to
corrupt the fork generation counter after the check which is a smaller
window.

Either way I think an assert on overflow in fork.c is needed, but that's
another fix that I expect you to submit after this one. Note that the
implementation is in: nptl/sysdeps/unix/sysv/linux/fork.c, and the
limit of 2^30 forks only applies to applications linked against libpthread
which provides a strong definition of fork that overrides libc's weak
definition (which does a lot less). In the dynamic case libpthread's
version of fork is used because it is loaded first since it depends on libc
(remember that weak/strong are not applied to dynamic libraries per ELF
rules).

>>> +      do
>>> +	{
>>> +	  /* Check if the initialization has already been done.  */
>>> +	  if (__builtin_expect ((val & 2) != 0, 1))
>>
>> Use __glibc_likely.
>>
>> e.g. if (__glibc_likely ((val & 2) != 0))
>>
>> This is the fast path that we are testing for in the microbenchmark?
> 
> Yes.

Good.

>>> +	    return 0;
>>> +
>>> +	  oldval = val;
>>> +	  /* We try to set the state to in-progress and having the current
>>> +	     fork generation.  We don't need atomic accesses for the fork
>>> +	     generation because it's immutable in a particular process, and
>>> +	     forked child processes start with a single thread that modified
>>> +	     the generation.  */
>>> +	  newval = __fork_generation | 1;
>>
>> OT: I wonder if Valgrind will report a benign race in accessing __fork_generation.
> 
> Perhaps.  I believe that eventually, lots of this and similar variables
> should be atomic-typed and/or accessed with relaxed-memory-order atomic
> loads.  This would clarify that we expect concurrent accesses and that
> they don't constitute a data race.

I don't see how Valgrind would know this from the binary itself, but
I guess this will just need to have per-glibc-version exceptions for
Valgrind.

Your ChangeLog still needs to follow the normal format, including
header line with date and name, blank line, and tab before text on
lines thereafter.

e.g.

2014-04-07  Torvald Riegel  <triegel@redhat.com>

	[BZ #15215]
	* nptl/sysdeps/unix/sysv/linux/sparc/pthread_once.c: Moved to ...
	* nptl/sysdeps/unix/sysv/linux/pthread_once.c: ... here.  Add missing
	memory barriers.  Add comments.
	* sysdeps/unix/sysv/linux/aarch64/nptl/pthread_once.c: Remove file.
	* sysdeps/unix/sysv/linux/arm/nptl/pthread_once.c: Remove file.
	* sysdeps/unix/sysv/linux/ia64/nptl/pthread_once.c: Remove file.
	* sysdeps/unix/sysv/linux/m68k/nptl/pthread_once.c: Remove file.
	* sysdeps/unix/sysv/linux/mips/nptl/pthread_once.c: Remove file.
	* sysdeps/unix/sysv/linux/tile/nptl/pthread_once.c: Remove file.

Changelog.hppa:

2014-04-07  Torvald Riegel  <triegel@redhat.com>

	[BZ #15215]
	* sysdeps/unix/sysv/linux/hppa/nptl/pthread_once.c: Remove file.

Don't forget to update NEWS

> diff --git a/nptl/sysdeps/unix/sysv/linux/pthread_once.c b/nptl/sysdeps/unix/sysv/linux/pthread_once.c
> new file mode 100644
> index 0000000..8453d2d
> --- /dev/null
> +++ b/nptl/sysdeps/unix/sysv/linux/pthread_once.c
> @@ -0,0 +1,131 @@
> +/* Copyright (C) 2003-2014 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +   Contributed by Jakub Jelinek <jakub@redhat.com>, 2003.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include "pthreadP.h"
> +#include <lowlevellock.h>
> +#include <atomic.h>
> +
> +
> +unsigned long int __fork_generation attribute_hidden;
> +
> +
> +static void
> +clear_once_control (void *arg)
> +{
> +  pthread_once_t *once_control = (pthread_once_t *) arg;
> +
> +  /* Reset to the uninitialized state here.  We don't need a stronger memory
> +     order because we do not need to make any other of our writes visible to
> +     other threads that see this value: This function will be called if we
> +     get interrupted (see __pthread_once), so all we need to relay to other
> +     threads is the state being reset again.  */

OK.

> +  *once_control = 0;
> +  lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
> +}
> +
> +
> +/* This is similar to a lock implementation, but we distinguish between three
> +   states: not yet initialized (0), initialization finished (2), and
> +   initialization in progress (__fork_generation | 1).  If in the first state,
> +   threads will try to run the initialization by moving to the second state;
> +   the first thread to do so via a CAS on once_control runs init_routine,
> +   other threads block.
> +   When forking the process, some threads can be interrupted during the second
> +   state; they won't be present in the forked child, so we need to restart
> +   initialization in the child.  To distinguish an in-progress initialization
> +   from an interrupted initialization (in which case we need to reclaim the
> +   lock), we look at the fork generation that's part of the second state: We
> +   can reclaim iff it differs from the current fork generation.
> +   XXX: This algorithm has an ABA issue on the fork generation: If an
> +   initialization is interrupted, we then fork 2^30 times (30 bits of
> +   once_control are used for the fork generation), and try to initialize
> +   again, we can deadlock because we can't distinguish the in-progress and
> +   interrupted cases anymore.  */

OK.

> +int
> +__pthread_once (once_control, init_routine)
> +     pthread_once_t *once_control;
> +     void (*init_routine) (void);
> +{
> +  while (1)
> +    {
> +      int oldval, val, newval;
> +
> +      /* We need acquire memory order for this load because if the value
> +         signals that initialization has finished, we need to be see any
> +         data modifications done during initialization.  */
> +      val = *once_control;
> +      atomic_read_barrier();
> +      do
> +	{
> +	  /* Check if the initialization has already been done.  */
> +	  if (__glibc_likely ((val & 2) != 0))

OK.

> +	    return 0;
> +
> +	  oldval = val;
> +	  /* We try to set the state to in-progress and having the current
> +	     fork generation.  We don't need atomic accesses for the fork
> +	     generation because it's immutable in a particular process, and
> +	     forked child processes start with a single thread that modified
> +	     the generation.  */
> +	  newval = __fork_generation | 1;
> +	  /* We need acquire memory order here for the same reason as for the
> +	     load from once_control above.  */
> +	  val = atomic_compare_and_exchange_val_acq (once_control, newval,
> +						     oldval);
> +	}
> +      while (__glibc_unlikely (val != oldval));

OK.

> +
> +      /* Check if another thread already runs the initializer.	*/
> +      if ((oldval & 1) != 0)
> +	{
> +	  /* Check whether the initializer execution was interrupted by a
> +	     fork. We know that for both values, bit 0 is set and bit 1 is

s/. We/.  We/g

> +	     not.  */
> +	  if (oldval == newval)
> +	    {
> +	      /* Same generation, some other thread was faster. Wait.  */
> +	      lll_futex_wait (once_control, newval, LLL_PRIVATE);
> +	      continue;
> +	    }
> +	}
> +
> +      /* This thread is the first here.  Do the initialization.
> +	 Register a cleanup handler so that in case the thread gets
> +	 interrupted the initialization can be restarted.  */
> +      pthread_cleanup_push (clear_once_control, once_control);

OK.

> +
> +      init_routine ();
> +
> +      pthread_cleanup_pop (0);
> +
> +
> +      /* Mark *once_control as having finished the initialization.  We need
> +         release memory order here because we need to synchronize with other
> +         threads that want to use the initialized data.  */
> +      atomic_write_barrier();
> +      *once_control = 2;

OK.

> +
> +      /* Wake up all other threads.  */
> +      lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);

OK.

> +      break;
> +    }
> +
> +  return 0;
> +}
> +weak_alias (__pthread_once, pthread_once)
> +hidden_def (__pthread_once)

[snip removal of other pthread_once implementations]

Cheers,
Carlos.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]