This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] New condvar implementation that provides stronger ordering guarantees.


Ping, 3 months later.

I believe Siddhesh has been testing this in rawhide, and at least I
haven't heard of any breakage.  Nonetheless, the new implementation is
complex, so getting an actual review before this gets checked in would
be good.  At the very least, it would be good if someone could confirm
that the comments are sufficient and easy to follow; alternatively,
please complain if you think more detail is needed.

On Fri, 2015-02-20 at 19:18 +0100, Torvald Riegel wrote:
> TLDR: This is a new implementation for condition variables, required
> after http://austingroupbugs.net/view.php?id=609 to fix bug 13165.  In
> essence, we need to be stricter in which waiters a signal or broadcast
> is required to wake up; this couldn't be solved using the old algorithm.
> ISO C++ made a similar clarification, so this also fixes a bug in
> current libstdc++, for example.
> This doesn't contain the existing requeue optimization (see below), and
> PI support is not worse (except no requeue).
> 
> 
> Arch maintainers: This adapts each arch's definition of pthread_cond_t.
> Only x86, x86_64, and hppa have significant arch-specific changes.  I'd
> appreciate review considering we want to stay compatible with old
> initializers, and we don't want existing padding to alias with bytes
> that are now used for real fields (see text on pthread_cond_t below for
> details).
> 
> 
> And here's the detailed version :)
> 
> We can't use the old algorithm, which tries to avoid spurious wake-ups,
> anymore because there's no way (AFAICT) to wake in FIFO order from a
> futex (the current Linux implementation may do today, but it's not
> guaranteed).  Thus, when we wake, we can't simply let someone grab a
> signal, but we need to ensure that one of the waiters happening before
> the signal is woken up -- not just any waiter.  This is something the
> previous algorithm violated (see bug 13165).
> The problem with having to wake in-order and trying to prevent spurious
> wake-ups is that one would have to encode the order, which one needs
> space for (e.g., for separate futexes).  But pthread_cond_t is limited
> in space, and we can't use external space for process-shared condvars.
> 
> The primary reason for spurious wake-ups with this new algorithm is the
> following:  If we have waiters W1 and W2, and W2 registers itself as a
> waiter later than W1 does, and if W2 blocks earlier than W1 using a
> futex, then a signal will wake both W1 and W2.  IOW, this is when one
> waiter races ahead of an earlier one when doing futex_wait.  Once they
> both wait, or if they keep their order, there's no spurious wake-ups.
> 
> We could avoid more of these spurious wake-ups by maintaining at least
> two groups of waiters that approximate the waiter order (e.g., have one
> group that is eligible for wake-up, and drained with priority, and a
> second that is catch-all and will become the new first group when that
> is drained).  But this would add substantial complexity to the
> algorithm, and it may be a tight fit into the size of pthread_cond_t we
> have today.
> 
> 
> There's another issue specific to condvars: ABA issues on the underlying
> futexes.  Unlike mutexes that have just three states, or semaphores that
> have no tokens or a limited number of them, the state of a condvar is
> the *order* of the waiters.  With a semaphore, we can grab whenever
> there is a token, and block in the futex_wait when there is none.  With
> condvars, a waiter must not block if there had been more
> signals/broadcasts than waiters before it (ie, ordering matters).
> 
> Futex words in userspace (ie, those memory locations that control
> whether futex_wait blocks or not) are 32b.  Therefore, we can have ABA
> issues on it, which could lead to lost wake-ups because a waiter simply
> can't distinguish between no new signals being sent and lots of signals
> being sent (2^31 in this implementation).
> 
> It might be unlikely that this occurs, and needs a specific scenario,
> but I'm not comfortable with just ignoring it -- knowingly.  Therefore,
> this patch avoids the ABA issues by quiescing the condvar before an
> overflow on the internal counters for the number of waiters /
> signals-sent happens, and then resets the condvar.  This just increases
> the number of spurious wake-ups while we do the reset on non-PI condvars
> (but it is a problem for PI; see below).  The quiescence handling does
> add complexity, but it seems not excessive relative to the rest of the
> algorithm.
> 
> 
> This algorithm satisfies the equivalent of the strong mutex destruction
> guarantee.  However, unlike mutexes, because of spurious wake-ups being
> allowed a correct program has to effectively ensure that destruction
> happens after signals/broadcast calls return.  Thus, the destruction
> requirement in POSIX is not as restrictive as with semaphores, but we
> still need to take care.
> 
> 
> If you want to dive into the code, it's best to start with the comments
> on top of __pthread_cond_wait_common in nptl/pthread_cond_wait.c.
> 
> One notable difference to the previous implementation is that the new
> code doesn't use an internal lock anymore.  This simplifies the PI
> implementation (see below), and should speed up things like concurrent
> signals/broadcasts, and the general hand-off between wait and
> signal/broadcast.
> 
> I've merged pthread_cond_timedwait.c into pthread_cond_wait.c because
> they both share most of the code.  __pthread_cond_wait_common is
> currently an always_inline, but we might also make that a noinline if
> you're concerned over statically linked programs that use either the
> timed or the non-timed cond_wait variant.
> 
> pthread_cond_t is the same on all archs (except on hppa, see below, and
> m68k which enforces 4-byte alignment of the first int).  The new
> algorithm needs less space (int instead of long long int in the case of
> three fields), so the old initializers should remain working.  The x86
> version looks fine for me, but I'd appreciate (an)other set(s) of eyes
> on this aspect.  We don't need stronger alignment for the new algorithm.
> 
> Carlos: the hppa changes are completely untested.  They change the
> pthread-once-like code to C11 atomics, which fixes one missing compiler
> barrier (acquire load was a plain load).  Let me know if you object to
> these.
> 
> x86 had custom assembly implementations.  Given that this patch fixes a
> correctness problem, I've just removed them.  Additionally, there's no
> real fast-path in cond_wait unless perhaps if we want to consider just
> spin for a signal to become available as a fast path; in all other
> cases, we have to wait, so that's cache misses at least.  signal and
> broadcast could be considered fast paths; the new code doesn't use an
> internal lock anymore, so they should have become faster (e.g.,
> cond_signal is just a CAS loop now and a call to futex_wait (that we
> might avoid too with some more complexity).
> 
> There are three new tests, cond36-cond28, which are variations of
> existing tests that frequently drive a condvar into the quiescence state
> (and thus test quiescence).
> 
> 
> This condvar doesn't yet use a requeue optimization (ie, on a broadcast,
> waking just one thread and requeueing all others on the futex of the
> mutex supplied by the program).  I don't think doing the requeue is
> necessarily the right approach (but I haven't done real measurements
> yet):
> * If a program expects to wake many threads at the same time and make
> that scalable, a condvar isn't great anyway because of how it requires
> waiters to operate mutually exclusive (due to the mutex usage).  Thus, a
> thundering herd problem is a scalability problem with or without the
> optimization.  Using something like a semaphore might be more
> appropriate in such a case.
> * The scalability problem is actually at the mutex side; the condvar
> could help (and it tries to with the requeue optimization), but it
> should be the mutex who decides how that is done, and whether it is done
> at all.
> * Forcing all but one waiter into the kernel-side wait queue of the
> mutex prevents/avoids the use of lock elision on the mutex.  Thus, it
> prevents the only cure against the underlying scalability problem
> inherent to condvars.
> * If condvars use short critical sections (ie, hold the mutex just to
> check a binary flag or such), which they should do ideally, then forcing
> all those waiter to proceed serially with kernel-based hand-off (ie,
> futex ops in the mutex' contended state, via the futex wait queues) will
> be less efficient than just letting a scalable mutex implementation take
> care of it.  Our current mutex impl doesn't employ spinning at all, but
> if critical sections are short, spinning can be much better.
> * Doing the requeue stuff requires all waiters to always drive the mutex
> into the contended state.  This leads to each waiter having to call
> futex_wake after lock release, even if this wouldn't be necessary.
> 
> Therefore, this condvar doesn't do requeue currently.  I'd like to get
> your opinion on this.
> Once we agree on a way forward, I'd either (1) adapt the condvar to use
> requeue or (2) adapt the _cond_ variants of the lowlevellock and
> pthread_mutex_* to not always drive the mutex into the contended state.
> Here's a sketch of how we could implement requeue (IOW, make sure we
> don't requeue to the wrong mutex):
> * Use one bit (NEW_MUTEX_BIT or such) in signals_sent as a flag for
> whether the mutex associated with the condvar changed.  Add proper
> masking of it, adapt WSEQ_THRESHOLD accordingly, etc.
> * Let waiters do this:
>   if (mutex != cond->mutex) {
>     atomic_store_relaxed (&cond->mutex, newmutex);
>     atomic_fetch_or_release (&cond->signals_sent, NEW_MUTEX_BIT);
>   }
>   futex_wait(...)
> * Let broadcast do:
>   s = atomic_fetch_add_acquire (&cond->signals_sent, signals_to_add);
>   if (s & NEW_MUTEX_BIT) /* reset the bit with a CAS, retry; */
>   m = atomic_load_relaxed (&cond->mutex);
>   futex_cmp_requeue (..., s + signals_to_add /* expected value */,
>      ..., m /* expected mutex */
> This would ensure that if a futex_wait on a new mutex comes in,
> broadcast will grab the new mutex or futex_cmp_requeue will fail (it
> will see the signals_sent update because of futex op ordering).
> 
> 
> PI support is "kind of" included.  There is no internal lock anymore, so
> the thing that Darren proposed the fix for is gone.  So far so good;
> however, we don't requeue, and Darren's paper states that requeue would
> yield better latency in the PI scenario (is this still the case?).
> 
> Darren, I'd propose that we figure out how to best adapt this new
> implementation to do PI.  I'm looking forward to your comments.
> 
> One thing I don't think we'll be able to solve is ensuring PI during
> quiescence.  When doing quiescence, we need for all waiters to go out of
> the condvar, so confirm their wake-up.  I can't see a way of boosting
> their priorities if they get suspended between releasing the mutex and
> starting to wait; there's no app lock they still hold, and we can't give
> wwaiters per-waiter locks linked from the condvar that we could use to
> boost individual threads because this doesn't work in the process-shared
> case.  Maybe there's something I'm missing (but I though a while about
> it ;), and maybe there's some PI-futex functionality that I wasn't aware
> of that we could (ab)use.
> Thus, the most practical approach seems to be to just not do any PI
> during quiescence (ie, every 2^31 cond_wait calls).  Any alternative
> suggestions?
> 
> This problem would go away if we had 64b futexes because then we
> wouldn't need quiescence anymore (assuming 64b counters avoid ABA).
> 
> 
> Tested on x86_64-linux and x86-linux.
> 
> 
> 2015-02-20  Torvald Riegel  <triegel@redhat.com>
> 
> 	[BZ #13165]
> 	* nptl/pthread_cond_broadcast.c (__pthread_cond_broadcast): Rewrite to
> 	use new algorithm.
> 	* nptl/pthread_cond_destroy.c (__pthread_cond_destroy): Likewise.
> 	* nptl/pthread_cond_init.c (__pthread_cond_init): Likewise.
> 	* nptl/pthread_cond_signal.c (__pthread_cond_signal): Likewise.
> 	* nptl/pthread_cond_wait.c (__pthread_cond_wait): Likewise.
> 	(__pthread_cond_timedwait): Move here from pthread_cond_timedwait.c.
> 	(__condvar_confirm_wakeup, __condvar_cancel_waiting,
> 	__condvar_cleanup_waiting, __condvar_cleanup_quiescence,
> 	__pthread_cond_wait_common): New.
> 	(__condvar_cleanup): Remove.
> 	* npt/pthread_condattr_getclock (pthread_condattr_getclock): Adapt.
> 	* npt/pthread_condattr_setclock (pthread_condattr_setclock): Likewise.
> 	* nptl/tst-cond1.c: Add comment.
> 	* nptl/tst-cond18.c (tf): Add quiescence testing.
> 	* nptl/tst-cond20.c (do_test): Likewise.
> 	* nptl/tst-cond25.c (do_test_wait): Likewise.
> 	* nptl/tst-cond20.c (do_test): Adapt.
> 	* nptl/tst-cond26.c: New file.
> 	* nptl/tst-cond27.c: Likewise.
> 	* nptl/tst-cond28.c: Likewise.
> 	* sysdeps/aarch64/nptl/bits/pthreadtypes.h (pthread_cond_t): Adapt
> 	structure.
> 	* sysdeps/arm/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/hppa/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/ia64/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/m68k/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/microblaze/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/mips/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/nios2/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/s390/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/sh/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/sparc/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/tile/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/unix/sysv/linux/alpha/bits/pthreadtypes.h (pthread_cond_t):
> 	Likewise.
> 	* sysdeps/unix/sysv/linux/powerpc/bits/pthreadtypes.h (pthread_cond_t):
> 	Likewise.
> 	* sysdeps/x86/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/nptl/internaltypes.h (COND_NWAITERS_SHIFT): Remove.
> 	(COND_CLOCK_BITS): Adapt.
> 	* sysdeps/nptl/pthread.h (PTHREAD_COND_INITIALIZER): Adapt.
> 	* sysdeps/unix/sysv/linux/hppa/internaltypes.h (cond_compat_clear,
> 	cond_compat_check_and_clear): Adapt.
> 	* sysdeps/unix/sysv/linux/hppa/pthread_cond_timedwait.c: Remove file ...
> 	* sysdeps/unix/sysv/linux/hppa/pthread_cond_wait.c
> 	(__pthread_cond_timedwait): ... and move here.
> 	* nptl/DESIGN-condvar.txt: Remove file.
> 	* nptl/lowlevelcond.sym: Likewise.
> 	* nptl/pthread_cond_timedwait.c: Likewise.
> 	* sysdeps/unix/sysv/linux/hppa/pthread_cond_timedwait.c: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_broadcast.S: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_signal.S: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_timedwait.S: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_wait.S: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_broadcast.S: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_signal.S: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_timedwait.S: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_wait.S: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_broadcast.S: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_signal.S: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_timedwait.S: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_wait.S: Likewise.
> 	* sysdeps/unix/sysv/linux/x86_64/pthread_cond_broadcast.S: Likewise.
> 	* sysdeps/unix/sysv/linux/x86_64/pthread_cond_signal.S: Likewise.
> 	* sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: Likewise.
> 	* sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: Likewise.
> 




Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]