This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH] New condvar implementation that provides stronger ordering guarantees.

From: OndÅej BÃlka <neleai at seznam dot cz>
To: Torvald Riegel <triegel at redhat dot com>
Cc: GLIBC Devel <libc-alpha at sourceware dot org>, Marcus Shawcroft <marcus dot shawcroft at gmail dot com>, "Joseph S. Myers" <joseph at codesourcery dot com>, Richard Henderson <rth at redhat dot com>, Carlos O'Donell <codonell at redhat dot com>, Mike Frysinger <vapier at gentoo dot org>, David Holsgrove <david dot holsgrove at xilinx dot com>, Chung-Lin Tang <chunglin_tang at mentor dot com>, Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>, Andreas Krebbel <krebbel at linux dot ibm dot com>, Kaz Kojima <kkojima at rr dot iij4u dot or dot jp>, Chris Metcalf <cmetcalf at tilera dot com>, David Miller <davem at davemloft dot net>, Darren Hart <dvhart at infradead dot org>
Date: Thu, 2 Jul 2015 00:15:13 +0200
Subject: Re: [PATCH] New condvar implementation that provides stronger ordering guarantees.
Authentication-results: sourceware.org; auth=none
References: <1424456307 dot 20941 dot 122 dot camel at triegel dot csb> <1431713889 dot 25070 dot 17 dot camel at triegel dot csb>

On Fri, May 15, 2015 at 08:18:09PM +0200, Torvald Riegel wrote:
> Ping, 3 months later.
> 
> I believe Siddhesh has been testing this in rawhide, and at least I
> haven't heard of any breakage.  Nonetheless, the new implementation is
> complex, so getting an actual review before this gets checked in would
> be good.  At the very least, it would be good if someone could confirm
> that the comments are sufficient and easy to follow; alternatively,
> please complain if you think more detail is needed.
> 
> On Fri, 2015-02-20 at 19:18 +0100, Torvald Riegel wrote:
> > TLDR: This is a new implementation for condition variables, required
> > after http://austingroupbugs.net/view.php?id=609 to fix bug 13165.  In
> > essence, we need to be stricter in which waiters a signal or broadcast
> > is required to wake up; this couldn't be solved using the old algorithm.
> > ISO C++ made a similar clarification, so this also fixes a bug in
> > current libstdc++, for example.
> > This doesn't contain the existing requeue optimization (see below), and
> > PI support is not worse (except no requeue).
> > 
> > 
> > Arch maintainers: This adapts each arch's definition of pthread_cond_t.
> > Only x86, x86_64, and hppa have significant arch-specific changes.  I'd
> > appreciate review considering we want to stay compatible with old
> > initializers, and we don't want existing padding to alias with bytes
> > that are now used for real fields (see text on pthread_cond_t below for
> > details).
> > 
> > 
> > And here's the detailed version :)
> > 
> > We can't use the old algorithm, which tries to avoid spurious wake-ups,
> > anymore because there's no way (AFAICT) to wake in FIFO order from a
> > futex (the current Linux implementation may do today, but it's not
> > guaranteed). 

And what is performance difference between old algorithm and proposed
one? If there is noticable difference wouldn't be better to write patch
into kernel to for example AT_MONOTONE_FUTEX auxval set to 1 and when
kernel developers decide for nonmonotone wakeup they will set that to 0.


> 
> > 
> > There's another issue specific to condvars: ABA issues on the underlying
> > futexes.  Unlike mutexes that have just three states, or semaphores that
> > have no tokens or a limited number of them, the state of a condvar is
> > the *order* of the waiters.  With a semaphore, we can grab whenever
> > there is a token, and block in the futex_wait when there is none.  With
> > condvars, a waiter must not block if there had been more
> > signals/broadcasts than waiters before it (ie, ordering matters).
> > 
> > Futex words in userspace (ie, those memory locations that control
> > whether futex_wait blocks or not) are 32b.  Therefore, we can have ABA
> > issues on it, which could lead to lost wake-ups because a waiter simply
> > can't distinguish between no new signals being sent and lots of signals
> > being sent (2^31 in this implementation).
> > 
> > It might be unlikely that this occurs, and needs a specific scenario,
> > but I'm not comfortable with just ignoring it -- knowingly.  Therefore,
> > this patch avoids the ABA issues by quiescing the condvar before an
> > overflow on the internal counters for the number of waiters /
> > signals-sent happens, and then resets the condvar.  This just increases
> > the number of spurious wake-ups while we do the reset on non-PI condvars
> > (but it is a problem for PI; see below).  The quiescence handling does
> > add complexity, but it seems not excessive relative to the rest of the
> > algorithm.
> >
That looks reasonable. I cannot imagine how to reach that number with
reasonable scheduler.
 
> > 
> > I've merged pthread_cond_timedwait.c into pthread_cond_wait.c because
> > they both share most of the code.  __pthread_cond_wait_common is
> > currently an always_inline, but we might also make that a noinline if
> > you're concerned over statically linked programs that use either the
> > timed or the non-timed cond_wait variant.
> >
I would be more worried about code size, removing inline could help
with cache usage. That doesn't have to be gain when most programs call just
wait but not timed_wait.
 
> > 
> > x86 had custom assembly implementations.  Given that this patch fixes a
> > correctness problem, I've just removed them.  Additionally, there's no
> > real fast-path in cond_wait unless perhaps if we want to consider just
> > spin for a signal to become available as a fast path; in all other
> > cases, we have to wait, so that's cache misses at least.  signal and
> > broadcast could be considered fast paths; the new code doesn't use an
> > internal lock anymore, so they should have become faster (e.g.,
> > cond_signal is just a CAS loop now and a call to futex_wait (that we
> > might avoid too with some more complexity).
> >
Nice I said last year that assembly makes no sense here. Only thing you
could microptimize is signal/broadcast without waiters. Otherwise you
consisder wait+signal pair for each syscall overhead will dominate 
few cycles that you tried to save by assembly.

> > 
> > This condvar doesn't yet use a requeue optimization (ie, on a broadcast,
> > waking just one thread and requeueing all others on the futex of the
> > mutex supplied by the program).  I don't think doing the requeue is
> > necessarily the right approach (but I haven't done real measurements
> > yet):
> > * If a program expects to wake many threads at the same time and make
> > that scalable, a condvar isn't great anyway because of how it requires
> > waiters to operate mutually exclusive (due to the mutex usage).  Thus, a
> > thundering herd problem is a scalability problem with or without the
> > optimization.  Using something like a semaphore might be more
> > appropriate in such a case.

I would focus more on cases where for example when you have four 
consumer threads that wait for data. I don't know what happens. You
could write LD_PRELOAD wrapper to see what program do with condition
variables and average number of waiters. There are plenty of binaries
that use these.

> > * The scalability problem is actually at the mutex side; the condvar
> > could help (and it tries to with the requeue optimization), but it
> > should be the mutex who decides how that is done, and whether it is done
> > at all.
> > * Forcing all but one waiter into the kernel-side wait queue of the
> > mutex prevents/avoids the use of lock elision on the mutex.  Thus, it
> > prevents the only cure against the underlying scalability problem
> > inherent to condvars.

How exactly that harms lock elision? 


> > * If condvars use short critical sections (ie, hold the mutex just to
> > check a binary flag or such), which they should do ideally, then forcing
> > all those waiter to proceed serially with kernel-based hand-off (ie,
> > futex ops in the mutex' contended state, via the futex wait queues) will
> > be less efficient than just letting a scalable mutex implementation take
> > care of it.  Our current mutex impl doesn't employ spinning at all, but
> > if critical sections are short, spinning can be much better.

That looks too complicate code much, how do you want to pass information
do differentiate between signal/broadcast?

> > * Doing the requeue stuff requires all waiters to always drive the mutex
> > into the contended state.  This leads to each waiter having to call
> > futex_wake after lock release, even if this wouldn't be necessary.
> >

That is most important point. My hypotesis is that mutex will almost
always be unlocked for threads after wake. Here question is how does
wake and scheduler interact. Here bad case would be effective wake that
could simultaneously schedule threads to free cores and all would
collide. A better deal would be if there would be 1000 cycle delay
between different cores as when next thread tried to lock previous
thread would be already done. So how that behaves in practice?

I expect that when unlocking happens reasonably fast a requeue could
help only for collision of first few threads, then it would be locked
only when next thread wakes up and there is enough noise in scheduling
to make cost of collisions less than always waking up.
 
> > Therefore, this condvar doesn't do requeue currently.  I'd like to get
> > your opinion on this.
> > Once we agree on a way forward, I'd either (1) adapt the condvar to use
> > requeue or (2) adapt the _cond_ variants of the lowlevellock and
> > pthread_mutex_* to not always drive the mutex into the contended state.
> > Here's a sketch of how we could implement requeue (IOW, make sure we
> > don't requeue to the wrong mutex):
> > * Use one bit (NEW_MUTEX_BIT or such) in signals_sent as a flag for
> > whether the mutex associated with the condvar changed.  Add proper
> > masking of it, adapt WSEQ_THRESHOLD accordingly, etc.
> > * Let waiters do this:
> >   if (mutex != cond->mutex) {
> >     atomic_store_relaxed (&cond->mutex, newmutex);
> >     atomic_fetch_or_release (&cond->signals_sent, NEW_MUTEX_BIT);
> >   }
> >   futex_wait(...)
> > * Let broadcast do:
> >   s = atomic_fetch_add_acquire (&cond->signals_sent, signals_to_add);
> >   if (s & NEW_MUTEX_BIT) /* reset the bit with a CAS, retry; */
> >   m = atomic_load_relaxed (&cond->mutex);
> >   futex_cmp_requeue (..., s + signals_to_add /* expected value */,
> >      ..., m /* expected mutex */
> > This would ensure that if a futex_wait on a new mutex comes in,
> > broadcast will grab the new mutex or futex_cmp_requeue will fail (it
> > will see the signals_sent update because of futex op ordering).
> > 
> > 
> > PI support is "kind of" included.  There is no internal lock anymore, so
> > the thing that Darren proposed the fix for is gone.  So far so good;
> > however, we don't requeue, and Darren's paper states that requeue would
> > yield better latency in the PI scenario (is this still the case?).
> >
You have problem that when kernel keeps FIFO api requeue gives you fairness while
with waking everything a important thread could be buried by lower
priority threads that after each broadcast do something small and wait
for broadcast again. 

 
> > Darren, I'd propose that we figure out how to best adapt this new
> > implementation to do PI.  I'm looking forward to your comments.
> > 
> > One thing I don't think we'll be able to solve is ensuring PI during
> > quiescence.  When doing quiescence, we need for all waiters to go out of
> > the condvar, so confirm their wake-up.  I can't see a way of boosting
> > their priorities if they get suspended between releasing the mutex and
> > starting to wait; there's no app lock they still hold, and we can't give
> > wwaiters per-waiter locks linked from the condvar that we could use to
> > boost individual threads because this doesn't work in the process-shared
> > case.  Maybe there's something I'm missing (but I though a while about
> > it ;), and maybe there's some PI-futex functionality that I wasn't aware
> > of that we could (ab)use.
> > Thus, the most practical approach seems to be to just not do any PI
> > during quiescence (ie, every 2^31 cond_wait calls).  Any alternative
> > suggestions?
> > 

No, not doing anything looks as best solution, even if there would be
solution it would make code more complex. As situation is equivalent to
sequence to first send broadcast to wake all waiters, then each waiter
will call wait again a PI algorithm should stabilize quickly after that.


> > This problem would go away if we had 64b futexes because then we
> > wouldn't need quiescence anymore (assuming 64b counters avoid ABA).
> > 
> > 
> > Tested on x86_64-linux and x86-linux.
> >

Follow-Ups:
- Re: [PATCH] New condvar implementation that provides stronger ordering guarantees.
  - From: Torvald Riegel

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]