This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] New condvar implementation that provides stronger ordering guarantees.


On Thu, 2015-07-02 at 23:48 +0200, Ondřej Bílka wrote:
> On Thu, Jul 02, 2015 at 03:25:02PM +0200, Torvald Riegel wrote:
> > > > > * The scalability problem is actually at the mutex side; the condvar
> > > > > could help (and it tries to with the requeue optimization), but it
> > > > > should be the mutex that decides how that is done, and whether it is done
> > > > > at all.
> > > > > * Forcing all but one waiter into the kernel-side wait queue of the
> > > > > mutex prevents/avoids the use of lock elision on the mutex.  Thus, it
> > > > > prevents the only cure against the underlying scalability problem
> > > > > inherent to condvars.
> > > 
> > > How exactly does that harm lock elision?
> > 
> > You serialize the woken-up waiters, and lock elision tries to run
> > non-conflicting critical sections in parallel.  Thus, serializing
> > up-front prevents lock elision from trying to run them in parallel.
> >
> Which doesn't help when waiters cause the lock to be contended.

That's why I wrote that if we decide to not requeue, we would remove the
hack that puts the lock into contended state.
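
To make the "contended state" point concrete, here is a minimal sketch
(my own illustration, not glibc's actual code) of the classic futex-based
mutex from Ulrich Drepper's "Futexes Are Tricky", together with the
condvar-side acquisition that always enters the contended state so that
the subsequent unlock is forced to call futex_wake:

/* Sketch only; simplified and not the glibc implementation.
   Lock word states: 0 = unlocked, 1 = locked, 2 = locked with waiters.  */
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static void futex_wait (atomic_int *f, int expected)
{
  syscall (SYS_futex, f, FUTEX_WAIT_PRIVATE, expected, NULL, NULL, 0);
}

static void futex_wake_one (atomic_int *f)
{
  syscall (SYS_futex, f, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}

/* Normal acquisition: 0 -> 1 on the fast path; block in state 2 otherwise.  */
static void mutex_lock (atomic_int *m)
{
  int c = 0;
  if (atomic_compare_exchange_strong (m, &c, 1))
    return;
  if (c != 2)
    c = atomic_exchange (m, 2);
  while (c != 0)
    {
      futex_wait (m, 2);
      c = atomic_exchange (m, 2);
    }
}

/* Acquisition as used after being requeued to the mutex: it always ends
   up in state 2, which forces the eventual unlock to call futex_wake even
   when no waiter is actually left.  This is the part that could go away
   if we stop requeuing.  */
static void mutex_cond_lock (atomic_int *m)
{
  while (atomic_exchange (m, 2) != 0)
    futex_wait (m, 2);
}

static void mutex_unlock (atomic_int *m)
{
  if (atomic_exchange (m, 0) == 2)
    futex_wake_one (m);
}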

> > > > > * If condvars use short critical sections (ie, hold the mutex just to
> > > > > check a binary flag or such), which they should do ideally, then forcing
> > > > > all those waiters to proceed serially with kernel-based hand-off (ie,
> > > > > futex ops in the mutex's contended state, via the futex wait queues) will
> > > > > be less efficient than just letting a scalable mutex implementation take
> > > > > care of it.  Our current mutex impl doesn't employ spinning at all, but
> > > > > if critical sections are short, spinning can be much better.
> > > 
> > > That looks like it would complicate the code too much; how do you want
> > > to pass the information to differentiate between signal and broadcast?
> > 
> > I don't understand your question.  How does it relate to my paragraph
> > you replied to?
> >
> A problem is that a mutex doesn't just protect short critical sections;
> for example, a C++ monitor with another method that holds the lock for a
> long time.

Yes, we may not just be dealing with short critical sections.  However,
if we have long critical sections on the waiter side, and broadcast or
many signals wake many waiters, then some potential slowdown due to
contention is less important because the program isn't scalable in the
first place (ie, has many long critical sections that need to be
serialized).  Specifically, one of the long critical sections will get
the lock; while it has it, the others will sort things out, in parallel
with the critical section that is running.  That means we may do some
extra work, but it won't slow down the critical section that has the
lock.
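
For reference, the short-critical-section pattern I have in mind is the
canonical one below (a hypothetical example, not code from the patch):
the mutex protects only a binary flag, and everything expensive happens
outside the critical section, which is exactly the case where forcing a
serialized, kernel-based hand-off hurts most.

/* Minimal example of the short-critical-section condvar pattern.  */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool data_ready = false;

static void *waiter (void *arg)
{
  pthread_mutex_lock (&lock);
  /* Short critical section: just check and reset the flag.  */
  while (!data_ready)
    pthread_cond_wait (&cond, &lock);
  data_ready = false;
  pthread_mutex_unlock (&lock);
  /* ... consume the data outside the critical section ...  */
  return NULL;
}

static void signaler (void)
{
  /* ... produce the data outside the critical section ...  */
  pthread_mutex_lock (&lock);
  data_ready = true;
  pthread_mutex_unlock (&lock);
  pthread_cond_signal (&cond);
}

int main (void)
{
  pthread_t t;
  pthread_create (&t, NULL, waiter, NULL);
  signaler ();
  pthread_join (t, NULL);
  return 0;
}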

> > > > > * Doing the requeue stuff requires all waiters to always drive the mutex
> > > > > into the contended state.  This leads to each waiter having to call
> > > > > futex_wake after lock release, even if this wouldn't be necessary.
> > > > >
> > > 
> > > That is the most important point. My hypothesis is that the mutex will
> > > almost always be unlocked for threads after a wake.
> > 
> > Well, that depends on how both the mutex and the condvar are used,
> > actually.
> > 
> I would say that scheduling has more impact than use case as I tried to
> write below.
> 
> > > Here the question is how the wake and the scheduler interact. A bad
> > > case would be an effective wake that could simultaneously schedule
> > > threads to free cores and they would all collide. A better deal would
> > > be if there were a 1000-cycle delay between different cores, as by the
> > > time the next thread tried to lock, the previous thread would already
> > > be done.
> > 
> > I don't understand those sentences.
> > 
> I modeled the case where threads don't block after a broadcast until they
> release the lock.
> 
> Initially, threads occupy k free cores and there is contention.

What kind of contention do you mean?  Contention for free compute
resources, or contention on a lock (ie, the high-level resource), or
contention on a memory location (eg, many concurrent acquisition
attempts on the same lock)?

> After k threads acquire the lock, the situation stabilizes as each core
> runs its thread until it blocks. Contention becomes less likely because
> you need two threads to collide within an interval smaller than the time
> it takes one thread to get the lock, do its work, and release it. If, in
> the initial phase, threads are not started simultaneously but with some
> delay, contention could be lower because a thread would have already
> completed its job.
> 
> > > So how does that behave in practice?
> > 
> > I don't understand this question.
> >
> I just wanted to know how often the lock takes the fast path after a
> broadcast.

I still don't understand what you're trying to ask, sorry.

Whether the lock implementation sees an uncontended lock is really up to
the workload.  And even then the lock itself can spin for a while (which
we don't really do, even ADAPTIVE_MUTEX has a very simplistic spinning)
to not actually block via futexes (ie, the slow path).  So, this is more
about the lock implementation and the workload than about what the
condvar is doing.  The existing condvar code tries to be nicer to the
existing lock implementation, but, as I wrote, I don't think this is really
effective; it also prevents the use of lock elision, and the underlying
problem is likely better addressed by improving the scalability of our lock
implementation.
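
To illustrate the kind of spinning I mean, here is a rough sketch (again
just my illustration with an arbitrary spin count, not glibc code) of an
acquisition that spins on the lock word for a bounded time before falling
back to blocking in the kernel, using the same 0/1/2 states as above:

#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

#define MAX_SPIN 100  /* Arbitrary bound, purely for illustration.  */

static void spin_then_block_lock (atomic_int *m)
{
  int c;
  for (int i = 0; i < MAX_SPIN; i++)
    {
      c = 0;
      if (atomic_compare_exchange_weak (m, &c, 1))
        return;  /* Acquired while spinning; no futex syscall needed.  */
    }
  /* Slow path: mark the lock as contended and block.  */
  c = atomic_exchange (m, 2);
  while (c != 0)
    {
      syscall (SYS_futex, m, FUTEX_WAIT_PRIVATE, 2, NULL, NULL, 0);
      c = atomic_exchange (m, 2);
    }
}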

Does that answer your question?

> > > > > PI support is "kind of" included.  There is no internal lock anymore, so
> > > > > the thing that Darren proposed the fix for is gone.  So far so good;
> > > > > however, we don't requeue, and Darren's paper states that requeue would
> > > > > yield better latency in the PI scenario (is this still the case?).
> > > > >
> > > The problem is that when the kernel keeps FIFO order, requeue gives you
> > > fairness, while with waking everything an important thread could be
> > > buried by lower-priority threads that, after each broadcast, do something
> > > small and wait for the broadcast again.
> > 
> > If you wake several threads, then those with the highest priority will
> > run.
> 
> While they will run, that doesn't mean that they will win the race. For
> example, you have a 4-core CPU, 3 low-priority threads, and one
> high-priority thread. Each will check the condition. If it is true, it
> will consume the data, causing the condition to become false. If the
> condition is not met, it will wait again. Then the high-priority thread
> has only a 1/4 chance of being the first to get the lock and consume the
> data. With requeue it could always do it.

No, if the lock is of a PI kind, the lower-prio thread will acquire it and
put its TID in as the lock owner, which then allows a higher-prio thread to
boost the priority of the lower-prio thread.  So, the locks themselves are
fine.  There is no guarantee that a higher-prio thread will be able to grab
a signal when a lower-prio thread tries to grab one concurrently, but
that's not something that the condvar is required to guarantee.  (Note,
though, that what the new algorithm fixes is that when a waiter is eligible
to consume a signal (and there's no other thread that's also eligible and
could grab it first), a waiter that starts waiting after the signal (ie,
isn't eligible) can no longer steal the signal.)
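
To spell out the PI part: with a PI futex (see futex(2)), the futex word
holds the owner's TID, which is what lets the kernel find the owner and
boost its priority when a higher-priority thread blocks on the lock.  A
simplified sketch of that protocol (not glibc's PTHREAD_PRIO_INHERIT
code, just the underlying idea):

#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static void pi_lock (atomic_uint *m)
{
  unsigned int tid = syscall (SYS_gettid);
  unsigned int expected = 0;
  /* Fast path: CAS the futex word from 0 (unowned) to our TID.  */
  if (atomic_compare_exchange_strong (m, &expected, tid))
    return;
  /* Slow path: the kernel records us as a waiter, boosts the current
     owner's priority if necessary, and hands the lock over on unlock.  */
  syscall (SYS_futex, m, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}

static void pi_unlock (atomic_uint *m)
{
  unsigned int tid = syscall (SYS_gettid);
  unsigned int expected = tid;
  /* Fast path: no waiters, so the word is still just our TID.  */
  if (atomic_compare_exchange_strong (m, &expected, 0))
    return;
  /* Slow path: FUTEX_WAITERS is set; let the kernel pass the lock to the
     highest-priority waiter.  */
  syscall (SYS_futex, m, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}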

