This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] New condvar implementation that provides stronger ordering guarantees.
- From: Rich Felker <dalias at libc dot org>
- To: Torvald Riegel <triegel at redhat dot com>
- Cc: GLIBC Devel <libc-alpha at sourceware dot org>, Marcus Shawcroft <marcus dot shawcroft at gmail dot com>, "Joseph S. Myers" <joseph at codesourcery dot com>, Richard Henderson <rth at redhat dot com>, Carlos O'Donell <codonell at redhat dot com>, Mike Frysinger <vapier at gentoo dot org>, David Holsgrove <david dot holsgrove at xilinx dot com>, Chung-Lin Tang <chunglin_tang at mentor dot com>, Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>, Andreas Krebbel <krebbel at linux dot ibm dot com>, Kaz Kojima <kkojima at rr dot iij4u dot or dot jp>, Chris Metcalf <cmetcalf at tilera dot com>, David Miller <davem at davemloft dot net>, Darren Hart <dvhart at infradead dot org>
- Date: Sun, 22 Feb 2015 17:37:22 -0500
- Subject: Re: [PATCH] New condvar implementation that provides stronger ordering guarantees.
- References: <1424456307 dot 20941 dot 122 dot camel at triegel dot csb>
On Fri, Feb 20, 2015 at 07:18:27PM +0100, Torvald Riegel wrote:
> + Limitations:
> + * This condvar isn't designed to allow for more than
> + WSEQ_THRESHOLD * (1 << (sizeof(GENERATION) * 8 - 1)) calls to
> + __pthread_cond_wait. It probably only suffers from potential ABA issues
> + afterwards, but this hasn't been checked nor tested.
> + * More than (1 << (sizeof(QUIESCENCE_WAITERS) * 8 - 1)) concurrent waiters
> + are not supported.
> + * Beyond what is allowed as errors by POSIX or documented, we can also
> + return the following errors:
> + * EPERM if MUTEX is a recursive mutex and the caller doesn't own it.
This is not beyond POSIX; it's explicitly specified as a "shall fail".
> + * EOWNERDEAD or ENOTRECOVERABLE when using robust mutexes. Unlike
> + for other errors, this can happen when we re-acquire the mutex; this
> + isn't allowed by POSIX (which requires all errors to virtually happen
> + before we release the mutex or change the condvar state), but there's
> + nothing we can do really.
Likewise these are "shall fail" errors specified by POSIX, and while
it's not clearly written in the specification, it's clear that they
only happen on re-locking.
One ugly case I don't think you're prepared to handle is when
EOWNERDEAD happens in the cancellation cleanup path. There's no way
for the caller to know this error happened, and therefore no way for
it to recover the state and make it consistent before unlocking the
mutex.
The new cancellation state I'm proposing (PTHREAD_CANCEL_MASKED; see
http://www.openwall.com/lists/musl/2015/02/22/1 for details) makes it
possible to solve this problem by refusing to act on cancellation
(leaving it pending), and instead returning as if by a spurious wake
occurring just before cancellation, so that the error can be reported.
> + * EAGAIN if MUTEX is a recursive mutex and trying to lock it exceeded
> + the maximum number of recursive locks. The caller cannot expect to own
> + MUTEX.
This can't happen since the lock count was already decremented once
and is only incremented back to its original value. (And in practice
the original value has to be exactly 1 or other threads could never
modify the state on which the predicate depends.)
> + * When using PTHREAD_MUTEX_PP_* mutexes, we can also return all errors
> + returned by __pthread_tpp_change_priority. We will already have
> + released the mutex in such cases, so the caller cannot expect to own
> + MUTEX.
This seems correct. This is another case where you would want to be
able to suppress acting upon cancellation and instead cause a spurious
wake.
> + Other notes:
> + * Instead of the normal mutex unlock / lock functions, we use
> + __pthread_mutex_unlock_usercnt(m, 0) / __pthread_mutex_cond_lock(m)
> + because those will not change the mutex-internal users count, so that it
> + can be detected when a condvar is still associated with a particular
> + mutex because there is a waiter blocked on this condvar using this mutex.
I don't follow. What is this users count? Is it the recursive lock
count or something else? Why do you want to be able to detect that
it's associated with a cv? Note that it could be associated with more
than one cv.
Rich