[RFC] mutex destruction (#13690): problem description and workarounds
- From: Torvald Riegel <triegel at redhat dot com>
- To: GLIBC Devel <libc-alpha at sourceware dot org>
- Date: Fri, 04 Apr 2014 16:20:30 +0200
- Subject: [RFC] mutex destruction (#13690): problem description and workarounds
The problem at hand is that the glibc implementation assumes that a
mutex can be destroyed as soon as no thread references it anymore, in
the sense that no function calls that use the mutex are still
executing.
This seems to be allowed by the POSIX specs based on current wording.
However, POSIX also has a non-normative example that relies on a
stronger guarantee (or, weaker precondition) for mutex destruction: That
a mutex_unlock does not "reference" a mutex anymore once it can be
locked by another thread. When relying on this stronger guarantee,
programmers can destroy mutexes as soon as they can lock and unlock
them. The clarification request with the Austin Group is still pending:
http://austingroupbugs.net/view.php?id=811
Nonetheless, there's some indication that they will clarify towards the
stronger guarantee (so that the example would work). These semantics
are also required by C++11 mutexes. C11 doesn't really specify this
(the condition that it gives is insufficient in that it doesn't cover
other cases either).
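To illustrate the stronger guarantee, here is a minimal sketch of the
reference-counting pattern it permits (a hypothetical example along the
lines of the POSIX one, not code from the spec):

  #include <pthread.h>
  #include <stdlib.h>

  struct obj {
      pthread_mutex_t lock;
      int refcount;
  };

  /* Called by each thread holding a reference.  Under the stronger
     guarantee, the thread that drops the last reference may destroy
     the mutex right away, even though another thread's
     pthread_mutex_unlock call might not have fully returned yet.  */
  void unref(struct obj *o)
  {
      pthread_mutex_lock(&o->lock);
      int last = (--o->refcount == 0);
      pthread_mutex_unlock(&o->lock);
      if (last) {
          /* No thread can lock o->lock anymore, but another thread
             may still be executing inside pthread_mutex_unlock.  */
          pthread_mutex_destroy(&o->lock);
          free(o);  /* The memory may now be reused.  */
      }
  }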
Thus, I think it's useful to start thinking about what we're going to do
when we get the stronger guarantee. Likewise, if we want to (continue to)
implement C++11 mutexes on top of pthread mutexes, we'll have to do
something.
=== Details of the problem
#13690 reports on possible accesses to the memory previously occupied by
a destroyed mutex, which are illegal assuming the stronger guarantee.
What happens is that the mutex implementation looks at the mutex kind
after it has unlocked the mutex; it does so to compute the right
arguments to a futex wake-up call.
If a thread destroys a mutex and then reuses this piece of memory, we
will thus have (read) accesses to either unrelated state or
unmapped/mprotect'ed memory, potentially resulting in segfaults.
This first sub-issue would be straightforward to fix, but it's actually
the smaller issue here. The bigger one is that we have a pending
FUTEX_WAKE syscall on a memory location that can be reused for something
else. "Pending" here means that a thread can unlock the futex var (i.e.,
mutex.__data.__lock) and thus make the mutex available to be locked
and/or destroyed by other threads, be suspended for whatever reason,
and then at any later time resume and issue the futex syscall.
If the memory isn't accessible anymore (e.g., unmapped), the pending
futex syscall will just fail and nothing bad happens. Likewise, if the
memory location is not associated with any other futex at that later
time, then nothing bad will happen. However, if the memory is reused
for another futex, then we'll get a spurious FUTEX_WAKE call for
completely unrelated futexes.
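In simplified form, the problematic unlock path looks roughly like this
(a sketch of the logic described above, not the actual glibc source; the
mutex layout and the shared-kind flag are made up for illustration):

  #include <linux/futex.h>
  #include <stdatomic.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Hypothetical simplified mutex: lock is 0 = free, 1 = locked,
     2 = locked with possible waiters.  */
  #define MUTEX_SHARED_BIT 128  /* made-up kind flag */
  struct mutex { atomic_int lock; int kind; };

  void mutex_unlock(struct mutex *m)
  {
      if (atomic_exchange(&m->lock, 0) == 2) {
          /* Sub-issue 1: the lock is already released at this point,
             so reading the mutex kind here can access memory that
             has been reused for something else.  */
          int op = FUTEX_WAKE;
          if (!(m->kind & MUTEX_SHARED_BIT))
              op |= FUTEX_PRIVATE_FLAG;
          /* Sub-issue 2: this thread can be suspended right here and
             issue the wake-up at an arbitrarily later time, hitting
             whatever the memory is used for by then.  */
          syscall(SYS_futex, &m->lock, op, 1, NULL, NULL, 0);
      }
  }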
This problem is not specific to mutexes, but to the combination of the
design of the non-PI futex operations and memory reuse. There are
similar bugs for semaphores (#12674) and barriers (#13605, but there are
further constraints on when destruction is allowed, which are related to
signals).
PI mutexes are not affected because there, the kernel resets the futex
var in FUTEX_UNLOCK_PI atomically wrt. any FUTEX_LOCK_PI calls; IOW, if
there are threads waiting for the lock that might need to be unblocked,
there will be no pending futex wake-up calls.
=== Workaround 1: Specify that spurious FUTEX_WAKE calls are possible
FUTEX_WAIT is currently specified to return 0 if it got woken by a
FUTEX_WAKE call, and EINTR when wake-up happened due to a signal or
other spurious wakeup. Memory reuse thus possibly allows unrelated
FUTEX_WAKE calls to be injected into the code running a new futex at
the same address. If this code relies on the FUTEX_WAIT return
values to be correct (just regarding 0 vs. EINTR!), the pending
unrelated FUTEX_WAKE calls can break it.
However, the existence of such code should be fairly unlikely:
* All futexes that are used for multiple wait/wake-up cycles (like
pthread mutexes, barriers, semaphores, ...) will have to be robust to
their *own* pending FUTEX_WAKEs anyway because this is how futexes are
designed (see the sketch after this list).
* PI futexes do not seem to be woken by FUTEX_WAKE according to some
quick testing: FUTEX_LOCK_PI simply doesn't wake up and FUTEX_WAKE
returns success, whereas FUTEX_WAIT_REQUEUE_PI isn't woken either but
FUTEX_WAKE returns EINVAL (likely because the second futex passed in
is NULL, which we'd perhaps need to check in the glibc code). I
wasn't able to find any specification for these cases, however.
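For illustration, here's what being robust to pending (or spurious)
FUTEX_WAKEs means in practice; this is essentially the well-known futex
mutex from Ulrich Drepper's "Futexes Are Tricky", not glibc's actual
code. Correctness never depends on FUTEX_WAIT's return value; the futex
var itself is always rechecked:

  #include <linux/futex.h>
  #include <stdatomic.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static long futex_wait(atomic_int *addr, int expected)
  {
      return syscall(SYS_futex, addr, FUTEX_WAIT, expected,
                     NULL, NULL, 0);
  }

  /* 0 = free, 1 = locked, 2 = locked with possible waiters.  */
  void mutex_lock(atomic_int *futex_var)
  {
      int c = 0;
      if (atomic_compare_exchange_strong(futex_var, &c, 1))
          return;  /* uncontended fast path */
      if (c != 2)
          c = atomic_exchange(futex_var, 2);
      while (c != 0) {
          /* Whether this returns 0 (woken), EINTR, or EWOULDBLOCK --
             possibly due to an unrelated, pending FUTEX_WAKE -- we
             just recheck the futex var.  */
          futex_wait(futex_var, 2);
          c = atomic_exchange(futex_var, 2);
      }
  }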
I've reviewed glibc uses of FUTEX_WAIT (x86_64 and/or generic) and they
should all be robust to pending FUTEX_WAKEs. Likewise for GCC's libgomp
and libitm. I suppose that cases where there might be pending
FUTEX_CMP_REQUEUE calls (e.g., condvars) should be containable by using
a lock (and FUTEX_WAKE) at destruction.
The case that could be affected by such a specification change is thus
single-wake-up futexes in programs that themselves never make those
futexes subject to the pending-FUTEX_WAKE-while-memory-reuse problem.
(I'm mentioning the latter because it's really not just glibc that's
affected, but futex users in general. So, arguably, correct programs
that use futexes have to deal with the issue already.)
Pros:
* Little code change required in glibc to fix the problems.
* Typical mutex-using code shouldn't be affected by the specification
change (assuming that glibc/libgomp/libitm are representative).
Cons:
* Change to futex specs.
=== Workaround 1a: New FUTEX_WAKE_SPURIOUS operation that avoids the
specification change
This is like Workaround 1, except that the kernel could add a new futex
op that works like FUTEX_WAKE except that:
* FUTEX_WAITs woken up by a FUTEX_WAKE_SPURIOUS will always return
EINTR. EINTR for spurious wakeups is already part of the spec, so
correct futex users are already handling this (e.g., glibc does).
* Make sure (and specify) that FUTEX_WAKE_SPURIOUS calls that hit other
futexes (e.g., PI ones) are ignored and don't cause wake-ups (or cause
only the benign spurious wakeups that are already specified).
Users of FUTEX_WAKE_SPURIOUS should have to change very little compared
to users of FUTEX_WAKE. The only thing that they don't have anymore is
the ability to distinguish between a real wakeup and a spurious one.
Single-use FUTEX_WAITs could be affected, but we don't have them in
glibc. The only other benefit from being able to distinguish between
real and spurious is in combination with a timeout: If the wake-up is
real on a single-use futex, there's no need to check timeouts again.
But will programs want to use this often, and would they then need to
use FUTEX_WAKE_SPURIOUS in this case? I guess not.
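If such an operation existed, the wake-up side of the unlock path would
only change its opcode; a sketch (FUTEX_WAKE_SPURIOUS and its opcode
number are hypothetical):

  #include <stdatomic.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #define FUTEX_WAKE_SPURIOUS 13  /* hypothetical opcode */

  void mutex_unlock(atomic_int *futex_var)
  {
      if (atomic_exchange(futex_var, 0) == 2)
          /* Any waiter woken by this call would see EINTR rather
             than 0; if the wake-up hits an unrelated futex after
             memory reuse, it is ignored or benignly spurious.  */
          syscall(SYS_futex, futex_var, FUTEX_WAKE_SPURIOUS, 1,
                  NULL, NULL, 0);
  }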
Pros:
* Correct futex uses will need no changes.
Cons:
* Needs a new futex operation.
=== Workaround 2: New FUTEX_UNLOCK operation that makes resetting the
futex var and unblocking other threads atomic wrt. other FUTEX_WAIT ops
This is like UNLOCK_PI, except for not doing PI. Variations of this new
futex op could store user-supplied values, or do a compare-and-set (or
similar) read-modify-write operation. FUTEX_WAKE_OP could be used as
well, but we don't need to wake another futex (unnecessary overhead).
(I haven't checked the kernel's FUTEX_WAKE_OP implementation, and there
might be reasons why it can't be used as is, e.g., due to how things
are locked regarding the second futex.)
The biggest performance drawback that I see is a potential increase in
the latency of unlocking any thread (blocked *or* spinning) when any
thread is blocked. This is because we'll have to ask the kernel to
reset the futex var (instead of, like now, userspace doing it), which
means that we'll have to enter the kernel first before a spinning thread
can get the okay to acquire the lock. This could decrease lock
scalability for short critical sections in particular because those
effectively get longer.
I don't think it's sufficient to merely count the number of waiting
blocked threads in the futex var to get around pending FUTEX_WAKE calls.
If there is *any* potentially blocked waiter, we'll have to use the
kernel to reset the futex var.
Perhaps this could be mitigated if we'd do a lot more spinning in the
futexes, so that it's unlikely to slow down spinning waiters just
because there's some blocked thread. For blocked threads, the slowdown
should be smaller because, if a waiter is going to block, there's just a
small time window in which the FUTEX_WAIT will actually fail
(EWOULDBLOCK) due to the futex var changing concurrently.
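A sketch of what the unlock path could look like with such an operation
(FUTEX_UNLOCK and its opcode are hypothetical): userspace still resets
the futex var itself when no waiter can be blocked, and otherwise lets
the kernel do the reset and the wake-up atomically:

  #include <stdatomic.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #define FUTEX_UNLOCK 14  /* hypothetical opcode */

  /* 0 = free, 1 = locked, 2 = locked with possible waiters.  */
  void mutex_unlock(atomic_int *futex_var)
  {
      int c = 1;
      /* Fast path: no potentially blocked waiter; reset the futex
         var in userspace as today.  */
      if (atomic_compare_exchange_strong(futex_var, &c, 0))
          return;
      /* Slow path: the kernel resets the futex var to 0 and wakes
         one waiter atomically wrt. concurrent FUTEX_WAIT calls, so
         no wake-up can still be pending once the mutex has become
         available for locking or destruction.  */
      syscall(SYS_futex, futex_var, FUTEX_UNLOCK, 1, NULL, NULL, 0);
  }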
Pros:
* Correct futex uses will need no changes.
Cons:
* glibc implementation will have to change (mutexes, semaphores,
barriers, perhaps condvars).
* Needs a new futex operation (or might use FUTEX_WAKE_OP with some
performance penalty).
* Potential decrease in lock scalability unless it can be mitigated by
more aggressive spinning.
=== Workaround 3: Tell mutex destruction about pending FUTEX_WAKEs
This aims at enabling destruction to wait for pending FUTEX_WAKEs to
finish. There are two places to keep a pending-wake-up flag (or a mix
of both):
1) In the futex var or mutex itself. Might lead to more memory
contention on contended mutexes because there are more modifications to
the mutex than currently (and concurrently with any spinning).
2) In thread-local flags used for pending FUTEX_WAKEs to any mutex
(there can only be one in flight concurrently per thread). This
requires mutex destruction to scan through the flags of all threads.
If destruction actually has to block until pending FUTEX_WAKEs have
finished (which I think we should do; just spinning might be too risky
regarding latencies), then this blocking would need to happen using a
futex in non-reused memory, to avoid the same futex problem from
happening elsewhere.
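A sketch of variant 2 (all names hypothetical; a fixed-size registry
stands in for glibc's thread list, and destruction spins instead of
blocking on a futex in non-reused memory, for brevity):

  #include <linux/futex.h>
  #include <sched.h>
  #include <stdatomic.h>
  #include <stdint.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #define MAX_THREADS 64  /* stand-in for glibc's thread list */
  static atomic_uintptr_t pending_wake[MAX_THREADS];

  /* Unlock: advertise the potential wake-up *before* releasing the
     lock, so that a destroying thread cannot miss it.  */
  void mutex_unlock(atomic_int *futex_var, int self)
  {
      atomic_store(&pending_wake[self], (uintptr_t)futex_var);
      if (atomic_exchange(futex_var, 0) == 2)
          syscall(SYS_futex, futex_var, FUTEX_WAKE, 1, NULL, NULL, 0);
      atomic_store(&pending_wake[self], 0);
  }

  /* Destruction: wait until no thread has a wake-up in flight for
     this futex var anymore.  */
  void mutex_destroy_wait(atomic_int *futex_var)
  {
      for (int t = 0; t < MAX_THREADS; t++)
          while (atomic_load(&pending_wake[t]) == (uintptr_t)futex_var)
              sched_yield();
  }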
Pros:
* No changes to futexes.
Cons:
* Needs changes in glibc for most uses of FUTEX_WAKE.
* Performance concerns for either contended mutexes or mutex
destruction.
=== Workaround 4: Only use futexes in non-reused memory
This doesn't avoid the pending FUTEX_WAKEs, but makes them manageable by
redirecting any blocking on things like mutexes to separate blocking
constructs. For example, we could have per-thread futexes for blocking
(each thread will only block on one thing concurrently) in memory that
won't be reused for anything but such futexes. This could require some
more work at thread destruction, but should be doable I guess.
The normal synchronization constructs would then install pointers to
those futexes (or the enclosing data structures) when they actually
block, and potentially build lists/queues of those. Thus, this would
kind of duplicate what the kernel is doing for futexes, but we'd get all
the flexibility we'd need to build whatever we need (e.g., to solve the
pending FUTEX_WAKE problem, to have 64-bit futexes, etc.), and to
optimize it as we see fit (using information that glibc might have but
the kernel doesn't).
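A sketch of the blocking building block (all names hypothetical): each
thread parks on its own futex, which lives in memory that is never
reused for anything else, so a pending FUTEX_WAKE can only ever hit
that thread's own parking futex:

  #include <linux/futex.h>
  #include <stdatomic.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  struct waiter {
      atomic_int futex;     /* 0 = parked, 1 = signaled */
      struct waiter *next;  /* queue link owned by the enclosing lock */
  };
  static _Thread_local struct waiter self_waiter;

  /* Block the calling thread until unpark() is called on its waiter.
     A mutex would enqueue &self_waiter on contention, and the
     unlocking thread would dequeue and unpark one waiter.  */
  static void park(struct waiter *w)
  {
      while (atomic_load(&w->futex) == 0)
          syscall(SYS_futex, &w->futex, FUTEX_WAIT, 0, NULL, NULL, 0);
      atomic_store(&w->futex, 0);  /* reset for the next use */
  }

  static void unpark(struct waiter *w)
  {
      atomic_store(&w->futex, 1);
      syscall(SYS_futex, &w->futex, FUTEX_WAKE, 1, NULL, NULL, 0);
  }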
The performance implications of this are nontrivial to estimate. On the
one hand, the fast paths (spinning) aren't affected, and we'd have more
control over spinning and wakeups than with the current scheme of doing
mostly (single-)futex-var-based synchronization. On the other hand, if
we take over prioritization of what thread to unblock first, this also
means that we're taking flexibility and choice away from the kernel, so
the kernel can't look at its information about scheduling to actually
make a choice of what to unblock (unless we add new futex ops).
Similar approaches to blocking do work for other languages, so it should
be doable.
Fitting PI into this could be nontrivial, though (but I haven't really
thought about this aspect, so it's just a guess that it might be).
Pros:
* No changes to futexes.
* Lots of control for userspace.
Cons:
* Lots of things would have to be adapted in how glibc does
synchronization.
=== Summary / RFC
IMO, workarounds 1, 1a, or 2 might be the best ones, although 4 might
also be good.
Do you see any other options, or have comments on those above?
We have some time to decide I think, but we need to start building an
agreed-upon direction now, because it affects plenty of glibc
synchronization. For example, before looking at improving rwlock
scalability, it would be good to have settled on an approach to the
destruction/futex problem.
Thanks in advance for any feedback.