This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [RFC] mutex destruction (#13690): problem description and workarounds


On 04/04/2014 10:20 AM, Torvald Riegel wrote:
> The problem at hand is that the glibc implementation assumes that
> mutexes can be destroyed as soon as no thread is referencing a mutex
> anymore in the sense of no function calls that use the mutex being
> executed anymore.

Is the difficulty with the definition the meaning of "referenced"?
There is no way to talk about being referenced with respect to other
operations: when does the reference start, when does it end, and so on.

> This seems to be allowed by the POSIX specs based on current wording.
> However, POSIX also has a non-normative example that relies on a
> stronger guarantee (or, weaker precondition) for mutex destruction: That
> a mutex_unlock does not "reference" a mutex anymore once it can be
> locked by another thread.  When relying on this stronger guarantee,
> programmers can destroy mutexes as soon as they can lock and unlock
> them.  The clarification request with the Austin Group is still pending:
> http://austingroupbugs.net/view.php?id=811

The clarification was made as of June 19th 2014.

> Nonetheless, there's some indication that they will clarify towards the
> stronger guarantee (so that the example would work).  These semantics
> are also required by C++11 mutexes.  C11 doesn't really specify this
> (the condition that it gives is insufficient in that it doesn't cover
> other cases either).

OK.

> Thus, I think it's useful to start thinking what we're going to do when
> we get the stronger guarantee. Likewise, if we want to (continue to)
> implement C++11 mutexes on top of pthread mutexes, we'll have to do
> something.

We have the stronger guarantee, so I think we can move forward.
 
> === Details of the problem
> 
> #13690 reports on possible accesses to the memory previously occupied by
> a destroyed mutex, which are illegal assuming the stronger guarantee.
> What happens is that the mutex implementation looks at the mutex kind
> after it has unlocked the mutex; it does so to compute the right
> arguments to a futex wake-up call.
> If a thread destroys a mutex, and then reuses this piece of memory, we
> will thus have (read) accesses to either other state or
> not-mapped/mprotected memory, potentially resulting in segfaults.

Agreed.

> This first sub-issue would be straightforward to fix, but it's actually
> the smaller issue here.  The bigger one is that we have a pending
> FUTEX_WAKE syscall on a memory location that can be reused for something
> else.  "Pending" here means that a thread can unlock the futex var (ie,
> mutex.__data.__lock) and thus make the mutex available to be locked
> and/or destructed by other threads, be suspended for whatever reason,
> and then at any time later, resume and issue the futex syscall.

This is always the problem with a lock that protects access to its own
destruction.

One can quickly grow a lock hierarchy that ends in an unfreeable object
and use that to control access to freeing and allocating mutexes, but
that approach doesn't scale IMO.

AFAIK the only good solutions are to GC the objects or to change the
semantics of the API.
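
To make the race concrete, here is a minimal sketch of an unlock path
along the lines described above (struct fields, constants, and function
names are made up for illustration; this is not the actual glibc code):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

struct my_mutex
{
  int lock;   /* futex word: 0 free, 1 locked, 2 locked with waiters */
  int kind;   /* read after unlocking to pick the futex wake flags */
};

static void
my_mutex_unlock (struct my_mutex *m)
{
  /* Release the lock word; from here on another thread may lock,
     unlock, destroy, and free the mutex.  */
  if (__atomic_exchange_n (&m->lock, 0, __ATOMIC_RELEASE) == 2)
    {
      /* Sub-issue 1: reads m->kind although the mutex may already
         have been destroyed and its memory reused.  */
      int op = (m->kind & 128 /* hypothetical "shared" bit */)
               ? FUTEX_WAKE : FUTEX_WAKE_PRIVATE;

      /* Sub-issue 2: the wake itself is "pending"; if this thread is
         suspended here and the memory is reused for another futex,
         the wake hits that unrelated futex.  */
      syscall (SYS_futex, &m->lock, op, 1, NULL, NULL, 0);
    }
}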

> If the memory isn't accessible anymore (e.g., unmapped), the pending
> futex syscall will just fail and nothing bad happens.  Likewise, if the
> memory location is at the later time not associated with any other
> futex, then nothing bad will happen.  However, if the memory is reused
> for another futex, then we'll get a spurious FUTEX_WAKE call for
> completely unrelated futexes.

Right.

> This problem is not specific to mutexes, but to the combination of the
> design of the non-PI futex operations and memory reuse.  There are
> similar bugs for semaphores (#12674) and barriers (#13605, but there are
> further constraints on when destruction is allowed, which are related to
> signals).
> 
> PI mutexes are not affected because there, the kernel resets the futex
> var in FUTEX_UNLOCK_PI atomically wrt. any FUTEX_LOCK_PI calls; IOW, if
> there are threads waiting for the lock that might need to be unblocked,
> there will be no pending futex wake-up calls.

OK.
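
For contrast, a rough sketch of a PI-style unlock (simplified, ignoring
robust-mutex handling and error checking) shows why no wake-up stays
pending in userspace:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* With PI, either the owner clears the futex word itself (no waiters
   recorded, so no wake is needed), or the kernel clears the word and
   wakes the next waiter atomically inside FUTEX_UNLOCK_PI, so no
   wake-up remains pending in userspace after the mutex is released.  */
static void
pi_unlock (int *futex_word, int my_tid)
{
  int expected = my_tid;
  if (__atomic_compare_exchange_n (futex_word, &expected, 0, 0,
                                   __ATOMIC_RELEASE, __ATOMIC_RELAXED))
    return;  /* uncontended: FUTEX_WAITERS was not set */

  syscall (SYS_futex, futex_word, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}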
 
> === Workaround 1: Specify that spurious FUTEX_WAKE calls are possible
> 
> FUTEX_WAIT is currently specified to return 0 if it got woken by a
> FUTEX_WAKE call, and EINTR when wake-up happened due to a signal or
> other spurious wakeup.  The memory reuse thus possibly allows for
> unrelated FUTEX_WAKE calls being injected into the code running a new
> futex after memory reuse.  If this code relies on the FUTEX_WAIT return
> values to be correct (just regarding 0 vs. EINTR!), the pending
> unrelated FUTEX_WAKE calls can break it.
> 
> However, the existence of such code should be fairly unlikely:
> * All futexes that are used for multiple wait/wake-up cycles (like
> pthread mutexes, barriers, semaphores, ...) will have to be robust to
> their *own* pending FUTEX_WAKEs anyway because this is how futexes are
> designed.
> * PI futexes do not seem to be woken by FUTEX_WAKE according to some
> quick testing: FUTEX_LOCK_PI simply doesn't wake up and FUTEX_WAKE
> returns success, whereas FUTEX_WAIT_REQUEUE_PI isn't woken either but
> FUTEX_WAKE returns EINVAL (likely due to the other mutex being passed in
> being NULL (which we'd need to perhaps check in the glibc code)).  I
> wasn't able to find any specification for these cases, however.
> 
> I've reviewed glibc uses of FUTEX_WAIT (x86_64 and/or generic) and they
> should all be robust to pending FUTEX_WAKEs.  Likewise for GCC's libgomp
> and libitm.  I suppose that cases where there might be pending
> FUTEX_CMP_REQUEUE calls (e.g., condvars) should be containable by using
> a lock (and FUTEX_WAKE) at destruction.
> 
> The case that could be affected by such a specification change is thus
> single-wake-up futexes in programs that themselves never make those
> futexes subject to the pending-FUTEX_WAKE-while-memory-reuse problem.
> (I'm mentioning the latter because it's really not just glibc that's
> affected, but futex users in general.  So, arguably, correct programs
> that use futexes have to deal with the issue already.)
> 
> Pros:
> * Little code change required in glibc to fix the problems.
> * Typical mutex-using code shouldn't be affected by the specification
> change (assuming that glibc/libgomp/libitm are representative).
> 
> Cons:
> * Change to futex specs.

This is a "relaxed semantic" workaround.

I don't like workaround 1 because it changes the semantics of the
specification in ways that might adversely affect users. I am hesitant
to remove the potentially useful and semantically simple non-spurious
case. Rich comments that there is a kernel bug here, but a spurious
wake may be exactly that: a kernel bug, not behaviour we should bake
into the specification.
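
For the record, the reason most existing futex users are unaffected is
that a correct waiter rechecks the futex word in a loop instead of
trusting the wait's return value. A minimal sketch (hypothetical
helper, not glibc code):

#include <errno.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Block until *futex_word no longer equals VAL.  Robust against
   spurious wake-ups because it re-reads the futex word rather than
   relying on whether FUTEX_WAIT returned 0 or EINTR.  */
static void
futex_wait_until_changed (int *futex_word, int val)
{
  while (__atomic_load_n (futex_word, __ATOMIC_ACQUIRE) == val)
    {
      long ret = syscall (SYS_futex, futex_word, FUTEX_WAIT_PRIVATE,
                          val, NULL, NULL, 0);
      if (ret == -1 && errno != EINTR && errno != EAGAIN)
        break;  /* unexpected error; bail out rather than loop forever */
    }
}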
 
> === Workaround 1a: New FUTEX_WAKE_SPURIOUS operation that avoids the
> specification change
> 
> This is like Workaround 1, except that the kernel could add a new futex
> op that works like FUTEX_WAKE except that:
> * FUTEX_WAITs woken up by a FUTEX_WAKE_SPURIOUS will always return
> EINTR.  EINTR for spurious wakeups is already part of the spec, so
> correct futex users are already handling this (e.g., glibc does).
> * Make sure (and specify) that FUTEX_WAKE_SPURIOUS that hit other
> futexes (e.g., PI) are ignored and don't cause wake-ups (or just benign
> spurious wakeups already specified).
> 
> Users of FUTEX_WAKE_SPURIOUS should have to do very little compared to
> when using FUTEX_WAKE.  The only thing that they don't have anymore is
> the ability to distinguish between a real wakeup and a spurious one.
> Single-use FUTEX_WAITs could be affected, but we don't have them in
> glibc.  The only other benefit from being able to distinguish between
> real and spurious is in combination with a timeout: If the wake-up is
> real on a single-use futex, there's no need to check timeouts again.
> But will programs want to use this often, and will they need to
> use FUTEX_WAKE_SPURIOUS in this case?  I guess not.
> 
> Pros:
> * Correct futex uses will need no changes.
> Cons:
> * Needs a new futex operation.

This is a "new semantic" workaround.

I like this option the best. It isolates the change to a new operation,
and that operation makes the woken waiter always return EINTR, so a
wake-up that carries over to a new futex (reusing the memory) is not
mistaken for a real wake-up.
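
A rough sketch of how an unlock slow path might use the proposed
operation; FUTEX_WAKE_SPURIOUS does not exist, so the constant below is
made up:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical new op: woken waiters always see EINTR, and the wake is
   specified to be ignored (or at worst benignly spurious) if it lands
   on an unrelated futex that reused the memory.  */
#ifndef FUTEX_WAKE_SPURIOUS
# define FUTEX_WAKE_SPURIOUS 100   /* made-up value; no such op exists */
#endif

static void
my_unlock_slowpath (int *futex_word)
{
  __atomic_store_n (futex_word, 0, __ATOMIC_RELEASE);
  /* The wake no longer claims "you were really woken", so a wake that
     carries over to reused memory cannot confuse the new futex user.  */
  syscall (SYS_futex, futex_word, FUTEX_WAKE_SPURIOUS, 1, NULL, NULL, 0);
}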

> === Workaround 2: New FUTEX_UNLOCK operation that makes resetting the
> futex var and unblocking other threads atomic wrt. other FUTEX_WAIT ops
> 
> This is like UNLOCK_PI, except for not doing PI.  Variations of this new
> futex op could store user-supplied values, or do a compare-and-set (or
> similar) read-modify-write operations.  FUTEX_WAKE_OP could be used as
> well, but we don't need to wake another futex (unnecessary overhead).
> (I haven't checked the kernel's FUTEX_WAKE_OP implementation, and there
> might be reasons why it can't be used as is, e.g., due to how things are
> locked regarding the second futex.)
> 
> The biggest performance drawback that I see is a potential increase in
> the latency of unlocking any thread (blocked *or* spinning) when any
> thread is blocked.  This is because we'll have to ask the kernel to
> reset the futex var (instead of, like now, userspace doing it), which
> means that we'll have to enter the kernel first before a spinning thread
> can get the okay to acquire the lock.  This could decrease lock
> scalability for short critical sections in particular because those
> effectively get longer.
> I don't think it's sufficient to merely count the number of waiting
> blocked threads in the futex var to get around pending FUTEX_WAKE calls.
> If there is *any* potentially blocked waiter, we'll have to use the
> kernel to reset the futex var.
> Perhaps this could be mitigated if we'd do a lot more spinning in the
> futexes, so that it's unlikely to slow down spinning waiters just
> because there's some blocked thread.  For blocked threads, the slow down
> should be less because if a waiter is going to block, there's just a
> small time window where the FUTEX_WAIT will actually fail (EWOULDBLOCK)
> due to the futex var changing concurrently.
> 
> Pros:
> * Correct futex uses will need no changes.
> Cons:
> * glibc implementation will have to change (mutexes, semaphores,
> barriers, perhaps condvars).
> * Needs a new futex operation (or might use FUTEX_WAKE_OP with some
> performance penalty).
> * Potential decrease in lock scalability unless it can be mitigated by
> more aggressive spinning.

This is a "strict semantic" workaround, and the kernel does the work.

I don't like this solution. The performance impact is not worth it given
the other workarounds. Without performance measurements I don't know
exactly how bad it is, but having to enter the kernel is bad enough that
we don't want to consider it.
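
For completeness, a sketch of what the unlock might look like with such
an operation; FUTEX_UNLOCK is hypothetical and the constant below is
made up:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical new op: the kernel resets the futex word to 0 and wakes
   one waiter atomically wrt. FUTEX_WAIT, like FUTEX_UNLOCK_PI but
   without PI.  */
#ifndef FUTEX_UNLOCK
# define FUTEX_UNLOCK 101   /* made-up value; no such op exists */
#endif

static void
my_unlock_w2 (int *futex_word)
{
  /* Fast path: no waiter was ever recorded; stay in userspace.  */
  int expected = 1;
  if (__atomic_compare_exchange_n (futex_word, &expected, 0, 0,
                                   __ATOMIC_RELEASE, __ATOMIC_RELAXED))
    return;

  /* Slow path: any potentially blocked waiter forces a kernel entry,
     which is where the extra unlock latency comes from.  */
  syscall (SYS_futex, futex_word, FUTEX_UNLOCK, 1, NULL, NULL, 0);
}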

> === Workaround 3: Tell mutex destruction about pending FUTEX_WAKEs
> 
> This aims at enabling destruction to wait for pending FUTEX_WAKEs to
> finish.  There are two ways to keep this bit (or a mix of both):
> 1) In the futex var or mutex itself.  Might lead to more memory
> contention on contended mutexes because there are more modifications to
> the mutex than currently (and concurrently with any spinning).
> 2) In thread-local flags used for pending FUTEX_WAKEs to any mutex
> (there can only be one in flight concurrently per thread).  This
> requires mutex destruction to scan through the flags of all threads.
> 
> If destruction would have to actually block on pending FUTEX_WAKEs to
> finish (which I think we should do; just spinning might be too risky
> regarding latencies), then this would need to happen using a futex in
> non-reused memory to avoid the same futex problem from happening
> elsewhere.
> 
> Pros:
> * No changes to futexes.
> Cons:
> * Needs changes in glibc for most uses of FUTEX_WAKE.
> * Performance concerns for either contended mutexes or mutex
> destruction.

This is a "garbage collection" workaround in disguise.

The performance impact of this worries me. We would need to do detailed
measurements of the impact, and the solution as you pose it is incomplete.
I think the devil will be in the details of making this case work well.

Rich makes a good point that there is a case where the mutex was
created process-shared (pshared == 1) and the memory is unmapped by
another thread: one thread unlocks the mutex, another thread unmaps
the memory, then another thread mmaps some other memory back (with a
different futex in the same place), and the pending wake touches that
futex. pthread_mutex_destroy is never called, but such use is allowed
for shared mutexes.
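
To make variant (1) above a bit more concrete, here is a sketch that
keeps a pending-wake count next to the futex word (names are made up,
and this glosses over exactly the details that would need careful
treatment):

#include <linux/futex.h>
#include <sched.h>
#include <sys/syscall.h>
#include <unistd.h>

struct my_mutex3
{
  int lock;            /* futex word */
  int pending_wakes;   /* number of FUTEX_WAKEs that may be in flight */
};

static void
my_mutex3_unlock (struct my_mutex3 *m)
{
  /* Announce a possible wake before the lock word is released, so a
     thread that later acquires the mutex and wants to destroy it is
     guaranteed to see a nonzero count.  This extra write per unlock
     is the additional contention mentioned above.  */
  __atomic_fetch_add (&m->pending_wakes, 1, __ATOMIC_RELAXED);

  if (__atomic_exchange_n (&m->lock, 0, __ATOMIC_RELEASE) == 2)
    syscall (SYS_futex, &m->lock, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);

  __atomic_fetch_sub (&m->pending_wakes, 1, __ATOMIC_RELEASE);
}

static void
my_mutex3_destroy (struct my_mutex3 *m)
{
  /* Spinning is only for illustration; really blocking here would
     itself need a futex in memory that is never reused, as noted in
     the proposal.  */
  while (__atomic_load_n (&m->pending_wakes, __ATOMIC_ACQUIRE) != 0)
    sched_yield ();
}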
 
> === Workaround 4: Only use futexes in non-reused memory.
> 
> This doesn't avoid the pending FUTEX_WAKEs, but makes them manageable by
> redirecting any blocking on things like mutexes to separate blocking
> constructs.  For example, we could have per-thread futexes for blocking
> (each thread will only block on one thing concurrently) in memory that
> won't be reused for anything but such futexes.  This could require some
> more work at thread destruction, but should be doable I guess.
> 
> The normal synchronization constructs would then install pointers to
> those futexes (or the enclosing data structures) when they actually
> block, and potentially build lists/queues of those.  Thus, this would
> kind of duplicate what the kernel is doing for futexes, but we'd get all
> the flexibility we'd need to build whatever we need (e.g., to solve the
> pending FUTEX_WAKE problem, to have 64b futexes, etc.), and optimize how
> we need it (and using the information that glibc might have, but the
> kernel doesn't have).
> 
> The performance implications of this are nontrivial to estimate. On the
> one hand, the fast paths (spinning) aren't affected, and we'd have more
> control over spinning and wakeups than with the current scheme of doing
> mostly (single-)futex-var-based synchronization.  On the other hand, if
> we take over prioritization of what thread to unblock first, this also
> means that we're taking flexibility and choice away from the kernel, so
> the kernel can't look at its information about scheduling to actually
> make a choice of what to unblock (unless we add new futex ops).
> 
> Similar approaches to blocking do work for other languages, so it should
> be doable.
> 
> Fitting PI into this could be nontrivial, though (but I haven't really
> thought about this aspect, so it's just a guess that it might be).
> 
> Pros:
> * No changes to futexes.
> * Lots of control for userspace.
> Cons:
> * Lots of things would have to be adapted in how glibc does
> synchronization.

This is a "garbage collection" solution. I don't like it because it
doesn't fix this problem for any other free software projects that use
futexes. I would like glibc to pioneer a more general solution that can
be adopted by other projects. We have the expert knowledge to come up
with a solution that can be used easily by everyone in userspace.
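
For reference, the core of such a scheme would be a per-thread futex
used only for blocking, roughly like the sketch below (names are made
up; the per-mutex queueing of waiters and the thread-exit handling are
omitted):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Each thread owns one of these, in memory that is never reused for
   anything else (which is what makes pending wakes harmless); thread
   exit would have to make sure no wake can still be in flight for it.  */
struct waiter
{
  int futex_word;        /* 0 = blocked, 1 = signalled */
  struct waiter *next;   /* link for a per-mutex queue of waiters */
};

static _Thread_local struct waiter self_waiter;  /* simplistic placement */

static void
waiter_block (struct waiter *w)
{
  while (__atomic_load_n (&w->futex_word, __ATOMIC_ACQUIRE) == 0)
    syscall (SYS_futex, &w->futex_word, FUTEX_WAIT_PRIVATE, 0,
             NULL, NULL, 0);
  __atomic_store_n (&w->futex_word, 0, __ATOMIC_RELAXED);  /* rearm */
}

static void
waiter_wake (struct waiter *w)
{
  __atomic_store_n (&w->futex_word, 1, __ATOMIC_RELEASE);
  /* This wake targets per-thread memory that is never reused, so a
     pending wake can no longer hit an unrelated futex.  */
  syscall (SYS_futex, &w->futex_word, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}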
 
> === Summary / RFC
> 
> IMO, workarounds 1, 1a, or 2 might be the best ones, although 4 might
> also be good.

I vote 1a. I don't want to relax the semantics of the original operations
to support spurious wakes. Such spurious wakes might be fixed in future
kernels, and other applications might start depending on the stricter
non-spurious behaviour even if glibc doesn't.

> Do you see any other options, or have comments on those above?

Cheers,
Carlos.

