This is the mail archive of the mailing list for the glibc project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [RFC] improving spinning in adaptive mutexes

On 02/28/2013 08:57 AM, Torvald Riegel wrote:
Adaptive mutexes are a mix between spinlocks and normal mutexes: they
spin for a while trying to get the lock (in userspace), and only resort
to blocking by using futexes when they can't acquire the lock after the
limited amount of spinning.  This is meant to improve scalability and
per-thread overheads especially for smaller critical sections that are
subject to at least some contention.
I thought I read at one time that Solaris didn't to "normal" mutexes -- they were always adaptive. I can't seem to find the reference though.

Regardless, the whole point of adaptive mutexes is to improve scalability and from what it appears glibc's implementation has some undesirable problems.

I've looked at the current implementation (triggered by a former AMD
employee mentioning that the maximum amount of spinning would be much
too low), and there are indeed a few issues.  I'll summarize those in
what follows, and how we could improve it.  In the long term, this could
also be used to improve the default mutexes, I hope.
As a point of reference, didn't that AMD employee claim that the current spin count was less than the L2 miss penalty or something similar? If so, doesn't that dramatically reduce the value of an adaptive mutex?

Overall, this leads to a latency curve that is more or less a step function, and thus not really simple for automatic tuning. Especially when we don't know about the length of CS and the level of contention (ie, how many threads try to acquire the lock).
In general, I don't see that we'll ever have a good indication of the length of a CS and level of contention for the general case.

Even for code where we had that information, we don't have any reasonable way to provide it (that I'm aware of) to help with selecting the right primitives from a performance standpoint.

--- Inefficient spinning

1) No spinning using loads but with just trylock()
- trylock() doesn't first check whether the lock is free using a load,
which can be beneficial in the case of uncontended locks (based on
related measurements I did a few years ago; I don't have current
- If the atomic instruction in trylock() puts the lock's cache line into
exclusive state right away, we cause cache misses with the spinning.
I'll note that the model where we first see if the lock is free, then actually trylock is amenable to probing to gather performance data on lock contention. It's a general structure I was going to suggest for glibc so that we could put in probes that would give us valuable data about lock contention.

However, I think this adjusting the max spinning iterations based on the wrong measurement: - If we have short critical sections, the scheme reduces the spinning count. But why? If they are short, we better spin and grab the benefits. If the max spinning is too large, it won't matter most of the time. However, if it's too short, we don't spin even though we should. (Think of the acquisition latency as a bell curve; if we cut off at the average, we're sort of missing half the benefit.)
Makes sense. If we converge on N, spinning for a slightly longer time is still better than blocking. It's really a range we want to look at and the range is probably defined by how expensive it is to block plus either wakeup time or time to switch in another thread to do useful work.

=== Possible steps to improve this

--- Adaptive mutexes

This list is ordered by what might make most sense to try first:

(1) Improve the spin loop itself: loads instead of just trylock()
In the spin loop, use loads until the lock appears free, then use
trylock.  A relatively simple change, but we should still measure the
impact on several systems and archs to better understand the benefits I
suppose, including in cases such as just two threads (where there is
less contention).
We then probably also need to at least adjust MAX_ADAPTIVE_COUNT.
Obviously given what we've heard about tile, this will have to be controlled by the target somehow. However, it seems like a good thing to do to me.

(1.5) Add more probe points (or similar) to adaptive mutexes? Perhaps we can derive already quite a bit about mutex usage in real programs based on the probes we already have in place and any in-kernel stats, we don't have any spinning-specific probes. Would those be necessary / helpful?
Definitely. And much easier to do if you use the spin on testing the lock, then use trylock once the lock appears free.

(2A) Disable adaptation of max spinning. Change max spinning count? This assumes that we don't have a good understanding of how to adapt this properly (see 2B below) and won't have this in the near future either. So, don't do any adaption. Check that the maximum spinning count makes sense, and go with that. Perhaps the max spinning should be derived from the futex wake-up latency, because it affects the lock acquisition latency curve a lot.
But that seems to lose the value in adaptive mutexes; instead we should look to fix the spinning adaptation.

It doesn't have to be perfect. It may be as simple as looking at the sweet spot for spinning as a range and trimming the range based on a feedback loop.

Obviously we need to fix the max spinning count as well.

(3) Add back-off The back-off implementation itself is simple, but tuning it isn't I think. You need to trade off the chance of reducing contention against the chance of introducing additional latency. Yet some kind of randomized back-off is widely believed to be an essential ingredient. Perhaps we can reuse some of the recent Linux kernel work in this area (
Randomized backoff is goodness. Hell, that's been known for decades.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]