This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH 02/10] Add the low level infrastructure for pthreadslock elision with TSX
On Thu, 2013-01-10 at 12:19 -0800, Andi Kleen wrote:
> This patch implements a simple adaptive lock elision algorithm based
> on RTM. It enables elision for the pthread mutexes and rwlocks.
> The algorithm keeps track whether a mutex successfully elides or not,
> and stops eliding for some time when it is not.
Andi, first of all, thanks for working on this.
> When the CPU supports RTM the elision path is automatically tried,
> otherwise any elision is disabled.
> The adaptation algorithm and its tuning is currently preliminary.
> The code adds some checks to the lock fast paths. IFUNC is used
> to select functions with these checks based on the RTM availability.
> All the parameters of the algorithms are tunable using environment
This can be good or bad depending on how you see it. It is good for
people that want to tune. But it also indicates that HLE _needs_ to be
tuned to be worthwhile, which worries me.
I know that that HLE is the first LE HW that we can experiment with (at
least, I'm not aware of any other mainstream CPU having something
similar). And the fact that we can't yet talk about performance is
another obstacle. But we should try really hard to avoid requiring
users to tune our locks, including whether to use HLE or not.
Also, if we want to expose tuning parameters, my gut feeling is that at
least long-term, we'd need something different / better for how to set
those parameters. If we'd expose all tunables in pthreads code to the
same extent and in the same way (e.g., max spin count for adaptive
mutexes), it would be a mess.
So, I don't want to blame your patch for this, but IMHO this is
something the project should investigate and provide guidance on.
> Lock elision can be enabled/disabled using environment variables.
> It can be also enabled or disabled using new lock types for
> mutex and rwlocks.
I think we already have too many mutex types. It's good to provide
choice, but if we'd really optimize our locks, we'd end up with _a lot_
more types. Should we really have one lock type for each of cross-NUMA,
one with more fairness and one without, one that spins and one that
doesn't (yes, that's a current case), and so on? And do we then also
build combinations of all of them and with or without HLE? And what if,
for example, Intel comes up with another HLE implementation that is more
powerful? Will we then have HLE2 initializers? Or HLE_vendorX?
I think we should sit down and decide which locks we really need, and
what they should offer. And then see how HLE fits into it.
For example, can we subsume HLE under adaptive mutexes, or at least
enable it for those by default? They're supposed to be auto-tuned
anyway, and while spinning isn't the same thing as speculative
execution, you could argue that both can go wrong easily but can also
perform much better than what the mainstream programmer might be able to
come up with.
> This patch implements the low level "lll_" code for lock elision.
> Followon patches hook this into the pthread implement.
> Changes with the RTM mutexes:
> Lock elision in pthreads is generally compatible with existing programs.
> There are some obscure exceptions, which are expected to be uncommon.
If it violates anything in the specification, it's a correctness bug,
irrespective how commonly it might be triggered.
Lock elision should be as transparent as possible. We should not change
lock semantics; if we really need to (think trylock perhaps), this
should extend existing interfaces instead of changing the semantics of
Programmers also shouldn't have to change their existing code to benefit
from HLE; otherwise, this kills one of the benefits of using HLE with
existing code compared to, for example, using transactions right away.
> See the manual for more details.
> - A broken program that unlocks a free lock will crash.
I think that's fine for the non-error-checked mutexes, but won't be for
those that check I guess. Nonetheless, crashing isn't really nice, so
if we can avoid it at reasonable cost, we probably should.
Note that C++11 also requires the lock to be owned before an unlock()
call. Same for C11.
> There are ways around this with some tradeoffs (more code in hot paths)
Can you summarize the workarounds and the associated costs?
Both keeping thread-local software counters of the TSX nesting depth and
XTEST (ie, checking whether we're executing as a HW transaction right
now) seem to be options. Are there others?
> This will also happen on systems without RTM with the patchkit.
> I'm still undecided on what approach to take here; have to wait for testing reports.
What kind of tests did you run so far?
> - pthread_mutex_destroy of a lock mutex will not return EBUSY but 0.
That should be fine, as it's a may-fail requirement. You could add code
to write to the lock value itself too to ensure that there will be no
other thread holding the mutex (and thus get a correct return value).
The current pthread_mutex_destroy code already writes to the mutex kind,
so this should have negligible overhead.
> - mutex appears free when elided.
> if (pthread_mutex_trylock(mutex) != 0) do_something
> will not do something when the lock elided.
This needs to be fixed (as Rich Felker already pointed out).
I see that you are using RTM instead of the HLE facilities (ie, XACQUIRE
and XRELEASE) that would give you the expected trylock behavior, which
I assume you do this to be more flexible regarding when to fall back to
Have you tried to use HLE within an RTM transaction? That would give
you the flexilibity and provide the correct trylock behavior. If you do
it for the outermost transaction only, then nesting depth is just
decreased by one (but that would require tracking the nesting depth in
software too to figure out the last lock release that then needs an XEND
instead of an XRELEASE).
Note that C++11 allows trylock to fail to acquire a lock even if it's
not actually owned. That could allow trylock, if executed in a
transaction, to just always return that the lock is busy, which would
eventually fall back to nonspecualtive and thus yield forward progress.
However, I'm not sure whether that would be really useful in practice.
Using speculative trylock on owned locks will in any case lead to an
abort, but if the lock is not owned then this wouldn't do the right
> However note that if the check is an assert it works as expected because the
> assert failure aborts and the region is re-executed non transactionally,
> with the old behaviour.
> - There's also a similar situation with trylock outside the mutex,
> "knowing" that the mutex must be held due to some other condition.
> In this case an assert failure cannot be recovered. This situation is
> usually an existing bug in the program.
Right, I don't see a way how a properly synchronized program (that
doesn't depend on timing magic...) could "know" that the lock must be
> - Same applies to the rwlocks. Some of the return values changes
> (for example there is no EDEADLK for an elided lock, unless it aborts.
> However when elided it will also never deadlock of course)
The EDEADLK case is just a may-fail, so this is fine; as you say, there
is no deadlock in this case . Are there any other changes in semantics
besides wrong return values of trywrlock and tryrdlock?
> - Timing changes, so broken programs that make assumptions about specific timing
> may expose already existing latent problems. Note that these broken programs will
> break in other situations too (loaded system, new faster hardware, compiler
> optimizations etc.)
> Currently elision is enabled by default on systems that support RTM,
> unless explicitely disabled either in the program or by the user.
> Given more experience we can decide if that is a good idea, or if it
> should be opt-in.
> This patch implements the basic infrastructure for elision.
> Open issues:
> - XTEST or not XTEST in unlock, see above.
If the performance difference is small enough, I'd probably go for using
> - There's a obscure race when first setting the elision type the first time.
> - Adaptation for rwlocks
> - Condition variables don't use elision so far
The internal lock used in the condvars would be a good testing ground
> - Better adaptation algorithms
> - Elide spinlocks too?
Given that one spinlocks are probably often used for small critical
sections, I think this would make sense. OTOH, they might also be used
for small and very contended critical sections, which might be a
challenge for the adaptation algorithms.
I'd like us to discuss the semantics and high-level design in more
detail first. Thus, I have no comments yet on the specifics of your
patch, but will review once we have consensus on the design.