Lock elision implementation guidelines

This wiki page documents the design guidelines for implementations of lock elision in glibc. Currently, the focus is on lock elision for the POSIX Threads locks provided to users.

Lock elision (LE) is an optimization for critical sections (CS) that allows for virtually mutually exclusive execution of CS without having to modify the lock protecting the CS. This can improve performance by avoiding cache misses on the lock itself and by exploiting parallelism between concurrent CS, but it requires that concurrent CS are unlikely to have data conflicts with each other (i.e., that they don't access the same memory locations concurrently, with at least one access being a write). LE is supposed to be transparent to the program, which continues to synchronize using locks. LE is typically implemented using Hardware Transactional Memory (HTM) or similar hardware support. See Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution by Rajwar and Goodman, MICRO 2001, for more information.

LE usage scenarios

There are several use cases for LE, which we have to distinguish because they result in somewhat different requirements regarding how LE should be exposed and how users can make use of it.

  1. In production, no custom tuning (PNT)
    • Likely the most common case.
    • What we'd like to have in the long term (i.e., there's no need to tune).
    • Lock elision is transparent to the program and increases performance where possible.
  2. In production, with custom tuning (PT)
    • Tuning provides enough benefit to users to justify the effort.
    • LE is mostly transparent to the program, but there are tuning hints.
    • The tuning interfaces and parameters should probably be stable.
  3. Experimenting with LE (E)
    • Users that want to play around with LE, override any glibc-internal tuning, etc.
    • Tuning interfaces and parameters don't need to be stable.
    • LE doesn't need to be transparent to programs.
    • Might be important in the early stages of lock elision support, but should become less important in the long term.
  4. glibc-internal uses
    • LE for locks used to protect CS in glibc code (i.e., no user code is run during those CS).
    • Here, we can do whatever works for glibc code.

LE semantics

LE is a lock implementation optimization, and should be transparent to programs; that is, locks must provide the same semantics independently of whether LE is used or not. Note that this does not require LE to be transparent in incorrectly synchronized programs.

While we can use locks with any useful semantics internally in glibc, lock elision in the lock implementations that we provide to users needs to be compatible with what POSIX requires. This is necessary to use LE as widely as possible and in (correct) legacy code, and to make the PNT usage scenario common. The lock semantics specified in C11 and C++11 are more friendly to lock elision, so it might make sense to provide additional nonstandard mutex kinds for those.

LE implementation on HTMs

Some architectures provide full hardware support for LE (e.g., Intel's HLE), but LE can also be implemented with HTMs like Intel's RTM. The main functional difference is that with the former, a thread that acquired a lock using elision will itself still observe the memory used for this lock as changed (i.e., as acquired).

HTMs need a different LE implementation because while the transactions provided by the HTM take care of the data conflict detection between CS, we cannot let transactions write to the lock because this would abort other transactions that want to elide the same lock.

lock()

When a thread does not elide the lock in question, it simply acquires the lock by changing its state in shared memory. When using elision, threads use the HTM to monitor that the lock was never changed and was not acquired (by starting a transaction, reading the lock, and checking the value). As a result, all lock-eliding threads use transactions to order their CS; once one thread acquires the lock without elision, this will abort all lock-eliding transactions because the nontransactional write to the lock during acquisition conflicts with the transactional reads of the lock.

(Note that this discussion assumes an HTM that either aborts transactions eagerly on conflicts, or never lets side effects of uncommitted transactions become visible to other threads.)

The transactional loads of the lock must have acquire memory order (as specified by C11), similar to the acquire memory order used in normal lock acquisitions. (This is imprecise; details depend on the HTM's memory model.)

unlock()

A lock-eliding thread can't know whether it acquired the mutex by just looking at the lock, because it just monitors it but didn't modify it. However, if we assume that the program only releases locks it previously acquired, then a thread can end the current transaction when unlock() is called for a lock that is not acquired. This can only happen when the thread used elision when acquiring this lock, and the atomicity provided by the transaction ensures that the lock was never acquired by another non-eliding thread in the meantime.
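
The lock()/unlock() scheme above can be sketched as follows. This is a minimal illustration, not glibc's implementation: tx_begin()/tx_end()/tx_abort() are simulated stand-ins for real HTM primitives (e.g., Intel RTM's _xbegin()/_xend()/_xabort()), and the simulation always fails to start a transaction, so only the nontransactional fallback path actually runs here.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int locked; } elided_lock_t;

/* Simulated HTM: pretend every transaction fails to start. */
static bool tx_begin(void) { return false; }  /* true = transaction started */
static void tx_end(void) { }                  /* commit the transaction */
static void tx_abort(void) { }

void le_lock(elided_lock_t *l)
{
    if (tx_begin()) {
        /* Elision path: only read the lock; the HTM monitors the location,
           so a nontransactional write by another thread aborts us. */
        if (atomic_load_explicit(&l->locked, memory_order_acquire) != 0)
            tx_abort();  /* lock is held; elision cannot proceed */
        return;          /* run the CS inside the transaction, lock unchanged */
    }
    /* Fallback: ordinary acquisition.  This write conflicts with the
       transactional reads of eliding threads and aborts them. */
    int expected = 0;
    while (!atomic_compare_exchange_weak_explicit(&l->locked, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
        expected = 0;
}

void le_unlock(elided_lock_t *l)
{
    /* If the lock appears free, we must have elided it (assuming the program
       only releases locks it previously acquired): commit the transaction. */
    if (atomic_load_explicit(&l->locked, memory_order_relaxed) == 0) {
        tx_end();
        return;
    }
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```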

Compatibility of lock()/unlock() with pthreads mutex types

This scheme allows a thread to acquire a lock that it has already acquired, and there is no error checking when releasing a lock. For which pthreads mutex types is it thus a suitable implementation?

PTHREAD_MUTEX_DEFAULT
Yes. Recursive locking and releasing an unlocked mutex result in undefined behavior, so the implementation is free to do anything in such cases.
PTHREAD_MUTEX_NORMAL
Strictly speaking, no. Releasing an unlocked mutex is undefined behavior, but trying to lock an already-acquired lock is required to deadlock. Deadlock is not the same as undefined behavior: even though there's no progress on deadlock, nothing bad will happen either.
  • In a buggy program, the only benefit of this kind of mutex over the default one is that it gives fail-stop behavior. But that doesn't give programs much:
    • The mutexes aren't required to be cancellation points. Only rwlock might contain cancellation points. We can't just kill threads.
    • To find out whether there was a deadlock, the program would have to explicitly track which mutexes were acquired and are about to be acquired. POSIX doesn't let you query whether a mutex is part of a deadlock. If the program is tracking this anyway, why would it not use lock acquisitions with timeouts to prevent any deadlocks right away?
  • However, relying on the deadlock is allowed and might be used (thanks to Rich Felker for these examples):
    • For a permanent thread that does nothing but handle signals using async-signal-safe functions, pthread_mutex_lock(&m); pthread_mutex_lock(&m); is just as valid as for(;;)pause();

    • Self-deadlock would be one obvious way to make a thread preserve the lifetime of all its presently extant automatic objects until program exit.
    • At least calling exit() should actually shut down the program even if a thread is just deadlocked on a particular mutex (and doesn't also hold other locks that might be needed for shutdown).
  • Currently, PTHREAD_MUTEX_NORMAL uses the same constant as PTHREAD_MUTEX_DEFAULT in glibc. However, PTHREAD_MUTEX_NORMAL mutexes can only be created by calls to pthread_mutexattr_settype; static initializers exist only for PTHREAD_MUTEX_DEFAULT mutexes. Thus, we can distinguish them by assuming that every mutex initialized via pthread_mutexattr_settype is of type PTHREAD_MUTEX_NORMAL, and that statically initialized mutexes are of type PTHREAD_MUTEX_DEFAULT. We could give PTHREAD_MUTEX_NORMAL a new constant, but this doesn't seem necessary now because it would enable elision only in a case that seems unlikely (i.e., explicitly using pthread_mutexattr_settype to request a PTHREAD_MUTEX_DEFAULT mutex even though pthread_mutexattr_init already sets all attributes to the implementation-defined defaults).
PTHREAD_MUTEX_ERRORCHECK
No, requires error checking.
PTHREAD_MUTEX_RECURSIVE

No. While correctly-used recursive lock/unlock do work, this mutex type is specified to return an error if a thread unlocks a mutex that is not acquired or not acquired by this thread.

PTHREAD_MUTEX_ADAPTIVE_NP

Same guarantees as PTHREAD_MUTEX_NORMAL according to linuxthreads.texi.

PTHREAD_MUTEX_FAST_NP
Same as PTHREAD_MUTEX_ADAPTIVE_NP.
PTHREAD_MUTEX_TIMED_NP

According to linuxthreads.texi, this type isn't documented to behave like PTHREAD_MUTEX_NORMAL (but it would be safe to treat it that way). Thus, we could treat it like PTHREAD_MUTEX_DEFAULT, but we would then need to introduce a new constant for this type (it has the same constant-aliasing issue as PTHREAD_MUTEX_NORMAL).

Mutex trylock() and alternative LE implementations

With the monitor-the-lock implementation scheme described previously, there is no way to reliably detect whether a lock has already been acquired by a thread. This makes the scheme unsuitable for most mutex types, and it also has implications for trylock(): for recursive mutexes, trylock() will just acquire the lock, as expected; for nonrecursive mutexes, trylock() must return an error if the mutex is already acquired, which this scheme cannot detect.

There are a few potential solutions:

  1. Track which mutexes have been acquired (in software)
    • Can't track this using bits in the lock itself or in shared data, or the write will prevent elision due to data conflicts with other transactions. Could track using TLS.
    • Con: larger runtime overheads, and it increases the amount of data transactions touch (important because an HTM's access-tracking capacity is limited). Could still increase scalability of contended CS, but adds overhead for uncontended ones.
    • Pro: allows elision to be used for all mutex types.
  2. trylock() on non-recursive mutexes aborts transactions/elision
    • Con: bad for custom spinlocks in programs built on top of trylock()
    • Pro: Simple to implement.
  3. Add a new kind of trylock() with different semantics; use option 2 for the original trylock():
    A. possibly_recursive_trylock(): would behave like a recursive mutex if used with elision, and like a nonrecursive mutex's trylock() if used without elision on a nonrecursive mutex.
      • Con: Adds a new kind of trylock(). Unclear whether it's a useful addition in the long term because it's just used to support a certain kind of HW support.
      • Con: The semantics are cumbersome. The program has no clean way to detect whether elision is currently being used, nor can it force its use.
    B. nonrecursive_trylock(): has undefined behavior when used to acquire a mutex recursively -- the program must never do that.
      • Con: Adds a new kind of trylock().
      • Con: Adds more undefined behavior.
      • Pro: Semantics are clearer, and might be practical in the context of typical uses of nonrecursive mutex types.
      • Pro: Would satisfy the C++11 trylock requirements for C++ nonrecursive mutexes, so this has use cases.
  4. New nonrecursive mutex types with different trylock() semantics. (Combined with option 1, 2, or 3 for the existing types.)
    • New types would get trylock() behavior like for nonrecursive_trylock() in option 3B.
    • Con: Needs code review when trying to exploit this in legacy code.
    • Con: Yet another mutex type. Unclear whether it's a useful addition in the long term because it's just used to support a certain kind of HW support.
    • Con: New type might be associated with enabling LE explicitly. Bad because this mixes tuning decisions with types used to distinguish semantics.
    • Pro: Would satisfy the C++11 trylock requirements for C++ nonrecursive mutexes, so this has use cases.
  5. Advise programmers to switch from nonrecursive to recursive mutexes when they want trylock() to work well with elision. (Combined with option 2.)
    • Con: Still need a solution to the error checking requirement of PTHREAD_MUTEX_RECURSIVE, which would prevent using LE for recursive mutexes otherwise.
    • Con: Programmers need to change their code, we need to educate them about this.
    • Pro: No new mutex type.
    • Pro: Code review is straightforward for existing legacy code (e.g., if no trylock is used and the program is deadlock-free, a nonrecursive mutex can be changed to a recursive one). Code review should never be more complex than what would be required for option 4.

It seems that option 2 is the least bad, and is a safe approach. Option 1 likely has performance overheads that make it impractical. Options 3 to 5 do not benefit uses of the current mutex types / functions.
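
Option 2 can be sketched as follows. Again this is an illustration under stated assumptions: tx_in_transaction()/tx_abort() are simulated stand-ins for real HTM primitives (e.g., RTM's _xtest()/_xabort()), and the simulation reports that no transaction is active.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Simulated HTM primitives. */
static bool tx_in_transaction(void) { return false; }
static void tx_abort(void) { }

typedef struct { atomic_int locked; } elided_lock_t;

/* trylock() under option 2: never runs transactionally.  If called while
   eliding, abort the transaction so the caller re-executes the whole CS
   without elision. */
int le_trylock(elided_lock_t *l)
{
    if (tx_in_transaction())
        tx_abort();
    int expected = 0;
    if (atomic_compare_exchange_strong_explicit(&l->locked, &expected, 1,
                                                memory_order_acquire,
                                                memory_order_relaxed))
        return 0;
    /* Already acquired -- possibly by this very thread; we cannot tell. */
    return EBUSY;
}
```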

Compatibility with pthreads rwlock

wrlock() / trywrlock()
No. wrlock() is required either (1) to block on an already-locked rwlock or (2) to return EDEADLK in this case (which we can't do because we cannot detect whether we already acquired the lock). Thus, we have effectively the same case as with PTHREAD_MUTEX_NORMAL. Likewise, rdlock() is required to block (or return EDEADLK) if any thread already holds a write lock, and tryrdlock() is required to return an error in this case.
rdlock() / tryrdlock()
Yes, provided that elision is not used for wrlock() and trywrlock(). A thread can acquire the read lock several times, but it is also required to call unlock() a matching number of times (i.e., this is different from the return code requirement for PTHREAD_MUTEX_RECURSIVE).

C11/C++11 mutex semantics

C++11's preconditions for mutex operations require that a thread owns a mutex before unlocking it and that lock()/trylock() aren't called for nonrecursive mutexes already owned by the calling thread. Thus, it would be possible to use the HTM-based LE implementation scheme for them. However, mapping C++11 recursive mutexes to PTHREAD_MUTEX_RECURSIVE wouldn't help because the latter has the error checking requirement. ??? Is that another reason to add a new/modified recursive mutex type?

C11's lock()/unlock() have similar preconditions. trylock() does not have the precondition that a nonrecursive mutex must not be locked by the same thread. Thus, apart from trylock(), we could use LE for those mutexes (but the problem for C++11 recursive mutexes applies here too). ??? Or do lock()/unlock() need to return thrd_error when the preconditions do not hold?

LE tuning

Tuning granularities

We can make tuning decisions at different granularities. Starting with the finest:

  1. per CS: estimate covering the properties of this CS and all concurrently executing CS
  2. per lock: estimate covering all CS protected by this lock
  3. process-wide: estimate covering all locks
  4. implementation defaults

A CS affects whether LE can be successful because it uses the HTM/LE HW resources in a certain way (e.g., accesses a certain number of cachelines which might or might not be larger than the HTM's access tracking capacity). Other CS that execute concurrently with the CS also affect LE success probability because they determine whether there are data conflicts, and whether the lock is acquired without elision or not. (Note that concurrent non-CS code could also cause data conflicts.)

The more coarse-granular the tuning gets, the harder it becomes to make a good tuning decision in heterogeneous workloads. For example, if both large and short CS are using a common lock to protect access to data (e.g., resize and query operations in a hash table), then LE might work for the short operations but not for the larger ones.
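
As an illustration of per-lock tuning, an implementation might track recent elision failures per lock and temporarily fall back to normal acquisition. The following is a hypothetical sketch; the names and the threshold are invented for this example and do not describe glibc's actual adaptation algorithm.

```c
/* Hypothetical per-lock tuning state. */
typedef struct {
    int skip_elision;  /* > 0: acquire without elision this many more times */
} elision_tuning_t;

enum { SKIP_AFTER_ABORT = 3 };  /* illustrative back-off threshold */

/* Returns 1 if the next acquisition of this lock should attempt elision. */
int should_elide(elision_tuning_t *t)
{
    if (t->skip_elision > 0) {
        t->skip_elision--;
        return 0;
    }
    return 1;
}

/* Called when an elision attempt aborted (e.g., due to a data conflict or
   exceeding the HTM's access-tracking capacity). */
void note_elision_abort(elision_tuning_t *t)
{
    t->skip_elision = SKIP_AFTER_ABORT;
}
```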

Portability of tuning decisions

Whether LE can be successful depends on:

  1. Properties of the HW mechanism (e.g., HTMs) used to implement LE: capacity, ...
  2. Properties of the program: how much data it accesses in a CS, the probability of data conflicts between concurrent CS, ...

The two aren't strictly independent.

Note that the HW properties aren't necessarily architecture-specific but could also be machine-model-specific given that HTMs are a very new HW feature, and thus HW support could evolve over time. Therefore, portability of tuning decisions across different models of an architecture, or across different architectures can be a problem.

Tuning decisions that essentially reveal properties of the program can be more likely to work well across different machines because they do not mix in assumptions about the HW properties. In contrast, tuning decisions like binary use-LE/do-not-use-LE flags might be much harder to maintain over a longer period of time.

??? Which tuning parameters are widely useful, and/or likely to be meaningful for future HW too?

Possible LE tuning road maps

The LE implementation is still experimental. We need intermediate steps until we can fully exploit its potential. (E, PNT, PT denote the use cases explained above.) Candidates:

  1. Focus on experimentation (E)
    • Disable LE by default. Provide tuning knobs that are not part of the ABI.
    • Assumption: Might be too early to do anything meaningful for PNT or PT
    • Pro: no harm done by experimental LE implementations
    • Con: no benefits beyond E :(

  2. Focus on PNT, allow E
    • Enable LE conservatively. Provide tuning knobs that are not part of the ABI.
    • Pro: Hopefully little harm done by LE implementation deficiencies (we need to be conservative with semantics, but performance tuning doesn't need to be perfect right away).
    • Con: No stable way to tune LE until a later time.
  3. Expect tuning decisions right away and provide stable interfaces for this:
    • Focus on PT; PNT might or might not work, including having significant negative effects on performance.
    • Provide stable interface for PT tuning. Perhaps provide new, modified mutex types too.
    • Assumption: LE only works on new HW, and it's okay to require tuning for new HW.
    • Pro: Supports PT and early adopters that want to use LE in an in-production setting. Doesn't preclude E.
    • Con: Large risk of introducing stable interfaces and tuning knobs that might turn out to be not the right ones in the long term. This is likely to be even more severe if new mutex types are added.

LockElisionGuide (last edited 2013-07-04 20:56:16 by TorvaldRiegel)