This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Fix elision in nested trylock.


On Thu, 2013-06-27 at 02:41 +0200, Andi Kleen wrote:
> On Thu, Jun 27, 2013 at 01:10:03AM +0200, Torvald Riegel wrote:
> > Attached is a patch that fixes a race condition in how Andi's current
> > patches use elision in trylock.  The elision code path in trylock aborts
> > if it is executed in a transaction, but the non-elision code path does
> > no such thing.  Thus, we need to prevent a thread from acquiring a lock
> > by using elision and then executing a nested trylock that doesn't use
> > elision (the nested trylock would violate the POSIX requirements in this
> > case).
> 
> The POSIX requirement is on the same lock, not different locks.

Yes, and that's where the bug triggers.
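
To make the requirement concrete, here is a minimal standalone example
(nothing elision-specific; just the POSIX trylock semantics we have to
preserve for a default, non-recursive mutex):

#include <assert.h>
#include <errno.h>
#include <pthread.h>

int
main (void)
{
  pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

  assert (pthread_mutex_lock (&m) == 0);

  /* The calling thread already holds the lock, so a nested trylock on
     the same mutex must fail with EBUSY; it must not report success.
     Elision must not change this observable behavior.  */
  assert (pthread_mutex_trylock (&m) == EBUSY);

  assert (pthread_mutex_unlock (&m) == 0);
  return 0;
}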

> Elision enforces the invariant that either all parallel users on
> a lock are eliding or all users are locking. This is done by putting
> the lock value into the read-set, so any locker will abort any parallel
> transactions.

That's not the case if you consider just a single thread.  Here's an
example execution for a thread with some interference by another thread:

Assumption: mutex->__data.__elision is zero.

1) Thread A tries to acquire the lock (via __lll_lock_elision or
__lll_trylock_elision).  It reads a value of zero for *try_lock (ie,
mutex->__data.__elision), and then gets suspended.  No transactions so far.

2) Another thread does something with the lock, fails to use elision,
elision adapts, and mutex->__data.__elision ends up being greater than
zero.  This doesn't abort thread A.  The other thread is suspended,
leaving the lock unacquired.

3) We (thread A) resume execution.  Start elision for the lock.  We're
in a transaction now, but we haven't written to the lock.  Return to the
program, with the lock acquired.

4) We (thread A) call trylock for the same lock, and we end up in
__lll_trylock_elision.  We read *try_lock (mutex->__data.__elision), and
get a value greater than zero.  We don't use the elision code path, but
call lll_trylock, which will acquire the lock because nobody has
acquired it visibly.  This is incorrect; POSIX requires us to return
that the lock is already acquired.

So, the problem is that someone else might race with our load of
mutex->__data.__elision in the outermost transaction and with our
decision based on this load.  Similarly, the __elision field can be on a
different cache line than __data.__lock, so any pending write by another
thread to __elision (with a value > 0) can also trigger the same bug in
other executions.
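
To make the window easier to see, here is a heavily simplified sketch of
the decision structure described above.  It is not the actual patch code:
lock_word, adapt_count, and real_trylock are stand-ins for
mutex->__data.__lock, mutex->__data.__elision (*try_lock), and
lll_trylock, and the RTM intrinsics come from <immintrin.h> (compile
with -mrtm).  The comments mark where the steps above happen:

#include <errno.h>
#include <immintrin.h>

static int lock_word;    /* stand-in for mutex->__data.__lock  */
static int adapt_count;  /* stand-in for mutex->__data.__elision  */

/* Simplified real trylock: 0 on success, EBUSY if already locked.  */
static int
real_trylock (int *lock)
{
  return __sync_bool_compare_and_swap (lock, 0, 1) ? 0 : EBUSY;
}

static int
sketch_lock_elision (void)
{
  if (adapt_count <= 0)  /* Step 1: reads zero; thread A gets suspended here.  */
    {
      /* Step 2 happens now: another thread bumps adapt_count and leaves
         the lock unacquired.  Nothing aborts thread A; it is not in a
         transaction yet, so adapt_count is in no read set.  */
      if (_xbegin () == _XBEGIN_STARTED)  /* Step 3: transaction starts.  */
        {
          if (lock_word == 0)
            return 0;    /* "Acquired" by elision; lock_word stays zero.  */
          _xabort (0xff);
        }
    }
  /* Fall back to real locking (retries and adaptation omitted).  */
  return real_trylock (&lock_word);
}

static int
sketch_trylock_elision (void)
{
  if (adapt_count <= 0)
    {
      /* Elision code path: a nested trylock aborts the enclosing
         transaction, which is what preserves the POSIX behavior.  */
      if (_xtest ())
        _xabort (0xff);
      /* ... otherwise try to elide ...  */
    }

  /* Step 4: adapt_count is now > 0, so the branch above is skipped and
     we end up here; real_trylock succeeds because lock_word was never
     written, so the nested trylock wrongly reports success.  */
  return real_trylock (&lock_word);
}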


> > To do that, we recheck our elision decision in the transaction,
> > which ensures that the nested trylock will also execute the elision
> > path.  Another option to solve that would be to abort any transaction in
> > the non-elision code path too, but that might affect non-glibc-elision
> > transactions that use locks even more.
> 
> Also "try_lock" has nothing to do with trylocks, it just points
> to the adaption state field.  Adaptation state is always only a hint,
> so if you look at it for a correctness problem it's always not 
> correct.

That might be correct if you ignore the trylock requirements of POSIX
(ie, that trylock should return that the lock is already acquired, even
in the relock case).  But if you don't ignore them (ie, in your v12
patch set, assume "upgraded" in trylock to be nonzero), the behavior
differs depending on the value of try_lock / mutex->__data.__elision:
we abort in the elision code path, but not in the non-elision code path.
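
For completeness, here is the same sketch with the recheck described
above applied, reusing the stand-ins from the earlier sketch (again only
an illustration of the idea, not the actual patch):

static int
sketch_lock_elision_fixed (void)
{
  if (adapt_count <= 0)
    {
      if (_xbegin () == _XBEGIN_STARTED)
        {
          /* Recheck the elision decision inside the transaction.  This
             puts adapt_count into the read set, so a concurrent write
             aborts us, and it ensures that a nested trylock, which
             reads the same field within the same transaction, reaches
             the same decision and thus takes the elision code path
             (where it aborts, as required).  */
          if (adapt_count <= 0 && lock_word == 0)
            return 0;    /* Elided.  */
          _xabort (0xff);
        }
    }
  /* Fall back to real locking.  */
  return real_trylock (&lock_word);
}

The alternative mentioned above, aborting any transaction in the
non-elision code path too, would close the same window, but might affect
non-glibc-elision transactions that use locks even more.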


