This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Fix elision in nested trylock.


On Thu, Jun 27, 2013 at 10:54:44AM +0200, Torvald Riegel wrote:
> On Thu, 2013-06-27 at 02:41 +0200, Andi Kleen wrote:
> > On Thu, Jun 27, 2013 at 01:10:03AM +0200, Torvald Riegel wrote:
> 
> > Elision enforces the invariant that either all parallel users of
> > a lock are eliding or all users are locking. This is done by putting
> > the lock value into the read set, so any locker will abort any parallel
> > transactions.
> 
> That's not the case if you consider just a single thread.  Here's an
> example execution for a thread with some interference by another thread:
> 
> Assumptions: mutex->__data.__elision is zero.
> 
> 1) Thread A tries to acquire the lock (via __lll_lock_elision or
> __lll_trylock_elision).  It reads value zero for *try_lock (ie,
> mutex->__data.__elision).  Gets suspended.  No transactions so far.
> 
> 2) Another thread does something with the lock, fails to use elision,
> elision adapts, and mutex->__data.__elision ends up being greater than
> zero.  This doesn't abort thread A.  The other thread is suspended,
> leaving the lock unacquired.
> 
> 3) We (thread A) resume execution.  Start elision for the lock.  We're
> in a transaction now, but we haven't written to the lock.  Return to the
> program, with lock acquired.

Remark: The code either elides the lock _or_ acquires it, not
both at the same time.  In case of elision the lock will appear to
have been acquired only after the transaction completes.
 
> 4) We (thread A) call trylock for the same lock, and we end up in
> __lll_trylock_elision.  We read *try_lock (mutex->__data.__elision), and
> get a value greater than zero.

Remark: This would only happen if the write has already been flushed
to the point of unification of the two CPUs' caches.  If the write is
still pending, thread A aborts because of a read-after-write
conflict on mutex->__data.__elision (at least on z; I assume it is
the same on Intel).

> We don't use the elision code path, but
> call lll_trylock, which will acquire the lock because nobody has
> acquired it visibly.  This is incorrect; POSIX requires us to return
> that the lock is already acquired.
> 
> So, the problem is that someone else might race with our load of
> mutex->__data.__elision in the outermost transaction and our decision
> based on this load.

Actually, I think your patch is not really complete.  There is a
similar effect that plays a role only when elision for recursive
mutexes is implemented:

  * __elision starts with a value > 0
  * the mutex is locked recursively several times
  * each time one of the inner locks is released, __elision is
    decreased
  * if that happens often enough, the code will switch to using
    elision while the lock has been acquired in the outermost
    invocation.

==> When the outermost pthread_mutex_...lock() is called, it
    should really freeze the value of __elision so that all
    nested calls use the same strategy as the outermost call.

> Similarly, the __elision field can be on a
> different cacheline than __data.__lock, so any pending write by another
> thread to __elision (with a value > 0) can trigger the same bug also in
> other executions.

This can be fixed by just reading __elision right after the
outermost transaction starts, thus putting the cacheline with
__elision into the read set.  But that helps only if the
transaction has already begun, not for the race condition you
described.

P.S.:  Please do not cc me, I read the list.

Ciao

Dominik ^_^  ^_^

-- 

Dominik Vogt
IBM Germany

