This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: Lock elision problems in glibc-2.18

From: Torvald Riegel <triegel at redhat dot com>
To: libc-alpha at sourceware dot org
Cc: andi at firstfloor dot org
Date: Tue, 17 Sep 2013 15:59:54 +0200
Subject: Re: Lock elision problems in glibc-2.18
Authentication-results: sourceware.org; auth=none
References: <20130823084916 dot GA5506 at linux dot vnet dot ibm dot com> <1378901002 dot 3196 dot 14075 dot camel at triegel dot csb> <1379001387 dot 32370 dot 1509 dot camel at triegel dot csb> <87ioy36hx2 dot fsf at tassilo dot jf dot intel dot com> <1379321886 dot 32370 dot 3665 dot camel at triegel dot csb> <20130916102501 dot GA4391 at linux dot vnet dot ibm dot com> <1379343986 dot 32370 dot 4419 dot camel at triegel dot csb> <20130917104309 dot GA25252 at linux dot vnet dot ibm dot com> <1379415485 dot 32370 dot 5371 dot camel at triegel dot csb> <20130917124954 dot GA1335 at linux dot vnet dot ibm dot com>

On Tue, 2013-09-17 at 14:49 +0200, Dominik Vogt wrote:
> On Tue, Sep 17, 2013 at 12:58:05PM +0200, Torvald Riegel wrote:
> > On Tue, 2013-09-17 at 12:43 +0200, Dominik Vogt wrote:
> > > On Mon, Sep 16, 2013 at 05:06:26PM +0200, Torvald Riegel wrote:
> > > > On Mon, 2013-09-16 at 12:25 +0200, Dominik Vogt wrote:
> > > > > On Mon, Sep 16, 2013 at 10:58:06AM +0200, Torvald Riegel wrote:
> > > > > > On Fri, 2013-09-13 at 23:23 -0700, Andi Kleen wrote:
> > > > > > > Torvald Riegel <triegel@redhat.com> writes:
> > > > > > > 
> > > > > > > >  We need to do this, and if the app
> > > > > > > > misinterprets and *always* keeps retrying, it will hang.
> > > > > > > 
> > > > > > > The app is broken then. Retrying forever is simply not allowed.
> > > > > > > I don't think it makes sense to complicate glibc for this.
> > > 
> > > Anyway, even if the application does not retry forever, this is no
> > > guarantee that the glibc abort handler is ever called.
> > 
> > If it does not retry forever, it eventually will execute
> > nontransactionally.
> 
> That does not mean that it will execute the same code
> non-transactionally.  The user code might do something completely
> different that does not use pthread_mutex_lock at all.
> 
> >  Once it does, we've "peeled off" this layer of the
> > problem.  Thus, eventually, glibc's elision will constitute the
> > outermost transaction, so its abort handlers will be executed if
> > necessary.
> 
> > > The
> > > fallback code might use other mechanisms but pthread_mutex_lock
> > > for protection and thus the mutexes that aborted might not be used
> > > at all.
> > 
> > Right, but if pthread mutexes aren't used in the fallback, why do its
> > abort handlers need to be called?  IOW, in such a case the caller
> > effectively turns off elision.
> 
> The current implementation of lock elision expects that its abort
> handler will eventually be called to do the tuning.  I showed a
> scenario where this will not happen, so the tuning is effectively
> disabled in that scenario.  Let's think about the consequences if
> this happens.

Just to recap, in this scenario the caller both misinterprets abort
codes, and when it falls back to nontransactional execution eventually,
it executes code that doesn't make use of glibc's lock elision.

> The initial design idea of the tuning was (at least as far as I
> understood it) that it should automatically adapat and show
> "sensible" behaviour after a while that at least makes sure that
> there isn't a big performance loss.  Now we have a case where the
> tuning code is never even executed.  This is a contradiction to the
> design idea, so either the design or the implementation needs to be
> fixed (or my idea of what the design is).

I think the scenario above isn't a problem.  In the first phase of it,
when the caller executes transactionally but has to retry, no tuning of
nested code could ever take place unless the caller is aware of *and*
initiating the tuning of the nested code.  This is because we do abort
(for whatever reason), and once we do that we roll back any tuning as
well.  It is unreasonable to assume that any combination of caller and
callee would be integrated in terms of tuning.  It's also impractical to
assume that even the caller knows which callee needs to be tuned --
there's no way to track where the abort happened, and why (unless you
use performance counters or such).

In the second phase, the caller simply chooses to use other callees, and
then it doesn't matter whether glibc's elision tuning is used or not.

The situation that we want to avoid in this case is that if we abort due
to bad tuning (the only possible cause for this should be having to
abort on trylock(); anything else shouldn't make things worse given that
we're executing transactionally already).  But the only way out of this
is for the caller to not use transactional execution, and let callees
tune.  Thus, not misinterpreting abort codes, or being conservative on
explicit aborts, would be the right thing to do.

> Perhaps we should start with writing down the requirements for
> the tuning algorithm.  Just some bullets that come to my mind:
> 
>  * The ultimate goal is to use elision where it helps and to not
>    use it where it harms.
>  * A (possible recurring) training phase where the algorithm
>    performs badly is acceptable.
>  * After a training phase, a certain minimum or average
>    performance must be guaranteed in all cases.  If not, what are
>    the exceptions, and how do we deal with them (e.g. argue that
>    they are irrelevant (reasons), easy to avoid etc.).

Those sound good.

>  * If lock elision is ever to be enabled by default, the tuning
>    algorithm must make sure that no (relevant?) software shows
>    a significant (unacceptable) performance loss.

Possibly.  Although this is certainly a trade-off between average
performance gains and losses in particular cases.   We certainly want to
avoid cases where performance really crashes.

>  * If lock elision is _not_ going to be enabled by default, the
>    performance requirements could be much less strict.

Agreed, but to me the goal should be to try to enable it by default,
unless it's really not giving acceptable performance, or we end up not
being able to exploit it in the form of performance gains.

> I think the implementation of lock elision is more or less fixed
> regarding the use of transactions (on a specific platform).  The
> real freedom lies in the tuning algorithm that decides whether to
> try a transaction or not.

Agreed.  We solved the semantics issues, now it's all about the tuning
and seeing how we do on real-world code.

> > > Thus, the glibc code must not rely on its abort handlers
> > > ever being called.
> > 
> > I think that's not quite a precise characterization.  It can rely on the
> > abort handlers being called it started the outermost transaction.
> 
> Granted.  The current code does not check whether it is going to
> start an outermost transaction.  This can be implemented (Xtest
> for Tsx, Etnd for z), but it does not come for free.
> 
> > It
> > can also expect that callers with enclosing transactions will eventually
> > fall back to nontransactional execution.
> 
> Let's rather say:  Expect that the caller will eventually take
> measures to ensure forward progress.  This includes mechanisms like
> the aforementioned constrained transactions.

I believe that the primary use of constrained transactions is really
small transactions (eg, implementing things like a double-CAS, custom
concurrent algorithms, etc.); calling arbitrary code that you don't
control (eg, glibc) is unlikely to be a successful use.

Follow-Ups:
- Re: Lock elision problems in glibc-2.18
  - From: Dominik Vogt

References:
- Re: Lock elision problems in glibc-2.18
  - From: Torvald Riegel
- Re: Lock elision problems in glibc-2.18
  - From: Torvald Riegel
- Re: Lock elision problems in glibc-2.18
  - From: Andi Kleen
- Re: Lock elision problems in glibc-2.18
  - From: Torvald Riegel
- Re: Lock elision problems in glibc-2.18
  - From: Dominik Vogt
- Re: Lock elision problems in glibc-2.18
  - From: Torvald Riegel
- Re: Lock elision problems in glibc-2.18
  - From: Dominik Vogt
- Re: Lock elision problems in glibc-2.18
  - From: Torvald Riegel
- Re: Lock elision problems in glibc-2.18
  - From: Dominik Vogt

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]