
Re: [PATCH] powerpc: Spinlock optimization and cleanup


On 27-02-2015 09:20, Torvald Riegel wrote:
> On Thu, 2015-02-26 at 18:35 -0300, Adhemerval Zanella wrote:
>> This patch optimizes powerpc spinlock implementation by:
>>
>> * The current algorithm spins on a lwzx, but only after issuing a
>>   lwarx first.  The optimization is to avoid that first lwarx in the
>>   contended case (it is much more costly than normal loads).
> IOW, the previous code was a TAS-and-test-and-TAS loop, and you're
> changing it into a test-and-TAS loop?

If by 'test-and-TAS loop' you mean that it now loops on the lwarx instruction,
that is not the case: the first test is a 'sniff' check on the value, to avoid
the costly LL/SC instruction that creates the cacheline reservation.  While the
lock is held, the spinning is still done using normal load operations.
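
In rough pseudo-C, the acquire path now has the following shape (an
illustrative sketch using GCC __atomic builtins, not the patch itself,
which is powerpc assembly):

/* Sketch only: sniff with a normal load first, and only attempt the
   LL/SC path once the lock looks free.  */
static void
spin_lock_sketch (int *lock)
{
  for (;;)
    {
      /* Normal load: no cacheline reservation is taken while the lock
         is held, so no extra LL/SC bus traffic is generated.  */
      while (__atomic_load_n (lock, __ATOMIC_RELAXED) != 0)
        ;
      /* The lock looked free: now try the costly LL/SC acquire.  */
      if (__atomic_exchange_n (lock, 1, __ATOMIC_ACQUIRE) == 0)
        return;
    }
}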

>
> Are you sure that this gives enough benefit for the contended case to
> offset potential disadvantages for the uncontended case in which the
> lock is not yet in the cache?  That is, whether the unnecessary
> transition to write ownership when the lock is already acquired is more
> important to avoid than the potential transition to shared first
> followed by a transition to ownership?
> (I'm not familiar enough with powerpc to answer any of that, I just want
> to make sure the reasons for our decisions are clear and documented.
> I've done a few measurements on other architectures in the past, and
> there the test-and-TAS wasn't necessarily better...)

In fact this hint, that a test sniff should be used, came directly from
our chip designers.  However, they were not really clear on why the sniff
is necessary, just that lwarx is indeed much more costly and can increase
contention and internal bus traffic in the contended case.  So, in short,
I am not 100% sure this is the best course of action.

>
> Do you have benchmarks to show the effects?  In this particular case,
> I'd trust the machine maintainers with keeping the decisions we make
> up-to-date wrt current hardware (ie, maintain the code so that we always
> make the best trade-offs in the future); but it would be even better if
> we had benchmarks.

That's the tricky part: we are discussing this w.r.t. some internal workloads
and results are still yet to come.  However, I am trying to check some results
myself using a subpar benchmark (basically sysbench memory using a spinlock
instead of pthread_mutex_t).  I am seeing some improvements with high thread
counts, but also some regressions with lower thread counts.  The lwarx EH hint
bit also makes a difference, again showing both gains and regressions.  In the
end, I think the workloads I am using are not well suited for this kind of
evaluation.  Any suggestions?

> Have you investigated with adding back-off for the contended case?

That is something I have not investigated yet.
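
To be concrete about what I understand by back-off here (purely
hypothetical, not in the patch): after observing the lock held, wait an
exponentially growing number of iterations before sniffing it again,
e.g.:

static void
spin_lock_backoff_sketch (int *lock)
{
  /* Initial spin count and cap are guesses; they would need tuning.  */
  unsigned int delay = 8;

  while (__atomic_exchange_n (lock, 1, __ATOMIC_ACQUIRE) != 0)
    {
      do
        {
          for (unsigned int i = 0; i < delay; i++)
            __asm__ __volatile__ ("");   /* compiler barrier as a pause */
          if (delay < 1024)
            delay *= 2;                  /* exponential back-off */
        }
      while (__atomic_load_n (lock, __ATOMIC_RELAXED) != 0);
    }
}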

>
> What are your assumptions about the workload and the relative importance
> of individual cases (e.g., likelihood of contended vs. uncontended)?  We
> should agree on those globally (ie, across archs), and document them.

In fact I did not make any particular assumptions; however, on recent powerpc
ports I have been involved with, I have noticed that spinlocks are used by
various DBs (and usually hand-crafted ones...).  I do not have data about
which case they assume either, though.

>
>> * Use the correct EH hint bit on the larx for supported ISAs.  For lock
>>   acquisition, the thread that acquired the lock with a successful stwcx.
>>   does not want to give away the write ownership on the cacheline.  The
>>   idea is to make the load reservation "sticky" about retaining write
>>   authority to the line.  That way, the store that must inevitably come
>>   to release the lock can succeed quickly and not contend with other
>>   threads issuing lwarx.  If another thread does a store to the line
>>   (false sharing), the winning thread must give up write authority to
>>   the line.  The proper value of EH for the larx for a lock acquisition
>>   is 1.
> I would strongly prefer if we could use C code for our locks, including
> the spinlocks.
>
> Ignoring the cacheline ownership hint, is there any reason to use
> assembly here (with either the current assembly implementation of
> atomics that we have, or the new GCC __atomic builtins)?  If so, what
> could / would we need to fix in these implementations of atomics?  Is
> there a measurable difference?
>
> Regarding the ownership hint, what about extending the atomics that we
> have with support for that?  This hint could be useful in other
> concurrent code too, and we don't want to have to clone the, say, whole
> rwlock implementation for powerpc just to add a few such hints.

We already have support for the EH hint in the LL/SC instructions for powerpc.
The problem is that this algorithm does not really fit the atomic semantics we
have so far, as '__arch_compare_and_exchange_val_32_acq' issues an 'isync' in
both the fail and success cases.  The idea here is to emit the 'isync' only
after the stwcx. instruction succeeds, and to loop checking for the lock to be
released using normal loads 'inside' the CAS-acquire.  Suggestions?
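
As an illustration of what I mean (sketch only, not the patch; 32-bit
mnemonics, and the EH operand on lwarx needs a new enough ISA/assembler,
otherwise the trailing ',1' has to be dropped):

/* Sketch only: the isync is emitted solely on the successful stwcx.
   path, and a plain lwz is used to spin while the lock is held; the
   current acquire-CAS macro cannot express this.  */
static inline void
spin_lock_asm_sketch (int *lock)
{
  int tmp;
  __asm__ __volatile__ (
    "1: lwarx   %0,0,%1,1\n"  /* load-reserve, EH=1 on supported ISAs */
    "   cmpwi   %0,0\n"
    "   bne     2f\n"         /* lock held: go spin on normal loads */
    "   stwcx.  %2,0,%1\n"
    "   bne-    1b\n"         /* reservation lost: retry */
    "   isync\n"              /* acquire barrier, success path only */
    "   b       3f\n"
    "2: lwz     %0,0(%1)\n"   /* normal load, no reservation taken */
    "   cmpwi   %0,0\n"
    "   bne     2b\n"
    "   b       1b\n"
    "3:\n"
    : "=&r" (tmp)
    : "b" (lock), "r" (1)
    : "cr0", "memory");
}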

>
>> diff --git a/sysdeps/powerpc/nptl/pthread_spin_unlock.c b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
>> new file mode 100644
>> index 0000000..7af694f
>> --- /dev/null
>> +++ b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
>> @@ -0,0 +1,28 @@
>> +/* pthread_spin_unlock -- unlock a spin lock.  PowerPC version.
>> +   Copyright (C) 2007-2015 Free Software Foundation, Inc.
>> +   This file is part of the GNU C Library.
>> +
>> +   The GNU C Library is free software; you can redistribute it and/or
>> +   modify it under the terms of the GNU Lesser General Public
>> +   License as published by the Free Software Foundation; either
>> +   version 2.1 of the License, or (at your option) any later version.
>> +
>> +   The GNU C Library is distributed in the hope that it will be useful,
>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +   Lesser General Public License for more details.
>> +
>> +   You should have received a copy of the GNU Lesser General Public
>> +   License along with the GNU C Library; if not, see
>> +   <http://www.gnu.org/licenses/>.  */
>> +
>> +#include "pthreadP.h"
>> +#include <lowlevellock.h>
>> +
>> +int
>> +pthread_spin_unlock (pthread_spinlock_t *lock)
>> +{
>> +  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
>> +  *lock = 0;
>> +  return 0;
>> +}
> Is there any reason to not use "atomic_store_release (lock)" here?  Or
> adapt the generic pthread_spin_unlock to use just a release barrier
> (which I'd be open to doing too), and get rid of the powerpc-specific
> file?

atomic_store_release (lock, 0) generates the correct code; I will change it, thanks.
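
I.e. the file would reduce to (or the generic version could be adapted to):

int
pthread_spin_unlock (pthread_spinlock_t *lock)
{
  atomic_store_release (lock, 0);
  return 0;
}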

