This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Mutex: Use test and cmpxchg instead of cmpxchg while spinning


Hi Carlos and other gentle maintainers,
   May I ask you to take another look at this patch?  The analysis and
testing done so far show that it is indeed helpful on the x86 architecture.
Thanks

On 2018-07-30 08:42, Kemi Wang wrote:
> The pthread adaptive spin mutex uses cmpxchg rather than test and cmpxchg
> while spinning on the lock. The former is very unfriendly to the uncore
> because it constantly floods the system with "read for ownership" (RFO)
> requests, which are much more expensive to process than a single read.
> 
> ============================Some Background===============================
> For Intel x86 architecture:
> Memory coherence between cores is handled by the "uncore" (roughly equates
> to logic outside the CPU cores but residing on the same die). Cores can ask
> the uncore to "read" or "read for ownership".  If the request is "read"
> then generally the cacheline will wind up being marked SHARED in all caches
> that have a copy, or will wind up EXCLUSIVE in the requesting core if no
> one else had it before.  If the request is "read for ownership" any other
> copies will be invalidated (or written-back and invalidated if MODIFIED)
> and the requesting core will get the line in EXCLUSIVE mode.
> 
> When a read for ownership comes to the uncore, all copies in other caches
> must be invalidated.  If there is only one other copy, it may be MODIFIED
> and will have to be written back before being invalidated.
> 
> The uncore does these things by sending "snoops" to affected cores on the
> same or other sockets and waiting for the answers.
> 
> When a read comes to the uncore, if more than one copy is already present
> in cores, then adding another one does not require a snoop.  If there is
> only one other copy, a snoop is required to find out what state it is in
> and to force a transition to SHARED state.
> 
> A lock cmpxchg prevents any other core from modifying the line between the
> read phase of cmpxchg and the (potential) write phase.  The way this is
> done is to "read for ownership" the line, which will make it EXCLUSIVE.
> Then the core defers responding to a snoop from the uncore until the write
> phase is complete.  This prevents any other core from acquiring the
> necessary exclusive access to the cacheline until the instruction is
> complete.
> 
> With this background, we can explain how lock cmpxchg works and understand
> the benefit of test and cmpxchg over plain cmpxchg for implementing locks.
> 
> ============================Performance Impact============================
> Now think about the case that a lock is in alternate use by two cores.  At
> the conclusion of an unlock, the lock's cacheline is in MODIFIED state in
> core A which did the unlock.
> 
> If core B tries for a lock while core A is holding it, then core B will
> take away the cache line from core A and only then find out the lock is
> held.  Core B will then spin, testing the lock over and over again and
> making no progress until core A finally releases it.  The release will be
> expensive because core A will have to do an RFO to get the line back, and
> then hold off core B's snoops until it can release the lock.  Then core B
> will grab it back and successfully acquire.
> 
> In the two-contending cores case, lock-cmpxchg is not too bad.  The real
> problem happens with three or more contenders.  In this case, every time
> one of the contenders attempts to lock the lock with a lock cmpxchg, a
> flurry of RFO transactions happens.
> 
> Generally, test and test and set is better.  To acquire the lock, you first
> "test" with an ordinary MOV and only when the lock appears free do you try
> the lock cmpxchg.  The MOV puts the line into SHARED state, and everyone
> waiting can poll locally in their own caches without causing any uncore
> traffic. When core A finally unlocks, the uncore will have to invalidate
> everyone, then grant ownership to A for the unlock.  Then all the other
> cores will pile on again and one of them will win.
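> 
> For illustration only (not part of the patch), a test-and-test-and-set
> spin loop built on GCC's __atomic builtins would look roughly like the
> sketch below; ttas_lock is a made-up name, and glibc's internal atomic.h
> macros are deliberately not used here:
> 
>   /* Poll with plain loads (keeping the line SHARED) and only attempt
>      the lock cmpxchg once the lock looks free.  */
>   static void
>   ttas_lock (int *lock)
>   {
>     for (;;)
>       {
>         while (__atomic_load_n (lock, __ATOMIC_RELAXED) != 0)
>           __builtin_ia32_pause ();  /* like atomic_spin_nop ()  */
> 
>         int expected = 0;
>         if (__atomic_compare_exchange_n (lock, &expected, 1,
>                                          0 /* strong */,
>                                          __ATOMIC_ACQUIRE,
>                                          __ATOMIC_RELAXED))
>           return;  /* Lock acquired.  */
>       }
>   }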
> 
> Now, let's go back and look at the impact of test and cmpxchg on the
> adaptive mutex:
> For the case with no or little lock contention, there is almost no
> difference in lock performance between the two approaches, because the
> lock is usually acquired via an immediate get in the fast path.
> 
> For the case with slight lock contention, where the lock is usually
> acquired via either an immediate get or a spinning get, test and cmpxchg
> performs better than the cmpxchg approach even though it has one extra
> test operation.  This is probably because, for lock acquisition (which
> decodes to "lock cmpxchg") and lock release (which decodes to "lock dec"),
> the uncore snoops triggered by the requesting core's RFOs can be answered
> immediately, since there are fewer in-flight snoops.
> 
> For the case of severe lock contention (e.g. when the number of contending
> threads is large), the lock performance improvement may not be obvious
> even though the number of RFOs is reduced significantly.  This is because
> the behavior of the adaptive mutex changes and some lock acquisitions
> shift from spinning gets to wake-ups; in that case, lock performance is
> dominated by the overhead of the futex syscall and context switches.  Even
> so, test and cmpxchg still has value because unnecessary uncore traffic is
> reduced across the whole system.
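> 
> For reference, with this patch the adaptive spin loop in
> __pthread_mutex_lock reads roughly as follows (a simplified excerpt; the
> surrounding code, including the __spins update, is unchanged by the
> patch):
> 
>   if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>     {
>       int cnt = 0;
>       int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
>                          mutex->__data.__spins * 2 + 10);
>       do
>         {
>           if (cnt++ >= max_cnt)
>             {
>               /* Spin budget exhausted: block in the kernel.  */
>               LLL_MUTEX_LOCK (mutex);
>               break;
>             }
>           /* New: wait with plain loads until the lock looks free or
>              the spin budget runs out, instead of pausing once and
>              immediately retrying the lock cmpxchg.  */
>           atomic_spin_lock (&mutex->__data.__lock, &cnt, max_cnt);
>         }
>       while (LLL_MUTEX_TRYLOCK (mutex) != 0);
> 
>       mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
>     }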
> 
> ================================Testing===================================
> Test Machine:
> Intel 2-socket Skylake platform (56 cores, 62 GB RAM)
> 
> Methodology:
> Let's assume the size of the critical section is represented by *s*, the
> size of the non-critical section is represented by *t*, and let t = k*s.
> Then, on a single thread, the arrival rate at which a single core tries to
> acquire the lock, in the absence of contention, is 1/(k+1).  We also
> assume there are *n* threads contending for the lock; each thread is bound
> to an individual CPU core and does the following:
> 1) lock
> 2) spend *s* nanoseconds in the critical section
> 3) unlock
> 4) spend *t* nanoseconds in the non-critical section
> in a loop until it reaches 100,000 iterations.  Performance is measured by
> the number of RFOs and by the average latency of lock acquisition and lock
> release.
> *Note*: the latency is measured in CPU cycles with the help of the RDTSCP
> instruction.  The starting timestamp is taken just before calling
> lock/unlock, and the ending timestamp is taken as soon as lock/unlock
> returns; the delta is the latency of lock acquisition and lock release,
> respectively.  Invalid samples (e.g. where an interrupt was handled on
> that core while calling lock/unlock) have been discarded from the
> statistics.
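> 
> For clarity, a skeleton of the per-thread measurement loop is shown below.
> This is only an illustration of the methodology, not the actual benchmark;
> it assumes the mutex has been initialized as PTHREAD_MUTEX_ADAPTIVE_NP,
> that each thread has been pinned to its own core, and delay_ns is a
> hypothetical busy-wait helper:
> 
>   #include <pthread.h>
>   #include <x86intrin.h>
> 
>   #define ITERS 100000
> 
>   static pthread_mutex_t lock;  /* initialized as PTHREAD_MUTEX_ADAPTIVE_NP */
> 
>   /* Hypothetical helper: burn roughly NS nanoseconds without touching
>      the lock's cacheline (assumes ~1 cycle per nanosecond).  */
>   static void
>   delay_ns (unsigned long long ns)
>   {
>     unsigned int aux;
>     unsigned long long start = __rdtscp (&aux);
>     while (__rdtscp (&aux) - start < ns)
>       ;
>   }
> 
>   static void *
>   worker (void *arg)
>   {
>     unsigned int aux;
>     for (int i = 0; i < ITERS; i++)
>       {
>         unsigned long long t0 = __rdtscp (&aux);
>         pthread_mutex_lock (&lock);
>         unsigned long long acquire_cycles = __rdtscp (&aux) - t0;
> 
>         delay_ns (200);          /* critical section, s = 200 ns  */
> 
>         t0 = __rdtscp (&aux);
>         pthread_mutex_unlock (&lock);
>         unsigned long long release_cycles = __rdtscp (&aux) - t0;
> 
>         /* acquire_cycles and release_cycles would be accumulated here.  */
>         delay_ns (6 * 200);      /* non-critical section, t = k*s, k = 6  */
>       }
>     return NULL;
>   }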
> 
> To emulate different usage scenarios, we let k=6 and s=200ns and run this
> workload with the different spinning methods.  In our workload, [1-5]
> threads contending for the lock emulate little or no lock contention,
> [6-15] threads emulate slight lock contention, and [16-56] threads emulate
> severe lock contention across sockets (the benchmark is provided by Andi
> Kleen).
> 
> Test Command:
> perf stat -a -e offcore_requests.demand_rfo ./run_workload
> 
> Result:
> Generally, test and cmpxchg performs consistently better than cmpxchg
> under lock contention; see the details below:
> For the case with little or no lock contention, there is no obvious
> difference in lock performance between the two approaches.
> For the case with slight lock contention (using thread number 8 as an
> example), the test and cmpxchg way performs better than the cmpxchg way,
> the average latency of lock acquisition is reduced from 7501 cycles to
> 5396 cycles (-28.1%), the average latency of lock release is reduced from
> 1704 cycles to 1095 cycles (-35.7%), and total RFOs number is reduced from
> 22239643 to 12484358 (-43.9%).
> For the case with severe lock contention (using thread number 56 as an
> example), the test and cmpxchg way performs better than the cmpxchg way,
> the average latency of lock acquisition is reduced from 136787 cycles to
> 106050 cycles (-22.5%), the average latency of lock release is reduced from
> 57787 cycles to 49895 cycles (-13.7%), and total RFOs number is reduced
> from 317483283 to 215756961 (-32.0%).
> 
> Many thanks to H.J. Lu for helping refine this patch, and to Stewart,
> Lawrence for explaining these "lock cmpxchg" matters to me in detail.
> 
>     * nptl/pthread_mutex_lock.c: Use architecture-specific atomic spin API
>     * nptl/pthread_mutex_timedlock.c: Likewise
>     * nptl/pthread_spinlock.h: New file
>     * sysdeps/unix/sysv/linux/x86/pthread_spinlock.h: New file
> 
> Suggested-by: Andi Kleen <andi.kleen@intel.com>
> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
> ---
>  nptl/pthread_mutex_lock.c                      |  3 ++-
>  nptl/pthread_mutex_timedlock.c                 |  4 ++--
>  nptl/pthread_spinlock.h                        | 23 +++++++++++++++++++
>  sysdeps/unix/sysv/linux/x86/pthread_spinlock.h | 31 ++++++++++++++++++++++++++
>  4 files changed, 58 insertions(+), 3 deletions(-)
>  create mode 100644 nptl/pthread_spinlock.h
>  create mode 100644 sysdeps/unix/sysv/linux/x86/pthread_spinlock.h
> 
> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> index 1519c14..c910ec4 100644
> --- a/nptl/pthread_mutex_lock.c
> +++ b/nptl/pthread_mutex_lock.c
> @@ -25,6 +25,7 @@
>  #include "pthreadP.h"
>  #include <atomic.h>
>  #include <lowlevellock.h>
> +#include <pthread_spinlock.h>
>  #include <stap-probe.h>
>  
>  #ifndef lll_lock_elision
> @@ -133,7 +134,7 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>  		  LLL_MUTEX_LOCK (mutex);
>  		  break;
>  		}
> -	      atomic_spin_nop ();
> +	      atomic_spin_lock (&mutex->__data.__lock, &cnt, max_cnt);
>  	    }
>  	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>  
> diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c
> index 28237b0..2ede5a0 100644
> --- a/nptl/pthread_mutex_timedlock.c
> +++ b/nptl/pthread_mutex_timedlock.c
> @@ -25,7 +25,7 @@
>  #include <atomic.h>
>  #include <lowlevellock.h>
>  #include <not-cancel.h>
> -
> +#include <pthread_spinlock.h>
>  #include <stap-probe.h>
>  
>  #ifndef lll_timedlock_elision
> @@ -126,7 +126,7 @@ __pthread_mutex_timedlock (pthread_mutex_t *mutex,
>  					  PTHREAD_MUTEX_PSHARED (mutex));
>  		  break;
>  		}
> -	      atomic_spin_nop ();
> +	      atomic_spin_lock (&mutex->__data.__lock, &cnt, max_cnt);
>  	    }
>  	  while (lll_trylock (mutex->__data.__lock) != 0);
>  
> diff --git a/nptl/pthread_spinlock.h b/nptl/pthread_spinlock.h
> new file mode 100644
> index 0000000..8bd7c16
> --- /dev/null
> +++ b/nptl/pthread_spinlock.h
> @@ -0,0 +1,23 @@
> +/* Functions for pthread_spinlock_t.
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +static __always_inline void
> +atomic_spin_lock (pthread_spinlock_t *lock, int *cnt_p, int max_cnt)
> +{
> +  atomic_spin_nop ();
> +}
> diff --git a/sysdeps/unix/sysv/linux/x86/pthread_spinlock.h b/sysdeps/unix/sysv/linux/x86/pthread_spinlock.h
> new file mode 100644
> index 0000000..5ca84d1
> --- /dev/null
> +++ b/sysdeps/unix/sysv/linux/x86/pthread_spinlock.h
> @@ -0,0 +1,31 @@
> +/* Functions for pthread_spinlock_t.  X86 version.
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +static __always_inline void
> +atomic_spin_lock (pthread_spinlock_t *lock, int *cnt_p, int max_cnt)
> +{
> +  int val = 0;
> +  int cnt = *cnt_p;
> +  do
> +    {
> +      atomic_spin_nop ();
> +      val = atomic_load_relaxed (lock);
> +    }
> +  while (val != 0 && ++cnt < max_cnt);
> +  *cnt_p = cnt;
> +}
> 

