[PATCH v4 0/3] Optimize CAS [BZ #28537]

Paul E Murphy murphyp@linux.ibm.com
Wed Nov 10 14:26:09 GMT 2021


On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> The CAS instruction is expensive.  From the x86 CPU's point of view, getting
> a cache line for writing is more expensive than reading.  See Appendix
> A.2 Spinlock in:
> 
> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> 
> The full compare and swap will grab the cache line exclusive and cause
> excessive cache line bouncing.
> 
> Optimize CAS in low level locks and pthread_mutex_lock.c:
> 
> 1. Do an atomic load first and skip the CAS if the compare may fail, to
> reduce cache line bouncing on contended locks.
> 2. Replace atomic_compare_and_exchange_bool_acq with
> atomic_compare_and_exchange_val_acq to avoid the extra load.
> 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> don't know if it's actually rare; in the contended case it is clearly not
> rare.

Are you able to share benchmarks of this change? I am curious what 
effects this might have on other platforms.
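
On point 2, the same idea exists in standard C11 atomics: a failed
compare_exchange writes the value it actually observed back into `expected`,
mirroring how the val form of the glibc macro returns the old value, so the
retry loop needs no separate reload. A hypothetical sketch (the function name
is mine, for illustration):

```c
#include <stdatomic.h>

/* Hypothetical retry loop: on failure, atomic_compare_exchange_strong
   stores the value it saw into 'expected', so the next attempt can
   reuse it directly instead of issuing an extra atomic load.  */
static int
fetch_or_one (atomic_int *word)
{
  int expected = atomic_load_explicit (word, memory_order_relaxed);
  while (!atomic_compare_exchange_strong_explicit (word, &expected,
                                                   expected | 1,
                                                   memory_order_acquire,
                                                   memory_order_relaxed))
    ;  /* 'expected' already holds the freshly observed value; retry.  */
  return expected;
}
```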

> 
> This is the first patch set to optimize CAS.  I will investigate the
> remaining CAS usages in glibc after this patch set has been accepted.
> 
> H.J. Lu (3):
>    Reduce CAS in low level locks [BZ #28537]
>    Reduce CAS in __pthread_mutex_lock_full [BZ #28537]
>    Optimize CAS in __pthread_mutex_lock_full [BZ #28537]
> 
>   nptl/lowlevellock.c         | 12 ++++-----
>   nptl/pthread_mutex_lock.c   | 53 ++++++++++++++++++++++++++++---------
>   sysdeps/nptl/lowlevellock.h | 29 +++++++++++++-------
>   3 files changed, 67 insertions(+), 27 deletions(-)
> 