[PATCH v4 0/3] Optimize CAS [BZ #28537]

Wed Nov 10 21:33:26 GMT 2021

On Wed, Nov 10, 2021 at 12:07 PM Paul A. Clarke <pc@us.ibm.com> wrote:
>
> On Wed, Nov 10, 2021 at 08:26:09AM -0600, Paul E Murphy via Libc-alpha wrote:
> > On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> > > The CAS instruction is expensive.  From the x86 CPU's point of view, getting
> > > a cache line for writing is more expensive than reading.  See Appendix
> > > A.2 Spinlock in:
> > >
> > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> > >
> > > The full compare and swap will grab the cache line exclusive and cause
> > > excessive cache line bouncing.
> > >
> > > Optimize CAS in low level locks and pthread_mutex_lock.c:
> > >
> > > 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> > > line bouncing on contended locks.
> > > 2. Replace atomic_compare_and_exchange_bool_acq with
> > > atomic_compare_and_exchange_val_acq to avoid the extra load.
> > > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> > > don't know if it's actually rare; in the contended case it is clearly not
> > > rare.
> >
> > Are you able to share benchmarks of this change? I am curious what effects
> > this might have on other platforms.
>
> I'd like to see the expected performance results, too.
>
> For me, the results are not uniformly positive (Power10).
> From bench-pthread-locks:
>
>                          bench   bench-patched  improvement
> mutex-empty              4.73371 4.54792   3.9%
> mutex-filler             18.5395 18.3419   1.1%
> mutex_trylock-empty      10.46   2.46364  76.4%
> mutex_trylock-filler     16.2188 16.1758   0.3%
> rwlock_read-empty        16.5118 16.4681   0.3%
> rwlock_read-filler       20.68   20.4416   1.2%
> rwlock_tryread-empty     2.06572 2.17284  -5.2%
> rwlock_tryread-filler    16.082  16.1215  -0.2%
> rwlock_write-empty       31.3723 31.259    0.4%
> rwlock_write-filler      41.6492 69.313  -66.4%
> rwlock_trywrite-empty    2.20584 2.32178  -5.3%
> rwlock_trywrite-filler   15.7044 15.9088  -1.3%
> spin_lock-empty          16.7964 16.7731   0.1%
> spin_lock-filler         20.6118 20.4175   0.9%
> spin_trylock-empty       8.99989 8.98879   0.1%
> spin_trylock-filler      16.4732 15.9957   2.9%
> sem_wait-empty           15.805  15.7391   0.4%
> sem_wait-filler          19.2346 19.5098  -1.4%
> sem_trywait-empty        2.06405 2.03782   1.3%
> sem_trywait-filler       15.921  15.8408   0.5%
> condvar-empty            1385.84 1387.29  -0.1%
> condvar-filler           1419.82 1424.01  -0.3%
> consumer_producer-empty  2550.01 2395.29   6.1%
> consumer_producer-filler 2709.4  2558.28   5.6%

Small regressions on uncontended locks are expected due to the extra
check.  What do you get with my current branch?

https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/x86/atomic-nptl
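
For reference, the "extra check" is the plain load done before the CAS.
Below is a minimal sketch of that pattern.  It uses C11 <stdatomic.h>
rather than glibc's internal atomic macros, and the names sketch_lock_t,
sketch_trylock and sketch_unlock are hypothetical, so it illustrates the
idea only and is not the code in the patch:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock word for illustration: 0 = unlocked, 1 = locked.  */
typedef struct { atomic_int val; } sketch_lock_t;

/* Try to take the lock once.  Returns true if it was acquired.  */
static bool
sketch_trylock (sketch_lock_t *l)
{
  /* Extra check: a relaxed load does not need the cache line in
     exclusive state.  If the lock already looks taken, skip the CAS.  */
  if (atomic_load_explicit (&l->val, memory_order_relaxed) != 0)
    return false;

  /* The lock looked free, so attempt the CAS.  It can still fail if
     another thread won the race after the load above.  */
  int expected = 0;
  return atomic_compare_exchange_strong_explicit (&l->val, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed);
}

/* Release the lock.  */
static void
sketch_unlock (sketch_lock_t *l)
{
  atomic_store_explicit (&l->val, 0, memory_order_release);
}
```

On an uncontended lock the load is pure overhead on top of the CAS, which
is where a small regression would come from.  On a contended lock it
avoids pulling the cache line exclusive just to fail the compare.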

BTW, how did you compare the 2 results?  I tried compare_bench.py
and got:

Traceback (most recent call last):
  File "/export/gnu/import/git/gitlab/x86-glibc/benchtests/scripts/compare_bench.py", line 196, in <module>
    main(args.bench1, args.bench2, args.schema, args.threshold, args.stats)
  File "/export/gnu/import/git/gitlab/x86-glibc/benchtests/scripts/compare_bench.py", line 165, in main
    bench1 = bench.parse_bench(bench1, schema)
  File "/export/ssd/git/gitlab/x86-glibc/benchtests/scripts/import_bench.py", line 137, in parse_bench
    bench = json.load(benchfile)
  File "/usr/lib64/python3.10/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib64/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.10/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 18 (char 17)

-- 
H.J.