[PATCH v4 0/3] Optimize CAS [BZ #28537]
Paul A. Clarke
pc@us.ibm.com
Thu Nov 11 00:30:21 GMT 2021
On Wed, Nov 10, 2021 at 01:33:26PM -0800, H.J. Lu wrote:
> On Wed, Nov 10, 2021 at 12:07 PM Paul A. Clarke <pc@us.ibm.com> wrote:
> >
> > On Wed, Nov 10, 2021 at 08:26:09AM -0600, Paul E Murphy via Libc-alpha wrote:
> > > On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> > > > CAS instruction is expensive. From the x86 CPU's point of view, getting
> > > > a cache line for writing is more expensive than reading. See Appendix
> > > > A.2 Spinlock in:
> > > >
> > > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> > > >
> > > > The full compare and swap will grab the cache line exclusive and cause
> > > > excessive cache line bouncing.
> > > >
> > > > Optimize CAS in low level locks and pthread_mutex_lock.c:
> > > >
> > > > 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> > > > line bouncing on contended locks.
> > > > 2. Replace atomic_compare_and_exchange_bool_acq with
> > > > atomic_compare_and_exchange_val_acq to avoid the extra load.
> > > > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> > > > don't know if it's actually rare; in the contended case it is clearly not
> > > > rare.
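For anyone following along, item 1 above (load first, skip the CAS if the compare would fail) can be sketched roughly as below. This is only an illustration using C11 atomics, not the glibc code from the patch; sketch_lock, sketch_trylock and sketch_unlock are made-up names, and the lock word encoding (0 = free, 1 = held) is an assumption.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock word: 0 = unlocked, 1 = locked.  */
typedef struct
{
  atomic_int word;
} sketch_lock;

/* Try to take the lock once.  Returns true on success.  */
static bool
sketch_trylock (sketch_lock *l)
{
  /* Plain load first: if the lock is visibly held, the CAS would fail
     anyway, so skip it and avoid requesting the cache line for
     writing.  */
  if (atomic_load_explicit (&l->word, memory_order_relaxed) != 0)
    return false;

  /* The lock looked free, so attempt the real compare-and-swap.  It can
     still fail if another thread took the lock in the meantime.  */
  int expected = 0;
  return atomic_compare_exchange_strong_explicit (&l->word, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed);
}

/* Release the lock.  */
static void
sketch_unlock (sketch_lock *l)
{
  atomic_store_explicit (&l->word, 0, memory_order_release);
}
```

The extra load is the "extra check" that costs a little on uncontended locks; the hoped-for win is only in the contended case, where failed CAS attempts would otherwise bounce the cache line.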
> > >
> > > Are you able to share benchmarks of this change? I am curious what effects
> > > this might have on other platforms.
> >
> > I'd like to see the expected performance results, too.
> >
> > For me, the results are not uniformly positive (Power10).
> > From bench-pthread-locks:
> >
> >                           bench     bench-patched
> > mutex-empty               4.73371   4.54792     3.9%
> > mutex-filler              18.5395   18.3419     1.1%
> > mutex_trylock-empty       10.46     2.46364    76.4%
> > mutex_trylock-filler      16.2188   16.1758     0.3%
> > rwlock_read-empty         16.5118   16.4681     0.3%
> > rwlock_read-filler        20.68     20.4416     1.2%
> > rwlock_tryread-empty      2.06572   2.17284    -5.2%
> > rwlock_tryread-filler     16.082    16.1215    -0.2%
> > rwlock_write-empty        31.3723   31.259      0.4%
> > rwlock_write-filler       41.6492   69.313    -66.4%
> > rwlock_trywrite-empty     2.20584   2.32178    -5.3%
> > rwlock_trywrite-filler    15.7044   15.9088    -1.3%
> > spin_lock-empty           16.7964   16.7731     0.1%
> > spin_lock-filler          20.6118   20.4175     0.9%
> > spin_trylock-empty        8.99989   8.98879     0.1%
> > spin_trylock-filler       16.4732   15.9957     2.9%
> > sem_wait-empty            15.805    15.7391     0.4%
> > sem_wait-filler           19.2346   19.5098    -1.4%
> > sem_trywait-empty         2.06405   2.03782     1.3%
> > sem_trywait-filler        15.921    15.8408     0.5%
> > condvar-empty             1385.84   1387.29    -0.1%
> > condvar-filler            1419.82   1424.01    -0.3%
> > consumer_producer-empty   2550.01   2395.29     6.1%
> > consumer_producer-filler  2709.4    2558.28     5.6%
>
> Small regressions on uncontended locks are expected due to extra
> check. What do you get with my current branch
>
> https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/x86/atomic-nptl

                          bench     bench-hjl
mutex-empty               4.73371   4.65279     1.7%
mutex-filler              18.5395   18.3971     0.8%
mutex_trylock-empty       10.46     10.1671     2.8%
mutex_trylock-filler      16.2188   16.7105    -3.0%
rwlock_read-empty         16.5118   16.4697     0.3%
rwlock_read-filler        20.68     20.0416     3.1%
rwlock_tryread-empty      2.06572   2.038       1.3%
rwlock_tryread-filler     16.082    15.7182     2.3%
rwlock_write-empty        31.3723   31.1147     0.8%
rwlock_write-filler       41.6492   69.8115   -67.6%
rwlock_trywrite-empty     2.20584   2.32175    -5.3%
rwlock_trywrite-filler    15.7044   15.86      -1.0%
spin_lock-empty           16.7964   16.4342     2.2%
spin_lock-filler          20.6118   20.3916     1.1%
spin_trylock-empty        8.99989   8.98884     0.1%
spin_trylock-filler       16.4732   16.1979     1.7%
sem_wait-empty            15.805    15.7558     0.3%
sem_wait-filler           19.2346   19.2554    -0.1%
sem_trywait-empty         2.06405   2.03789     1.3%
sem_trywait-filler        15.921    15.7884     0.8%
condvar-empty             1385.84   1341.96     3.2%
condvar-filler            1419.82   1343.06     5.4%
consumer_producer-empty   2550.01   2446.33     4.1%
consumer_producer-filler  2709.4    2659.59     1.8%

...still one very bad outlier, and a few of concern.

> BTW, how did you compare the 2 results? I tried compare_bench.py
> and got
>
> Traceback (most recent call last):
>   File "/export/gnu/import/git/gitlab/x86-glibc/benchtests/scripts/compare_bench.py", line 196, in <module>
>     main(args.bench1, args.bench2, args.schema, args.threshold, args.stats)
>   File "/export/gnu/import/git/gitlab/x86-glibc/benchtests/scripts/compare_bench.py", line 165, in main
>     bench1 = bench.parse_bench(bench1, schema)
>   File "/export/ssd/git/gitlab/x86-glibc/benchtests/scripts/import_bench.py", line 137, in parse_bench
>     bench = json.load(benchfile)
>   File "/usr/lib64/python3.10/json/__init__.py", line 293, in load
>     return loads(fp.read(),
>   File "/usr/lib64/python3.10/json/__init__.py", line 346, in loads
>     return _default_decoder.decode(s)
>   File "/usr/lib64/python3.10/json/decoder.py", line 340, in decode
>     raise JSONDecodeError("Extra data", s, end)
> json.decoder.JSONDecodeError: Extra data: line 1 column 18 (char 17)
I did it the old-fashioned way, in a spreadsheet. :-)
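The spreadsheet formula is nothing more than the relative change against the baseline column, with positive meaning the patched time is lower. A minimal sketch in C, assuming that is all the percentage column represents (percent_improvement and print_row are names made up for illustration, not part of benchtests):

```c
#include <stdio.h>

/* Relative improvement in percent: positive when the patched time is
   lower than the baseline, negative when it is a regression.  */
static double
percent_improvement (double base, double patched)
{
  return (base - patched) / base * 100.0;
}

/* Print one row in roughly the layout used in the tables above.  */
static void
print_row (const char *name, double base, double patched)
{
  printf ("%-26s%-10g%-10g%5.1f%%\n", name, base, patched,
          percent_improvement (base, patched));
}
```

For example, percent_improvement (4.73371, 4.54792) comes out at about 3.9, matching the mutex-empty row, and percent_improvement (41.6492, 69.313) at about -66.4, matching rwlock_write-filler.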
I see the same errors you see with compare_bench.py.
Upon further investigation, compare_bench.py expects input in the form
produced by "make bench". The output from running the benchtest directly
is insufficient. Using the respective outputs from
"make BENCHSET=bench-pthread bench":
--
$ ./benchtests/scripts/compare_bench.py --threshold 2 --stats mean A.out B.out
[snip]
+++ thread_create(stack=1024,guard=2)[mean]: (2.15%) from 372674 to 364660
+++ thread_create(stack=2048,guard=1)[mean]: (4.88%) from 377835 to 359396
+++ thread_create(stack=2048,guard=2)[mean]: (3.58%) from 377306 to 363798
+++ pthread_locks(mutex-empty)[mean]: (4.27%) from 4.85936 to 4.65185
--- pthread_locks(mutex_trylock-filler)[mean]: (3.09%) from 16.0579 to 16.5533
--- pthread_locks(rwlock_write-filler)[mean]: (56.90%) from 44.4255 to 69.7047
--- pthread_locks(rwlock_trywrite-empty)[mean]: (6.73%) from 2.17594 to 2.32244
+++ pthread_locks(spin_lock-empty)[mean]: (2.17%) from 16.8086 to 16.4436
--- pthread_locks(spin_trylock-filler)[mean]: (2.34%) from 16.1119 to 16.4896
+++ pthread_locks(consumer_producer-empty)[mean]: (2.94%) from 2531.95 to 2457.48
--
PC