Lock elision test results
- From: Dominik Vogt <vogt at linux dot vnet dot ibm dot com>
- To: libc-alpha at sourceware dot org
- Date: Fri, 14 Jun 2013 12:26:53 +0200
- Subject: Lock elision test results
- Reply-to: vogt at linux dot vnet dot ibm dot com
Test results on a zEC12 with eight CPUs, using Andi's lock elision
v10 patches ported to z/Architecture. Unfortunately I
cannot provide the source code used for the tests at the moment,
but I can share relative performance data. I plan to create a
collection of test programs that can be used to measure elision
performance in specific cases.
The tests were run on the 13th of June, 2013.
Test 1
======
Setup
-----
Two concurrent threads using pthread mutexes (m1, m2) and
counters c1, c2, c3. All static data structures are allocated
in separate cache lines.
thread 1:
barrier
repeat <n> times
lock m1
lock m2
increment c1
unlock m1
increment c2
repeat <m> times
waste a minimal amount of cpu
unlock m2
signal that thread 1 has finished its work
barrier
thread 2:
barrier
get start timestamp
while thread 1 has not finished
lock m1
increment c3
unlock m1
get end timestamp
Performance is measured as the number of loop iterations completed by
thread 2 divided by the time taken.
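
Since I cannot share the actual test program, here is a minimal C
sketch of what the two thread bodies could look like; the iteration
counts N and M, the assumed 256-byte cache line padding and all
helper names are illustrative only, not the code used for the
measurements:

  /* Minimal sketch of test 1; N, M and all names are illustrative. */
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>
  #include <time.h>

  #define N 1000000          /* <n>: iterations of thread 1 (assumed) */
  #define M 100              /* <m>: busy-wait iterations (assumed)   */

  /* Every mutex and counter gets its own (assumed 256-byte) cache line. */
  static struct { pthread_mutex_t m; } __attribute__((aligned(256))) m1, m2;
  static struct { unsigned long c;  } __attribute__((aligned(256))) c1, c2, c3;

  static pthread_barrier_t barrier;
  static atomic_int t1_done;

  static void *thread1(void *arg)
  {
      pthread_barrier_wait(&barrier);
      for (long i = 0; i < N; i++) {
          pthread_mutex_lock(&m1.m);
          pthread_mutex_lock(&m2.m);
          c1.c++;
          pthread_mutex_unlock(&m1.m);
          c2.c++;
          for (volatile int j = 0; j < M; j++)
              ;                        /* waste a minimal amount of cpu */
          pthread_mutex_unlock(&m2.m);
      }
      atomic_store(&t1_done, 1);       /* signal that thread 1 has finished */
      pthread_barrier_wait(&barrier);
      return NULL;
  }

  static void *thread2(void *arg)
  {
      struct timespec t0, t1;
      unsigned long loops = 0;

      pthread_barrier_wait(&barrier);
      clock_gettime(CLOCK_MONOTONIC, &t0);
      while (!atomic_load(&t1_done)) {
          pthread_mutex_lock(&m1.m);
          c3.c++;
          pthread_mutex_unlock(&m1.m);
          loops++;
      }
      clock_gettime(CLOCK_MONOTONIC, &t1);
      printf("%.0f loops/s\n", loops /
             ((t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9));
      pthread_barrier_wait(&barrier);  /* final barrier, as in thread 1 */
      return NULL;
  }

  int main(void)
  {
      pthread_t t1, t2;

      pthread_mutex_init(&m1.m, NULL);
      pthread_mutex_init(&m2.m, NULL);
      pthread_barrier_init(&barrier, NULL, 2);
      pthread_create(&t1, NULL, thread1, NULL);
      pthread_create(&t2, NULL, thread2, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      return 0;
  }
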
Test execution
--------------
The test is run ten times each with four different versions and
setups of glibc:
(1) current glibc without elision patches (2506109403de)
(2) glibc-2.15
(3) current glibc (1) plus elision patches, GLIBC_PTHREAD_MUTEX=none
(4) current glibc (1) plus elision patches, GLIBC_PTHREAD_MUTEX=elision
The best results of all runs for each glibc setup are compared.
The result for (1) is the reference (i.e. 100%). Higher values
mean higher relative performance.
Result
------
(1) unpatched : 100.00%
(2) old glibc : 101.83%
(3) elision off: 77.87%
(4) elision on : 29.37%
The abort ratio in (4) is >= 75% on thread 1 and < 1% on thread 2.
Test 2 (nested locks)
======
Setup
-----
Three concurrent threads using pthread mutexes (m1, ..., m10) and
counters c1, ..., c10. All static data structures are allocated
in separate cache lines.
all threads:
barrier
take start timestamp (only thread 1)
repeat <n> times
lock m1, increment c1
lock m2, increment c2
...
lock m10, increment c10
unlock m10
unlock m9
...
unlock m1
barrier
take end timestamp (only thread 1)
Performance is measured as the inverse of the time taken on thread
1.
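
A corresponding sketch of the per-thread loop (again illustrative
only; main(), the barrier initialisation for three threads and the
timestamping on thread 1 are omitted and would follow the test 1
sketch above):

  #include <pthread.h>

  #define N      1000000     /* <n>: assumed iteration count */
  #define NMUTEX 10

  /* mutexes m1..m10 and counters c1..c10, one (assumed 256-byte) line each */
  static struct { pthread_mutex_t m; } __attribute__((aligned(256))) mtx[NMUTEX];
  static struct { unsigned long c;  } __attribute__((aligned(256))) cnt[NMUTEX];

  static pthread_barrier_t barrier;

  static void *worker(void *arg)
  {
      pthread_barrier_wait(&barrier);           /* thread 1 takes the start timestamp */
      for (long i = 0; i < N; i++) {
          for (int k = 0; k < NMUTEX; k++) {    /* lock m1..m10, increment c1..c10 */
              pthread_mutex_lock(&mtx[k].m);
              cnt[k].c++;
          }
          for (int k = NMUTEX - 1; k >= 0; k--) /* unlock in reverse order, m10..m1 */
              pthread_mutex_unlock(&mtx[k].m);
      }
      pthread_barrier_wait(&barrier);           /* thread 1 takes the end timestamp */
      return NULL;
  }
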
Test execution
--------------
Identical to test 1.
Result
------
(1) unpatched : 100.00%
(2) old glibc : 134.35%
(3) elision off: 56.45%
(4) elision on : 31.31%
The abort ratio in (4) in all threads is between 5% and 10%.
Test 3 (cacheline pingpong)
======
Setup
-----
Four concurrent threads using a pthread mutex m and counters c1, ..., c4.
All static data structures are allocated in separate cache lines.
thread <n>:
barrier
take start timestamp (only thread 1)
barrier
repeat <m> times
lock m
increment c<n>
unlock m
barrier
take end timestamp (only thread 1)
Performance is measured as the inverse of the time taken on thread
1.
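
A sketch of the per-thread body (illustrative only; main() would
create four threads with indices 0..3 and initialise the barrier for
four participants, following the test 1 sketch):

  #include <pthread.h>
  #include <stdio.h>
  #include <time.h>

  #define M 1000000          /* <m>: assumed iteration count */
  #define NTHREADS 4

  static struct { pthread_mutex_t mtx; } __attribute__((aligned(256))) m;
  static struct { unsigned long c;    } __attribute__((aligned(256))) cnt[NTHREADS];

  static pthread_barrier_t barrier;

  static void *worker(void *arg)
  {
      int n = (int)(long)arg;          /* thread index 0..3 */
      struct timespec t0, t1;

      pthread_barrier_wait(&barrier);
      if (n == 0)
          clock_gettime(CLOCK_MONOTONIC, &t0);  /* start timestamp, thread 1 only */
      pthread_barrier_wait(&barrier);

      /* All threads hammer the same mutex, but each increments only its
         own cache-line-padded counter, so the mutex itself is the only
         shared write. */
      for (long i = 0; i < M; i++) {
          pthread_mutex_lock(&m.mtx);
          cnt[n].c++;
          pthread_mutex_unlock(&m.mtx);
      }

      pthread_barrier_wait(&barrier);
      if (n == 0) {
          clock_gettime(CLOCK_MONOTONIC, &t1);  /* end timestamp, thread 1 only */
          printf("time: %.3fs\n", (t1.tv_sec - t0.tv_sec)
                                  + (t1.tv_nsec - t0.tv_nsec) / 1e9);
      }
      return NULL;
  }
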
Test execution
--------------
Identical to test 1.
Result
------
(1) unpatched : 100.00%
(2) old glibc : 103.94%
(3) elision off: 76.25%
(4) elision on : 373.38%
The abort ratio in (4) in all threads is < 0.01%.
Ciao
Dominik ^_^ ^_^
--
Dominik Vogt
IBM Germany