This is the mail archive of the libc-alpha mailing list for the glibc project.
RE: [PATCH v7 2/2] Mutex: Replace trylock by read only while spinning
- From: "Wang, Kemi" <kemi dot wang at intel dot com>
- To: 'Carlos O'Donell' <carlos at redhat dot com>, Adhemerval Zanella <adhemerval dot zanella at linaro dot org>, Florian Weimer <fweimer at redhat dot com>, "Rical Jason" <rj at 2c3t dot io>, Glibc alpha <libc-alpha at sourceware dot org>
- Cc: Dave Hansen <dave dot hansen at linux dot intel dot com>, "Chen, Tim C" <tim dot c dot chen at intel dot com>, "Kleen, Andi" <andi dot kleen at intel dot com>, "Huang, Ying" <ying dot huang at intel dot com>, "Lu, Aaron" <aaron dot lu at intel dot com>, "Li, Aubrey" <aubrey dot li at intel dot com>
- Date: Sun, 8 Jul 2018 14:04:42 +0000
- Subject: RE: [PATCH v7 2/2] Mutex: Replace trylock by read only while spinning
- References: <email@example.com> <firstname.lastname@example.org> <email@example.com>
Thanks for your review and your questions.
> Why should I accept this patch?
The current implementation spins on the lock using CAS (compare-and-swap)
even while the lock is held, instead of the approach we call test-and-CAS.
The former is very unfriendly to the uncore because it constantly floods
the system with "read for ownership" requests, which are much more expensive
to process than a single read.
Andi Kleen may have more input here.
> You make a strong case about the cost of the expensive memory synchronization.
> However, the numbers don't appear to back this up.
As I have said in the commit log, significant performance improvement is not expected.
> If the cost of the synchronization was high, when you add the spinning, why doesn't it improve performance?
To simplify, assume the time spent in the critical section and in the
non-critical section is constant (I used a lock delay and an unlock delay to
emulate the workload); lock performance is then mainly determined by the
latency of the lock-holder transition.
This patch reduces meaningless cache line traffic while the lock is *held*; in
other words, it *optimizes* the period during which the lock is held. So the
test result is as expected.
However, if lock contention is severe, too many "read for ownership" requests
may lengthen the latency of the lock-holder transition in the cache coherency
system; in that case the patch improves system performance, as we can see in
the test results with 28/56/112 threads.
Please note that once the critical section grows beyond a certain point, lock
performance is mainly determined by lock acquisition via futex_wake. In that
case the spin-count threshold affects lock performance, and this patch
performs similarly to the original version because we use the same
threshold (100).
> Do you need to do a whole system performance measurement?
I thought the data posted here is good enough to demonstrate the effectiveness of this patch.
But if you insist, I will try to do something to figure it out.
> As it stands it looks like this patch makes the general use case of 1-4 threads roughly 5% slower across a variety of workloads.
No, this patch does not change the fast path, so it should not affect general
lock performance (no contention) in theory.
Also, I don't know why you say this patch makes the general case roughly 5%
slower; on the contrary, it performs a little better than the original
version in most cases.
Further, this patch also makes the use case of 5-10 threads better than the
original in my tests; maybe I should post those numbers in the commit log?
> I'm not inclined to include this work unless there is some stronger justification, or perhaps I have just misunderstood the numbers you have provided.
I am not sure whether my reply answers your questions.
If anything is still in doubt, feel free to raise it and I will try to figure it out. Thanks
From: Carlos O'Donell [mailto:firstname.lastname@example.org]
Sent: Saturday, July 7, 2018 2:04 AM
To: Wang, Kemi <email@example.com>; Adhemerval Zanella <firstname.lastname@example.org>; Florian Weimer <email@example.com>; Rical Jason <firstname.lastname@example.org>; Glibc alpha <email@example.com>
Cc: Dave Hansen <firstname.lastname@example.org>; Chen, Tim C <email@example.com>; Kleen, Andi <firstname.lastname@example.org>; Huang, Ying <email@example.com>; Lu, Aaron <firstname.lastname@example.org>; Li, Aubrey <email@example.com>
Subject: Re: [PATCH v7 2/2] Mutex: Replace trylock by read only while spinning
On 07/06/2018 03:50 AM, Kemi Wang wrote:
> The pthread adaptive spin mutex spins on the lock for a while before
> calling into the kernel to block. But, in the current implementation
> of spinning, the spinners go straight back to
> LLL_MUTEX_TRYLOCK (cmpxchg) when the lock is contended, which is not
> a good idea on many targets as it forces expensive memory
> synchronization among processors and penalizes other running threads.
> For example, it constantly floods the system with "read for ownership"
> requests, which are much more expensive to process than a single read.
> Thus, we only use MO read until we observe the lock to not be acquired anymore, as suggested by Andi Kleen.
> Performance impact:
> It would bring some benefit in the scenarios with severe lock
> contention on many architectures (significant performance improvement
> is not expected), and the whole system performance can benefit from
> this modification because a number of unnecessary "read for ownership"
> requests which stress the cache system by broadcasting cache line
> invalidity are eliminated during spinning.
> Meanwhile, it may have a tiny performance regression in the lock
> holder transition for the case where lock acquisition via spinning
> succeeds, because the lock state is checked before acquiring the lock via trylock.
> A similar mechanism has already been implemented for the pthread spin lock.
Why should I accept this patch?
You make a strong case about the cost of the expensive memory synchronization.
However, the numbers don't appear to back this up.
If the cost of the synchronization was high, when you add the spinning, why doesn't it improve performance?
Do you need to do a whole system performance measurement?
As it stands it looks like this patch makes the general use case of 1-4 threads roughly 5% slower across a variety of workloads.
I'm not inclined to include this work unless there is some stronger justification, or perhaps I have just misunderstood the numbers you have provided.