This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH v7 2/2] Mutex: Replace trylock by read only while spinning
- From: Carlos O'Donell <carlos at redhat dot com>
- To: Kemi Wang <kemi dot wang at intel dot com>, Adhemerval Zanella <adhemerval dot zanella at linaro dot org>, Florian Weimer <fweimer at redhat dot com>, Rical Jason <rj at 2c3t dot io>, Glibc alpha <libc-alpha at sourceware dot org>
- Cc: Dave Hansen <dave dot hansen at linux dot intel dot com>, Tim Chen <tim dot c dot chen at intel dot com>, Andi Kleen <andi dot kleen at intel dot com>, Ying Huang <ying dot huang at intel dot com>, Aaron Lu <aaron dot lu at intel dot com>, Lu Aubrey <aubrey dot li at intel dot com>
- Date: Fri, 6 Jul 2018 14:04:22 -0400
- Subject: Re: [PATCH v7 2/2] Mutex: Replace trylock by read only while spinning
- References: <1530863409-326-1-git-send-email-kemi.wang@intel.com> <1530863409-326-2-git-send-email-kemi.wang@intel.com>
On 07/06/2018 03:50 AM, Kemi Wang wrote:
> The pthread adaptive spin mutex spins on the lock for a while before
> calling into the kernel to block. But, in the current implementation of
> spinning, the spinners go straight back to LLL_MUTEX_TRYLOCK(cmpxchg) when
> the lock is contended, it is not a good idea on many targets as that will
> force expensive memory synchronization among processors and penalize other
> running threads. For example, it constantly floods the system with "read
> for ownership" requests, which are much more expensive to process than a
> single read. Thus, we only use MO read until we observe the lock to not be
> acquired anymore, as suggested by Andi Kleen.
>
> Performance impact:
> It would bring some benefit in the scenarios with severe lock contention on
> many architectures (significant performance improvement is not expected),
> and the whole system performance can benefit from this modification because
> a number of unnecessary "read for ownership" requests which stress the
> cache system by broadcasting cache line invalidity are eliminated during
> spinning.
>
> Meanwhile, it may have some tiny performance regression on the lock holder
> transformation for the case of lock acquisition via spinning gets, because
> the lock state is checked before acquiring the lock via trylock.
>
> Similar mechanism has been implemented for pthread spin lock.
Why should I accept this patch?
You make a strong case about the cost of the expensive memory synchronization.
However, the numbers don't appear to back this up.
If the cost of the synchronization was high, when you add the spinning, why
doesn't it improve performance?
Do you need to do a whole system performance measurement?
As it stands it looks like this patch makes the general use case of 1-4
threads roughly 5% slower across a variety of workloads.
I'm not inclined to include this work unless there is some stronger
justification, or perhaps I have just misunderstood the numbers you
have provided.
--
Cheers,
Carlos.