
Re: [PATCH v6 3/3] Mutex: Replace trylock by read only while spinning



On 2018-07-03 21:16, H.J. Lu wrote:
> On Tue, Jul 3, 2018 at 5:45 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Mon, Jul 2, 2018 at 1:27 AM, Kemi Wang <kemi.wang@intel.com> wrote:
>>> The pthread adaptive spin mutex spins on the lock for a while before
>>> calling into the kernel to block. However, in the current
>>> implementation the spinners go straight back to LLL_MUTEX_TRYLOCK
>>> (cmpxchg) while the lock is contended. That is a bad idea on many
>>> targets, because it forces expensive memory synchronization among
>>> processors and penalizes other running threads: it constantly floods
>>> the system with "read for ownership" requests, which are much more
>>> expensive to process than a single read. Thus, spin with MO reads
>>> only until we observe that the lock is no longer acquired, as
>>> suggested by Andi Kleen.
>>>
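For illustration, here is a minimal standalone sketch of the two
spinning styles, written with C11 atomics instead of glibc's internal
atomic macros; the function names are hypothetical and this is not the
patch itself:

    #include <stdatomic.h>

    /* Pure trylock spinning: every iteration issues a cmpxchg, i.e. a
       "read for ownership" request for the lock's cache line.  */
    static void
    spin_cmpxchg_only (atomic_int *lock)
    {
      int expected;
      do
        expected = 0;
      while (!atomic_compare_exchange_weak (lock, &expected, 1));
    }

    /* Read-only spinning: wait with relaxed loads until the lock is
       observed free, and only then attempt the cmpxchg.  */
    static void
    spin_read_only (atomic_int *lock)
    {
      int expected = 0;
      while (!atomic_compare_exchange_weak (lock, &expected, 1))
        {
          /* Plain loads keep the cache line in a shared state instead
             of bouncing its ownership between cores.  */
          while (atomic_load_explicit (lock, memory_order_relaxed) != 0)
            ;
          expected = 0;
        }
    }
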
>>> Performance impact:
>>> No dramatic mutex performance improvement is expected from this patch
>>> by itself, but it should bring some benefit in scenarios with severe
>>> lock contention. On many architectures the whole system can benefit,
>>> because spinning no longer stresses the cache system with unnecessary
>>> "read for ownership" requests that broadcast cache line
>>> invalidations.
>>> Meanwhile, there may be a tiny performance regression in lock-holder
>>> handoff when the lock is acquired via spinning, because the lock
>>> state is now checked before the lock is acquired via trylock.
>>>
>>> A similar mechanism has already been implemented for pthread spin locks.
>>>
>>> Test machine:
>>> 2-socket Skylake platform, 112 cores, 62 GB RAM
>>>
>>> Test case: mutex-adaptive-thread (Contended pthread adaptive spin mutex
>>> with global update)
>>> Usage: make bench BENCHSET=mutex-adaptive-thread
>>> Test result:
>>> +----------------+-----------------+-----------------+------------+
>>> |  Configuration |      Base       |      Head       | % Change   |
>>> |                | Total iteration | Total iteration | base->head |
>>> +----------------+-----------------+-----------------+------------+
>>> |                |           Critical section size: 1x            |
>>> +----------------+------------------------------------------------+
>>> |1 thread        |  2.76681e+07    |  2.7965e+07     |   +1.1%    |
>>> |2 threads       |  3.29905e+07    |  3.55279e+07    |   +7.7%    |
>>> |3 threads       |  4.38102e+07    |  3.98567e+07    |   -9.0%    |
>>> |4 threads       |  1.72172e+07    |  2.09498e+07    |   +21.7%   |
>>> |28 threads      |  1.03732e+07    |  1.05133e+07    |   +1.4%    |
>>> |56 threads      |  1.06308e+07    |  5.06522e+07    |   +14.6%   |
>>> |112 threads     |  8.55177e+06    |  1.02954e+07    |   +20.4%   |
>>> +----------------+------------------------------------------------+
>>> |                |           Critical section size: 10x           |
>>> +----------------+------------------------------------------------+
>>> |1 thread        |  1.57006e+07    |  1.54727e+07    |   -1.5%    |
>>> |2 threads       |  1.8044e+07     |  1.75601e+07    |   -2.7%    |
>>> |3 threads       |  1.35634e+07    |  1.46384e+07    |   +7.9%    |
>>> |4 threads       |  1.21257e+07    |  1.32046e+07    |   +8.9%    |
>>> |28 threads      |  8.09593e+06    |  1.02713e+07    |   +26.9%   |
>>> |56 threads      |  9.09907e+06    |  4.16203e+07    |   +16.4%   |
>>> |112 threads     |  7.09731e+06    |  8.62406e+06    |   +21.5%   |
>>> +----------------+------------------------------------------------+
>>> |                |           Critical section size: 100x          |
>>> +----------------+------------------------------------------------+
>>> |1 thread        |  2.87116e+06    |  2.89188e+06    |   +0.7%    |
>>> |2 threads       |  2.23409e+06    |  2.24216e+06    |   +0.4%    |
>>> |3 threads       |  2.29888e+06    |  2.29964e+06    |   +0.0%    |
>>> |4 threads       |  2.26898e+06    |  2.21394e+06    |   -2.4%    |
>>> |28 threads      |  1.03228e+06    |  1.0051e+06     |   -2.6%    |
>>> |56 threads      |  1.02953e+06    |  1.6344e+07     |   -2.3%    |
>>> |112 threads     |  1.01615e+06    |  1.00134e+06    |   -1.5%    |
>>> +----------------+------------------------------------------------+
>>> |                |           Critical section size: 1000x         |
>>> +----------------+------------------------------------------------+
>>> |1 thread        |  316392         |  315635         |   -0.2%    |
>>> |2 threads       |  302806         |  303469         |   +0.2%    |
>>> |3 threads       |  298506         |  294281         |   -1.4%    |
>>> |4 threads       |  292037         |  289945         |   -0.7%    |
>>> |28 threads      |  155188         |  155250         |   +0.0%    |
>>> |56 threads      |  190657         |  183106         |   -4.0%    |
>>> |112 threads     |  210818         |  220342         |   +4.5%    |
>>> +----------------+-----------------+-----------------+------------+
>>>
>>>     * nptl/pthread_mutex_lock.c: Optimize the adaptive spin mutex.
>>>     * nptl/pthread_mutex_timedlock.c: Likewise.
>>>     * sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c: Enable
>>>       read-only spinning for the x86 architecture.
>>>     * sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c: Likewise.
>>>     * sysdeps/unix/sysv/linux/x86/pthread_mutex_timedlock.c: Likewise.
>>>
>>> ChangeLog:
>>>     V5->V6:
>>>     no change
>>>
>>>     V4->V5:
>>>     a) Make the optimization work for pthread_mutex_timedlock() on
>>>     the x86 architecture.
>>>     b) Move the READ_ONLY_SPIN macro definition from this patch to
>>>     the first patch, which adds the glibc.mutex.spin_count tunable
>>>     entry.
>>>
>>>     V3->V4:
>>>     a) Make the optimization opt-in, and enable it for the x86
>>>     architecture by default, as suggested by Florian Weimer.
>>>
>>>     V2->V3:
>>>     a) Drop the idea of blocking spinners that fail to acquire the
>>>        lock, since it would not be a universal winner. E.g., when
>>>        several threads contend for a lock protecting a small critical
>>>        section, probably any thread can acquire the lock via spinning.
>>>     b) Fix format issues, as far as I can tell.
>>>
>>>     V1->V2: Fix format issues.
>>>
>>> Suggested-by: Andi Kleen <andi.kleen@intel.com>
>>> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
>>> ---
>>>  nptl/pthread_mutex_lock.c                             | 15 +++++++++++++++
>>>  nptl/pthread_mutex_timedlock.c                        | 15 +++++++++++++++
>>>  sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c |  1 +
>>>  sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c      |  1 +
>>>  sysdeps/unix/sysv/linux/x86/pthread_mutex_timedlock.c |  1 +
>>>  5 files changed, 33 insertions(+)
>>>
>>> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
>>> index 1519c14..9a7b5c2 100644
>>> --- a/nptl/pthread_mutex_lock.c
>>> +++ b/nptl/pthread_mutex_lock.c
>>> @@ -133,7 +139,16 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>>>                   LLL_MUTEX_LOCK (mutex);
>>>                   break;
>>>                 }
>>> +#ifdef READ_ONLY_SPIN
>>> +             do
>>> +               {
>>> +                 atomic_spin_nop ();
>>> +                 val = atomic_load_relaxed (&mutex->__data.__lock);
>>> +               }
>>> +             while (val != 0 && ++cnt < max_cnt);
>>> +#else
>>>               atomic_spin_nop ();
>>> +#endif
>>>             }
>>>           while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>>>
>>> diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c
>>> index 66efd39..dcaeca8 100644
>>> --- a/nptl/pthread_mutex_timedlock.c
>>> +++ b/nptl/pthread_mutex_timedlock.c
>>> @@ -126,7 +132,16 @@ __pthread_mutex_timedlock (pthread_mutex_t *mutex,
>>>                                           PTHREAD_MUTEX_PSHARED (mutex));
>>>                   break;
>>>                 }
>>> +#ifdef READ_ONLY_SPIN
>>> +             do
>>> +               {
>>> +                 atomic_spin_nop ();
>>> +                 val = atomic_load_relaxed (&mutex->__data.__lock);
>>> +               }
>>> +             while (val != 0 && ++cnt < max_cnt);
>>> +#else
>>>               atomic_spin_nop ();
>>> +#endif
>>>             }
>>>           while (lll_trylock (mutex->__data.__lock) != 0);
>>>
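For context, the surrounding adaptive-spin loop that both hunks modify
looks roughly like this (paraphrased from the glibc sources; the exact
bounds and field names may differ between versions):

    /* Paraphrased sketch, not a verbatim copy of the glibc code.  */
    if (LLL_MUTEX_TRYLOCK (mutex) != 0)
      {
        int cnt = 0;
        int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
                           mutex->__data.__spins * 2 + 10);
        do
          {
            if (cnt++ >= max_cnt)
              {
                /* Spun long enough; block in the kernel.  */
                LLL_MUTEX_LOCK (mutex);
                break;
              }
            /* The patch replaces this single pause with the read-only
               wait loop shown in the hunks above.  */
            atomic_spin_nop ();
          }
        while (LLL_MUTEX_TRYLOCK (mutex) != 0);
        /* Adapt the spin count toward what this acquisition needed.  */
        mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
      }
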
>>
>> Please define a generic
>>
>> static __always_inline void
>> atomic_spin_nop_with_limit (int cnt, int max_cnt, pthread_mutex_t *mutex)
>> {
>>   atomic_spin_nop ();
>> }
>>
>> and an x86 version:
>>
>> static __always_inline void
>> atomic_spin_nop_with_limit (int cnt, int max_cnt, pthread_mutex_t *mutex)
>> {
>>   int val;
>>   do
>>     {
>>       atomic_spin_nop ();
>>       val = atomic_load_relaxed (&mutex->__data.__lock);
>>     }
>>   while (val != 0 && ++cnt < max_cnt);
>> }
>>
>> in a standalone patch.
>>
> 
> Please take a look at
> 
> https://github.com/hjl-tools/glibc/tree/hjl/spin/master
> 

Reviewed. Thanks for refining it.

> I added:
> 
> static __always_inline void
> atomic_spin_lock (pthread_spinlock_t *lock, int cnt, int max_cnt)
> {
>   int val = 0;
>   do
>     {
>       atomic_spin_nop ();
>       val = atomic_load_relaxed (lock);
>     }
>   while (val != 0 && ++cnt < max_cnt);
> }
> 
> 
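A minimal usage sketch, assuming the helper is meant to replace the
#ifdef READ_ONLY_SPIN wait loop from the diffs above (the call site and
the cast are assumptions, not code from the posted branch):

    /* Inside the adaptive spin loop, after a failed trylock: wait with
       relaxed loads until the lock is observed free or the spin budget
       is exhausted, then loop back to LLL_MUTEX_TRYLOCK.  */
    atomic_spin_lock ((pthread_spinlock_t *) &mutex->__data.__lock,
                      cnt, max_cnt);

One design note: cnt is passed by value, so the caller's counter does
not advance across the call; the outer loop still needs its own bound
to guarantee it eventually falls back to LLL_MUTEX_LOCK.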

