This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH v2 0/5] Add a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP
- From: kemi <kemi dot wang at intel dot com>
- To: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>, Florian Weimer <fweimer at redhat dot com>, Rical Jason <rj at 2c3t dot io>, Carlos Donell <carlos at redhat dot com>, Glibc alpha <libc-alpha at sourceware dot org>
- Cc: Dave Hansen <dave dot hansen at linux dot intel dot com>, Tim Chen <tim dot c dot chen at intel dot com>, Andi Kleen <andi dot kleen at intel dot com>, Ying Huang <ying dot huang at intel dot com>, Aaron Lu <aaron dot lu at intel dot com>, Lu Aubrey <aubrey dot li at intel dot com>
- Date: Thu, 2 Aug 2018 08:34:42 +0800
- Subject: Re: [PATCH v2 0/5] Add a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP
- References: <email@example.com>
Hi, gentle maintainers,
Could we get this patch series reviewed next? Thanks.
On 2018-07-13 14:52, Kemi Wang wrote:
> The pthread adaptive mutex is designed around the assumption that, in most
> cases, the lock will be released after only a few CPU cycles. This holds
> especially when the application code is well-behaved and the critical
> section is short but highly contended. Thus, spinning on the lock for a
> while instead of immediately calling into the kernel to block can help
> improve system performance.
> But the current adaptive mutex has two main problems. The first is
> fairness: multiple spinners contend for the lock simultaneously, and a
> spinner has no guarantee of acquiring the lock no matter how long it has
> been waiting. The other is heavy cache line bouncing: since the cache
> line holding the lock is shared among all of the spinners, each spinner
> tries to acquire the lock via a cmpxchg instruction when the lock is
> released, which constantly floods the system with "read-for-ownership"
> requests. As a result, there is a lot of cache line bouncing on a large
> system with many CPUs.
> To address the problems mentioned above, this series proposes queuing
> mutex spinners with an MCS lock, following the mutex implementation in
> the kernel, and introduces a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP.
> Compared to the adaptive mutex (read only while spinning), the test
> result on an Intel 2-socket Skylake platform shows a significant
> performance improvement (see the first patch for details).
> Though queuing spinners with an MCS lock helps improve the performance of
> the adaptive mutex when multiple threads contend for a lock, people
> probably want to know how the queue spinner mutex performs compared to
> the other widely known lock disciplines, namely the pthread spin lock and
> the pthread mutex. The details of the test results are in the first
> patch; briefly, the conclusions are summarized below:
> a) With little lock contention, the spin lock performs best, the queue
> spinner mutex performs similarly to the adaptive mutex, and both perform
> a little better than the pthread mutex.
> b) Under severe lock contention with a large number of CPUs and a small
> critical section (less than 1000ns), most lock acquisitions are obtained
> by spinning, and the queue spinner mutex performs much better than the
> spin lock and the adaptive mutex. This is because the overhead of heavy
> cache line bouncing plays a big role in lock performance.
> c) As the size of the critical section increases, the performance
> advantage of the queue spinner mutex gradually shrinks. This is because
> the overhead of cache line bouncing is no longer the bottleneck of lock
> performance; instead, the overhead of futex_wait and futex_wake plays a
> big role. When the critical section grows to 1ms, even the latency of
> the futex syscall becomes negligible compared to the total time of lock
> acquisition.
> As we can see above, the queue spinner mutex performs well across these
> kinds of workloads, but there is a potential risk in using this type of
> mutex: the lock may be handed to the next spinner in the queue while that
> spinner is not running (its CPU has been scheduled to run another task).
> The other spinners then have to keep waiting in the queue, which can
> collapse lock performance. To emulate this case, we run two identical
> processes simultaneously; each process has 28 threads, and each thread
> sets its CPU affinity to an individual CPU according to its thread id.
> Thus, CPUs [0~27] are each subscribed by two threads. In the worst case
> (s=1000ns, t=6000ns), lock performance is reduced by 58.1%
> (2205245->924263).
> Therefore, the queue spinner mutex should be used carefully, by
> applications that pursue fairness and performance without oversubscribing
> CPU resources, e.g. containers in public cloud infrastructure.
> To overcome this performance collapse of the legacy MCS lock when CPUs
> are oversubscribed, the second patch proposes an enhanced MCS lock
> algorithm that allows a spinner to spin on the mutex lock without having
> to be woken up by its predecessor. With that change, this new mutex type
> can be widely and safely used in many kinds of usage scenarios.
> a) Propose an enhanced MCS lock algorithm to avoid potential lock
> performance collapse.
> b) Add a link of the paper which introduces original MCS lock.
> c) Add the commit id for mutex implementation with MCS lock in kernel.
>  Mellor-Crummey, John M., and Michael L. Scott. "Algorithms for scalable
> synchronization on shared-memory multiprocessors." ACM Transactions on
> Computer Systems (TOCS) 9.1 (1991): 21-65.
>  Commit id for mutex with MCS lock in kernel:
> Kemi Wang (5):
> Mutex: Queue spinners to reduce cache line bouncing and ensure
> MCS lock: Enhance legacy MCS lock algorithm
> Mutex: add unit tests for new type
> BENCHMARK: add a benchmark for testing new type of mutex
> Manual: Add manual for pthread mutex
> benchtests/Makefile | 4 +-
> benchtests/bench-mutex-adaptive-thread.c | 8 ++-
> benchtests/bench-mutex-queuespinner-thread.c | 21 +++++++
> manual/Makefile | 2 +-
> manual/mutex.texi | 68 +++++++++++++++++++++
> nptl/Makefile | 8 +--
> nptl/allocatestack.c | 2 +-
> nptl/descr.h | 26 ++++----
> nptl/mcs_lock.c | 88 ++++++++++++++++++++++++++++
> nptl/mcs_lock.h | 21 +++++++
> nptl/nptl-init.c | 2 +-
> nptl/pthreadP.h | 2 +-
> nptl/pthread_mutex_init.c | 3 +-
> nptl/pthread_mutex_lock.c | 40 ++++++++++++-
> nptl/pthread_mutex_timedlock.c | 35 ++++++++++-
> nptl/pthread_mutex_trylock.c | 5 +-
> nptl/pthread_mutex_unlock.c | 7 ++-
> nptl/pthread_mutexattr_settype.c | 2 +-
> nptl/tst-initializers1.c | 11 ++--
> nptl/tst-mutex5b.c | 2 +
> nptl/tst-mutex7b.c | 2 +
> sysdeps/nptl/bits/thread-shared-types.h | 22 +++++--
> sysdeps/nptl/pthread.h | 15 +++--
> sysdeps/unix/sysv/linux/hppa/pthread.h | 4 ++
> 24 files changed, 350 insertions(+), 50 deletions(-)
> create mode 100644 benchtests/bench-mutex-queuespinner-thread.c
> create mode 100644 manual/mutex.texi
> create mode 100644 nptl/mcs_lock.c
> create mode 100644 nptl/mcs_lock.h
> create mode 100644 nptl/tst-mutex5b.c
> create mode 100644 nptl/tst-mutex7b.c