[PATCH v4 3/3] Mutex: Replace trylock by read only while spinning
- From: Kemi Wang <kemi dot wang at intel dot com>
- To: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>, Florian Weimer <fweimer at redhat dot com>, Rical Jason <rj at 2c3t dot io>, Carlos Donell <carlos at redhat dot com>, Glibc alpha <libc-alpha at sourceware dot org>
- Cc: Dave Hansen <dave dot hansen at linux dot intel dot com>, Tim Chen <tim dot c dot chen at intel dot com>, Andi Kleen <andi dot kleen at intel dot com>, Ying Huang <ying dot huang at intel dot com>, Aaron Lu <aaron dot lu at intel dot com>, Lu Aubrey <aubrey dot li at intel dot com>, Kemi Wang <kemi dot wang at intel dot com>
- Date: Mon, 11 Jun 2018 18:18:59 +0800
- Subject: [PATCH v4 3/3] Mutex: Replace trylock by read only while spinning
- References: <1528712339-32299-1-git-send-email-kemi.wang@intel.com>
The pthread adaptive spin mutex spins on the lock for a while before
calling into the kernel to block. But in the current implementation of
spinning, the spinners go straight back to LLL_MUTEX_TRYLOCK (cmpxchg)
while the lock is contended. This is not a good idea on many targets, as
the cmpxchg forces expensive memory synchronization among processors and
penalizes other running threads. For example, it constantly floods the
system with "read for ownership" requests, which are much more expensive
to process than a single read. Thus, we spin using only a relaxed-MO read
until we observe that the lock is no longer acquired, as suggested by
Andi Kleen.
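To illustrate the pattern, here is a minimal standalone sketch using C11
atomics rather than glibc's internal atomic macros; try_acquire and
spin_acquire are illustrative names, not glibc API:

#include <stdatomic.h>
#include <stdbool.h>

static bool
try_acquire (atomic_int *lock)
{
  int expected = 0;
  /* The cmpxchg issues a "read for ownership" request, which is
     expensive under contention, so it is attempted only when the lock
     has been observed to be free.  */
  return atomic_compare_exchange_strong_explicit (lock, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed);
}

/* Returns true if the lock was acquired by spinning, false if the
   caller should fall back to blocking in the kernel.  */
static bool
spin_acquire (atomic_int *lock, int max_cnt)
{
  int cnt = 0;
  while (!try_acquire (lock))
    {
      /* Wait with plain relaxed-MO loads until the lock is observed
         free; this keeps the cache line in shared state instead of
         bouncing it between cores.  */
      do
        if (++cnt >= max_cnt)
          return false;
      while (atomic_load_explicit (lock, memory_order_relaxed) != 0);
    }
  return true;
}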
Performance impact:
No significant mutex performance improvement is expected from this patch
itself, though it probably brings some benefit in scenarios with severe
lock contention. On many architectures the whole system can benefit from
this modification, because it eliminates a number of unnecessary "read
for ownership" requests which stress the cache system by broadcasting
cache line invalidations during spinning.
Meanwhile, it may cause a tiny performance regression in lock-holder
handoff when the lock is acquired via spinning, because the lock state
is now checked before the lock is acquired via trylock.
A similar mechanism has already been implemented for the pthread spin
lock; a condensed sketch of it follows.
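The sketch below is modeled on the generic nptl/pthread_spin_lock.c,
rendered with C11 atomics (glibc's actual implementation uses its
internal atomic_* macros from <atomic.h>); it is a paraphrase, not a
verbatim copy:

#include <stdatomic.h>

static int
spin_lock_sketch (atomic_int *lock)
{
  int val;
  /* Fast path: assume the first attempt usually succeeds.  */
  if (atomic_exchange_explicit (lock, 1, memory_order_acquire) == 0)
    return 0;
  do
    {
      /* Wait with plain relaxed-MO loads until the lock is observed
         free.  glibc also inserts atomic_spin_nop () (e.g. the x86
         PAUSE instruction) in this loop.  */
      do
        val = atomic_load_explicit (lock, memory_order_relaxed);
      while (val != 0);
      /* val == 0 here; try to swing the lock to 1 with acquire
         semantics, retrying the read-only wait on failure.  */
    }
  while (!atomic_compare_exchange_weak_explicit (lock, &val, 1,
                                                 memory_order_acquire,
                                                 memory_order_relaxed));
  return 0;
}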
Test machine:
2-socket Skylake platform, 112 cores, 62 GB RAM
Test case: mutex-adaptive-thread (Contended pthread adaptive spin mutex
with global update)
Usage: make bench BENCHSET=mutex-adaptive-thread
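For reference, a minimal standalone analogue of this workload
(hypothetical code, not the actual benchtests source; the thread and
iteration counts are arbitrary): worker threads update a global counter
under a PTHREAD_MUTEX_ADAPTIVE_NP mutex.

/* Build with: gcc -O2 -pthread contended.c  */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS    1000000

static pthread_mutex_t lock;
static long global_counter;

static void *
worker (void *arg)
{
  (void) arg;
  for (int i = 0; i < ITERS; i++)
    {
      pthread_mutex_lock (&lock);   /* spins before blocking */
      global_counter++;             /* the contended "global update" */
      pthread_mutex_unlock (&lock);
    }
  return NULL;
}

int
main (void)
{
  pthread_t threads[NTHREADS];
  pthread_mutexattr_t attr;

  pthread_mutexattr_init (&attr);
  /* Select the adaptive spin mutex type exercised by the benchmark.  */
  pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
  pthread_mutex_init (&lock, &attr);

  for (int i = 0; i < NTHREADS; i++)
    pthread_create (&threads[i], NULL, worker, NULL);
  for (int i = 0; i < NTHREADS; i++)
    pthread_join (threads[i], NULL);

  printf ("counter = %ld\n", global_counter);
  return 0;
}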
Test result:
+----------------+-----------------+-----------------+------------+
| Configuration  |      Base       |      Head       |  % Change  |
|                | Total iteration | Total iteration | base->head |
+----------------+-----------------+-----------------+------------+
|                |           Critical section size: 1x            |
+----------------+------------------------------------------------+
| 1 thread       |  2.76681e+07    |  2.7965e+07     |   +1.1%    |
| 2 threads      |  3.29905e+07    |  3.55279e+07    |   +7.7%    |
| 3 threads      |  4.38102e+07    |  3.98567e+07    |   -9.0%    |
| 4 threads      |  1.72172e+07    |  2.09498e+07    |   +21.7%   |
| 28 threads     |  1.03732e+07    |  1.05133e+07    |   +1.4%    |
| 56 threads     |  1.06308e+07    |  5.06522e+07    |   +14.6%   |
| 112 threads    |  8.55177e+06    |  1.02954e+07    |   +20.4%   |
+----------------+------------------------------------------------+
|                |           Critical section size: 10x           |
+----------------+------------------------------------------------+
| 1 thread       |  1.57006e+07    |  1.54727e+07    |   -1.5%    |
| 2 threads      |  1.8044e+07     |  1.75601e+07    |   -2.7%    |
| 3 threads      |  1.35634e+07    |  1.46384e+07    |   +7.9%    |
| 4 threads      |  1.21257e+07    |  1.32046e+07    |   +8.9%    |
| 28 threads     |  8.09593e+06    |  1.02713e+07    |   +26.9%   |
| 56 threads     |  9.09907e+06    |  4.16203e+07    |   +16.4%   |
| 112 threads    |  7.09731e+06    |  8.62406e+06    |   +21.5%   |
+----------------+------------------------------------------------+
|                |          Critical section size: 100x           |
+----------------+------------------------------------------------+
| 1 thread       |  2.87116e+06    |  2.89188e+06    |   +0.7%    |
| 2 threads      |  2.23409e+06    |  2.24216e+06    |   +0.4%    |
| 3 threads      |  2.29888e+06    |  2.29964e+06    |   +0.0%    |
| 4 threads      |  2.26898e+06    |  2.21394e+06    |   -2.4%    |
| 28 threads     |  1.03228e+06    |  1.0051e+06     |   -2.6%    |
| 56 threads     |  1.02953e+06    |  1.6344e+07     |   -2.3%    |
| 112 threads    |  1.01615e+06    |  1.00134e+06    |   -1.5%    |
+----------------+------------------------------------------------+
|                |          Critical section size: 1000x          |
+----------------+------------------------------------------------+
| 1 thread       |  316392         |  315635         |   -0.2%    |
| 2 threads      |  302806         |  303469         |   +0.2%    |
| 3 threads      |  298506         |  294281         |   -1.4%    |
| 4 threads      |  292037         |  289945         |   -0.7%    |
| 28 threads     |  155188         |  155250         |   +0.0%    |
| 56 threads     |  190657         |  183106         |   -4.0%    |
| 112 threads    |  210818         |  220342         |   +4.5%    |
+----------------+-----------------+-----------------+------------+
* nptl/pthread_mutex_lock.c: Optimize adaptive spin mutex.
* nptl/pthread_mutex_conf.h: Add READ_ONLY_SPIN macro definition.
* sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c: Enable read-only
spinning for the x86 architecture.
* sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c: Likewise.
ChangeLog:
V3->V4:
a) Make the optimization opt-in, and enable it for the x86 architecture
by default, as suggested by Florian Weimer (see the sketch after this
list).
V2->V3:
a) Drop the idea of blocking spinners if they fail to acquire the lock,
since that idea would not be a universal winner. E.g. when several
threads contend for a lock that protects a small critical section,
probably any thread can acquire the lock via spinning.
b) Fix format issues, AFAIC.
V1->V2: Fix format issues.
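For reference, opting in on another architecture would presumably only
require including the conf header ahead of the generic implementation,
mirroring the x86 wrappers in the diff below (a hypothetical sketch;
<arch> is a placeholder, and real wrappers may need arch-specific
includes as well):

/* sysdeps/unix/sysv/linux/<arch>/pthread_mutex_lock.c (hypothetical).
   nptl/pthread_mutex_conf.h defines READ_ONLY_SPIN, which selects the
   read-only spin path in the generic implementation included next.  */
#include <nptl/pthread_mutex_conf.h>
#include <nptl/pthread_mutex_lock.c>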
Suggested-by: Andi Kleen <andi.kleen@intel.com>
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 nptl/pthread_mutex_conf.h                             |  2 ++
 nptl/pthread_mutex_lock.c                             | 15 +++++++++++++++
 sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c |  1 +
 sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c      |  1 +
 4 files changed, 19 insertions(+)
diff --git a/nptl/pthread_mutex_conf.h b/nptl/pthread_mutex_conf.h
index e5b027c..f2d6ca9 100644
--- a/nptl/pthread_mutex_conf.h
+++ b/nptl/pthread_mutex_conf.h
@@ -28,4 +28,6 @@ struct mutex_config
 
 extern struct mutex_config __mutex_aconf attribute_hidden;
 
+#define READ_ONLY_SPIN 1
+
 #endif
diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 1519c14..26bcebf 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -124,8 +124,14 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
       if (LLL_MUTEX_TRYLOCK (mutex) != 0)
         {
           int cnt = 0;
+#ifdef READ_ONLY_SPIN
+          int val = 0;
+          int max_cnt = MIN (__mutex_aconf.spin_count,
+                             mutex->__data.__spins * 2 + 10);
+#else
           int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
                              mutex->__data.__spins * 2 + 10);
+#endif
           do
             {
               if (cnt++ >= max_cnt)
@@ -133,7 +139,16 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
                 {
                   LLL_MUTEX_LOCK (mutex);
                   break;
                 }
+#ifdef READ_ONLY_SPIN
+              do
+                {
+                  atomic_spin_nop ();
+                  val = atomic_load_relaxed (&mutex->__data.__lock);
+                }
+              while (val != 0 && ++cnt < max_cnt);
+#else
               atomic_spin_nop ();
+#endif
             }
           while (LLL_MUTEX_TRYLOCK (mutex) != 0);
diff --git a/sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c b/sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c
index 967d007..a44c48c 100644
--- a/sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c
+++ b/sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c
@@ -19,4 +19,5 @@
    already elided locks.  */
 #include <elision-conf.h>
+#include <nptl/pthread_mutex_conf.h>
 
 #include <nptl/pthread_mutex_cond_lock.c>
diff --git a/sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c b/sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c
index c23678f..29d20e8 100644
--- a/sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c
+++ b/sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c
@@ -19,4 +19,5 @@
 #include <elision-conf.h>
 #include "force-elision.h"
+#include "nptl/pthread_mutex_conf.h"
 
 #include "nptl/pthread_mutex_lock.c"
--
2.7.4