This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH 2/3] Mutex: Only read while spinning
- From: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
- To: kemi <kemi dot wang at intel dot com>, libc-alpha at sourceware dot org
- Cc: Andi Kleen <andi dot kleen at intel dot com>, "Chen, Tim C" <tim dot c dot chen at intel dot com>, Dave <dave dot hansen at linux dot intel dot com>, "ying.huang" <ying dot huang at intel dot com>, aaron <aaron dot lu at intel dot com>, aubrey <aubrey dot li at intel dot com>
- Date: Wed, 11 Apr 2018 10:28:13 -0300
- Subject: Re: [PATCH 2/3] Mutex: Only read while spinning
- References: <1522394093-9835-1-git-send-email-kemi.wang@intel.com> <1522394093-9835-2-git-send-email-kemi.wang@intel.com> <d45dd71a-0ab5-47b1-c1ca-1e4fe39a58d1@linaro.org> <ddd6c708-3fdd-1081-4070-e6ed351094f9@intel.com> <14c4945b-f9f8-51b6-32c9-ce3049493dab@linaro.org> <11425ecd-b51e-8396-e3d7-612f7767c870@intel.com>
On 09/04/2018 22:46, kemi wrote:
>
>
> On 10 Apr 2018 04:52, Adhemerval Zanella wrote:
>>
>>
>> On 08/04/2018 05:28, kemi wrote:
>>>
>>>
>>> On 06 Apr 2018 04:55, Adhemerval Zanella wrote:
>>>>
>>>>
>>>> On 30/03/2018 04:14, Kemi Wang wrote:
>>>>> The pthread adaptive spin mutex spins on the lock for a while before going
>>>>> to sleep. While the lock is contended and we need to wait, going straight
>>>>> back to LLL_MUTEX_TRYLOCK (cmpxchg) is not a good idea on many targets, as
>>>>> that will force expensive memory synchronization among processors and
>>>>> penalize other running threads. For example, it constantly floods the
>>>>> system with "read for ownership" requests, which are much more expensive to
>>>>> process than a single read. Thus, we only use an MO read until we observe that
>>>>> the lock is no longer acquired, as suggested by Andi Kleen (a standalone sketch
>>>>> of this read-while-spinning idea follows the results table below).
>>>>>
>>>>> Test machine:
>>>>> 2-socket Skylake platform, 112 cores with 62G RAM
>>>>>
>>>>> Test case: Contended pthread adaptive spin mutex with global update
>>>>> each thread of the workload does:
>>>>> a) Lock the mutex (adaptive spin type)
>>>>> b) Global variable increment
>>>>> c) Unlock the mutex
>>>>> in a loop until timeout, and the main thread reports the total iteration
>>>>> number of all the threads in one second.
>>>>>
>>>>> This test case is the same as Will-it-scale.pthread_mutex3 except that the mutex type is
>>>>> modified to PTHREAD_MUTEX_ADAPTIVE_NP.
>>>>> github: https://github.com/antonblanchard/will-it-scale.git
>>>>>
>>>>> nr_threads base head(SPIN_COUNT=10) head(SPIN_COUNT=1000)
>>>>> 1 51644585 51307573(-0.7%) 51323778(-0.6%)
>>>>> 2 7914789 10011301(+26.5%) 9867343(+24.7%)
>>>>> 7 1687620 4224135(+150.3%) 3430504(+103.3%)
>>>>> 14 1026555 3784957(+268.7%) 1843458(+79.6%)
>>>>> 28 962001 2886885(+200.1%) 681965(-29.1%)
>>>>> 56 883770 2740755(+210.1%) 364879(-58.7%)
>>>>> 112 1150589 2707089(+135.3%) 415261(-63.9%)
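To illustrate the read-while-spinning idea from the commit message outside of the glibc internals (the actual patched loop is quoted further down in this message), here is a minimal, hypothetical sketch using C11 atomics that contrasts spinning on the atomic operation itself with spinning on a plain relaxed load; the names and lock layout are illustrative, not the glibc code:

---
#include <stdatomic.h>

static atomic_int lock_word;   /* 0 = free, 1 = held.  */

/* Spin by retrying the atomic RMW directly: every iteration requests the
   cache line in exclusive state ("read for ownership"), even while the
   lock is still held, disturbing the owner and the other spinners.  */
void
spin_on_cmpxchg (void)
{
  int expected;
  do
    expected = 0;
  while (!atomic_compare_exchange_weak (&lock_word, &expected, 1));
}

/* Spin on a relaxed load and attempt the RMW only once the lock has been
   observed free: while waiting, all spinners share the line read-only.  */
void
spin_on_read (void)
{
  for (;;)
    {
      while (atomic_load_explicit (&lock_word, memory_order_relaxed) != 0)
        ;  /* An atomic_spin_nop ()-style pause would go here.  */
      int expected = 0;
      if (atomic_compare_exchange_weak (&lock_word, &expected, 1))
        break;
    }
}
---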
>>>>
>>>> pthread_mutex3 basically performs repeated updates of a global variable synchronized
>>>> with a mutex, so if I am reading the benchmark correctly, a higher value means
>>>> less contention. I also assume you used the 'threads' value in this table.
>>>>
>>>
>>> Yes. I used the multi-threaded mode for testing.
>>>
>>>> I checked on a 64-core aarch64 machine to see what kind of improvement, if
>>>> any, one would get with this change:
>>>>
>>>> nr_threads base head(SPIN_COUNT=10) head(SPIN_COUNT=1000)
>>>> 1 27566206 28778254 (4.211680) 28778467 (4.212389)
>>>> 2 8498813 7777589 (-9.273105) 7806043 (-8.874791)
>>>> 7 5019434 2869629 (-74.915782) 3307812 (-51.744839)
>>>> 14 4379155 2906255 (-50.680343) 2825041 (-55.012087)
>>>> 28 4397464 3261094 (-34.846282) 3259486 (-34.912805)
>>>> 56 4020956 3898386 (-3.144122) 4038505 (0.434542)
>>>>
>>>
>>> Thanks for testing it on an aarch64 machine. That's a great supplement for us.
>>>
>>>> So I think this change should be platform-specific.
>>>>
>>>
>>> It may depend on the platform, especially on the cmpxchg, MO read, and pause instructions.
>>
>> Yes, that's why, for this patch in isolation, it would be better to make the adaptive algorithm
>> more platform-specific.
>>
>>>
>>> Well, I think the performance also depends on the spin count. A simple explanation:
>>> under severe lock contention with a large number of threads, the spin always fails (this fits case b) below),
>>> so the overhead increases with the spin count; the larger the spin count, the more overhead.
>>> It does not surprise me to see a performance regression with SPIN_COUNT=1000, but I am surprised
>>> that the performance does not increase with the smaller SPIN_COUNT compared to the larger one.
>>> Did you run "export LD_SPIN_COUNT=10" before the second round of tests?
>>
>> I used the preferred way, GLIBC_TUNABLES="glibc.mutex.spin_count=N" (which for the
>> current patch set version is the same). I redid the numbers to check for any
>> mishandling on my part, but the numbers are still not good enough to set this as
>> the default (even more so since 1000 is the default value for spin_count).
>>
>> nr_threads base head(SPIN_COUNT=10) head(SPIN_COUNT=1000)
>> 1 27563021 28775107 (4.212273) 28775127 (4.212339)
>> 2 8261443 9044577 (8.658603) 7792347 (-6.019958)
>> 7 4697529 4623052 (-1.610992) 3241736 (-44.907821)
>> 14 4951300 5348550 (7.427247) 2854946 (-73.428849)
>> 28 4296920 4551240 (5.587928) 3231684 (-32.962257)
>> 56 4182013 3420937 (-22.247589) 3769944 (-10.930375)
>>
>>
>>>
>>> Analysis:
>>> Let's assume the time spent in the critical section is *c*, and the time spent spinning on the lock is *s*.
>>> Then s = k*c, where *k* is a factor.
>>>
>>> a) If *k* < 1, which means the spinning time is less than the time the task spends
>>> in the critical section, it is easy to see that the overhead of acquiring the lock
>>> equals spin+wait+wake, because the spin always fails due to timeout (the large critical section case).
>>>
>>> b) If *k* > 1 and the number of threads *n* >= *k* (small critical section, many threads).
>>> In our case, the threads do nothing but compete for the lock to enter the critical section in a loop, so we
>>> can simply assume that the critical-section arrival rate of each thread is 1. Since all the threads start
>>> at the same time, we can assume they all compete for the lock simultaneously at time frame T(0):
>>>
>>> T(0) task(0) <-- task(1) <-- task(2) ... task(n) // original status
>>> |
>>> T(1) task(1) <-- task(2) ... task(n) <-- task(0) // after time c, task 0 release the lock and compete to lock again
>>> .
>>> .
>>> .
>>> T(k) task(k) <-- task(k+1) ... task(n) <-- task(0) ... task(k-1)
>>>
>>> After time k*c (time frame T(k)), task(k) would get the lock and task(k+1)...task(n) would call
>>> futex_wait() to block due to timeout; by then task(0) has been spinning for time (k-1)*c, task(1)
>>> for time (k-2)*c, ... and task(k-1) for time c.
>>>
>>> T(k+1) task(k+1) <-- task(k+2) ... task(n) <-- task(0) ... task(k-1)   // task(k+2) ... task(n) are sleeping
>>>
>>> task(k) would take some time to wake up task(k+1), which then spends time c in the critical section; this would repeat
>>> again and again until the threads exit. Therefore, the spin count strongly affects system performance!
>>>
>>> c) If *k* > 1 and the number of threads *n* < *k* (small critical section, few threads).
>>> Theoretically, each thread can get the lock by spinning without ever blocking. No wait/wake is needed, so we
>>> only need to consider the spin overhead.
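To put illustrative numbers on this analysis (not measured values): suppose the critical section takes c = 1 µs. If the spin budget is s = 0.5 µs (k = 0.5, case a), every waiter times out, so each acquisition pays spin + futex_wait + wake on top of c. If the budget is s = 10 µs (k = 10) and only n = 4 threads compete (case c), the thread at the back of the queue waits at most (n-1)*c = 3 µs < s, so nobody blocks and the only extra cost is the spinning itself. With the same budget but n = 28 threads (case b), roughly the first k = 10 waiters acquire the lock within their budget, while the rest spin for the full 10 µs and then block anyway, which is why a large spin count hurts at high thread counts.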
>>
>> I do agree that the spin count plays a role here, but for the machine
>> I am testing on I think what is actually happening is that this patch adds more
>> branches in the critical spin loop, and that hurts performance:
>>
>> Current spin loop version:
>> ---
>>   do
>>     {
>>       if (cnt++ >= max_cnt)
>>         {
>>           LLL_MUTEX_LOCK (mutex);
>>           break;
>>         }
>>       atomic_spin_nop ();
>>     }
>>   while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>> ---
>>
>> This patch changes to:
>> ---
>>   do
>>     {
>>       if (cnt >= max_cnt)
>>         {
>>           LLL_MUTEX_LOCK (mutex);
>>           break;
>>         }
>>       do
>>         {
>>           atomic_spin_nop ();
>>         }
>>       while (atomic_load_relaxed (&mutex->__data.__lock) != 0
>>              && ++cnt < max_cnt);
>>     }
>>   while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>> ---
>>
>> While the third patch in the set changes it to:
>> ---
>>   do
>>     {
>>       atomic_spin_nop ();
>>     }
>>   while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
>>          ++cnt < max_cnt);
>>
>>   if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>>     LLL_MUTEX_LOCK (mutex);
>> ---
>>
>> That's why I suggested, for the next version, squashing the second and third patches into
>> a single one aimed at optimizing the adaptive spinning algorithm.
>>
>
> Agreed, I will combine the 2nd and 3rd patches in V2.
>
>> Also, it would be better if you could provide a benchmark to stress this point without
>> resorting to an external one.
>>
>
> I will consider doing that. Preliminarily, we can provide several test cases, each with a
> different critical section size, that can be run in multi-threaded mode.
> What do you think?
I would use the pthread_mutex3/threads test from the project you used as a base; it is
simple enough and gives us a directly comparable metric. I would run it with
1, nproc/2, and nproc threads.
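For reference, here is a minimal sketch of that kind of benchmark, assuming the workload described in the commit message (lock an adaptive mutex, increment a global variable, unlock in a loop, and report total iterations after one second); the thread handling and names are illustrative only, not the will-it-scale harness itself:

---
/* Illustrative benchmark sketch; compile with: gcc -O2 -pthread.  */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static pthread_mutex_t lock;
static unsigned long global_counter;    /* Protected by LOCK.  */
static atomic_int stop;                 /* Set by main after one second.  */

static void *
worker (void *arg)
{
  unsigned long *iters = arg;
  while (!atomic_load_explicit (&stop, memory_order_relaxed))
    {
      pthread_mutex_lock (&lock);       /* a) Lock the adaptive mutex.  */
      global_counter++;                 /* b) Global variable increment.  */
      pthread_mutex_unlock (&lock);     /* c) Unlock.  */
      (*iters)++;
    }
  return NULL;
}

int
main (int argc, char **argv)
{
  int nthreads = argc > 1 ? atoi (argv[1]) : 4;   /* e.g. 1, nproc/2, nproc.  */
  pthread_t tids[nthreads];
  unsigned long iters[nthreads];

  pthread_mutexattr_t attr;
  pthread_mutexattr_init (&attr);
  pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
  pthread_mutex_init (&lock, &attr);

  for (int i = 0; i < nthreads; i++)
    {
      iters[i] = 0;
      pthread_create (&tids[i], NULL, worker, &iters[i]);
    }

  sleep (1);                            /* Measurement window.  */
  atomic_store (&stop, 1);

  unsigned long total = 0;
  for (int i = 0; i < nthreads; i++)
    {
      pthread_join (tids[i], NULL);
      total += iters[i];
    }
  printf ("%d threads: %lu iterations per second\n", nthreads, total);
  return 0;
}
---

It could then be run with 1, nproc/2 and nproc threads, and with different spin counts via the tunable mentioned earlier, e.g. GLIBC_TUNABLES="glibc.mutex.spin_count=10", assuming a glibc with this patch set applied.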