[PING] [PATCH 1/2] Optimize generic spinlock code and use C11 like atomic macros.

Stefan Liebler stli@linux.vnet.ibm.com
Tue Apr 4 10:29:00 GMT 2017


On 03/27/2017 03:08 PM, Stefan Liebler wrote:
> On 03/21/2017 04:43 PM, Stefan Liebler wrote:
>> On 03/14/2017 04:55 PM, Stefan Liebler wrote:
>>> On 02/20/2017 02:51 PM, Torvald Riegel wrote:
>>>> On Mon, 2017-02-20 at 13:15 +0100, Stefan Liebler wrote:
>>>>>>>> A while ago we tried hard to remove all code that would fail
>>>>>>>> silently when a macro had a typo in the name, for example.  We
>>>>>>>> still have a few of them left (e.g., in the atomics), but I think
>>>>>>>> this here doesn't warrant an exception.
>>>>>>>>
>>>>>>> Okay. You're right.
>>>>>>>
>>>>>>> In the comment of patch 2/2, you've mentioned a header where an
>>>>>>> architecture shall define those parameters.
>>>>>>> Thus I've added the file nptl/pthread_spin_parameters.h.
>>>>>>
>>>>>> I think the stub version should be in sysdeps/ but the wiki is
>>>>>> missing this information (default ENOTSUP location):
>>>>>> https://sourceware.org/glibc/wiki/Style_and_Conventions#Proper_sysdeps_Location
>>>>>>
>>>>> In case we need a new header, shall we move it to sysdeps/nptl/ ?
>>>>
>>>> I would guess that this would be the right place for a stub / generic
>>>> variant of such a header, but I'm not quite sure.
>>>>
>>>>>>
>>>>>> I'm still a bit undecided about how to deal with other archs'
>>>>>> pthread_spin_lock.c files that just set
>>>>>> SPIN_LOCK_READS_BETWEEN_CMPXCHG and then include the generic
>>>>>> pthread_spin_lock.c.  Maybe it's best if we just remove them in this
>>>>>> patch of yours.
>>>>>>
>>>>> I think the archs currently using the generic implementation have
>>>>> just copied the default value to get rid of the warning
>>>>> "machine-dependent file should define
>>>>> SPIN_LOCK_READS_BETWEEN_CMPXCHG".  But this is only an assumption.
>>>>> In general removing those "wrapper"-pthread_spin_lock.c files and
>>>>> using information from a header like the proposed
>>>>> pthread_spin_parameters.h or atomic-machine.h is a good approach.
>>>>
>>>> Okay.  So let's do that.
>>>>
>>> Done. The wrapper-pthread_spin_lock.c files in sysdeps subfolders are
>>> removed and the option SPIN_LOCK_READS_BETWEEN_CMPXCHG is not
>>> available anymore.
>>> Instead there is only a loop of plain reads until we observe a
>>> not acquired lock.
>>> See comment in nptl/pthread_spin_lock.c.
>>>
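[For illustration only: a minimal sketch of such a test-and-test-and-set
loop, written with the C11-like glibc atomic macros from include/atomic.h.
This is not the literal code of the patch; in particular it ignores the
ATOMIC_EXCHANGE_USES_CAS distinction that is discussed further down.]

    int
    pthread_spin_lock (pthread_spinlock_t *lock)
    {
      /* First attempt: one atomic read-modify-write op.  */
      if (atomic_exchange_acquire (lock, 1) == 0)
        return 0;

      do
        {
          /* Spin with plain (relaxed MO) loads only, so the cache line is
             not forced into an exclusive state while another thread still
             holds the lock.  */
          while (atomic_load_relaxed (lock) != 0)
            atomic_spin_nop ();
        }
      /* Retry the atomic RMW op only after observing a free lock.  */
      while (atomic_exchange_acquire (lock, 1) != 0);

      return 0;
    }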
>>>>>> I agree that there are architecture-specific properties of atomic
>>>>>> operations that have a significant effect on performance.  For s390,
>>>>>> you list the following points:
>>>>>> * atomic exchange (I suppose all atomic read-modify-write ops?) are
>>>>>> implemented through CAS.
>>>>> Atomic exchange is implemented by a CAS loop due to the lack of an
>>>>> exchange instruction. For exchanging to "0", s390 can use the
>>>>> load-and-and instruction instead of a CAS loop (see my
>>>>> atomic-machine.h patch; for gcc, we plan to emit such a load-and-and
>>>>> instruction for the builtin __atomic_exchange_n in the future). This
>>>>> also saves one register compared to the CAS loop.
>>>>> Exchanging to "0" is e.g. used by the lll_unlock macro or in
>>>>> malloc_consolidate.
>>>>
>>>> I guess this may then be yet another property atomic-machine.h could
>>>> expose: Whether an atomic fetch-and-and exists and is not implemented
>>>> through a CAS loop.  Something like lll_unlock would then do an
>>>> exchange or fetch-and-and, and if that fails, a CAS loop.
>>>>
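[Again just an illustrative sketch of that idea, written with the gcc
builtins mentioned in this thread; "lock" is a pointer to the int lock
word.  ATOMIC_EXCHANGE_USES_CAS and ATOMIC_FETCH_AND_USES_CAS are the
macro names proposed in this discussion, not macros that exist in glibc
today.]

    /* Release the lock by setting the lock word to 0.  */
    #if ATOMIC_EXCHANGE_USES_CAS && !ATOMIC_FETCH_AND_USES_CAS
      /* E.g. s390: a real load-and-and instruction with operand 0 clears
         the word without a CAS loop and saves a register.  */
      __atomic_fetch_and (lock, 0, __ATOMIC_RELEASE);
    #else
      __atomic_exchange_n (lock, 0, __ATOMIC_RELEASE);
    #endif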
>>>>> Starting with z196 zarch CPUs, the following instructions are
>>>>> available and are used by the appropriate __atomic_fetch_xyz builtins
>>>>> instead of a CAS loop: load-and-add, load-and-and,
>>>>> load-and-exclusive-or, load-and-or.
>>>>> As information:
>>>>> I will update my atomic-machine.h patch to use at least some of the
>>>>> C11 atomic builtins, or all of them, depending on the gcc version.
>>>>
>>>> Thanks.
>>>>
>>>>>> These properties are specific to the architecture, but they are not
>>>>>> specific to the synchronization code (e.g., to spinlocks).  Thus, I
>>>>>> think such settings should be in the atomics headers (i.e., in
>>>>>> atomic-machine.h).
>>>>>> This should probably be a separate patch.  I would propose the
>>>>>> following macros (both are bool):
>>>>>> /* Nonzero iff all atomic read-modify-write operations (e.g., atomic
>>>>>> exchange) are implemented using a CAS loop.  */
>>>>>> #define ATOMIC_RMW_USES_CAS 1
>>>>> Is one macro enough to describe all read-modify-write operations in
>>>>> include/atomic.h?
>>>>> Please also see my comment above.
>>>>
>>>> Yes, it may not be enough (e.g., you say that s390 may have fetch_and
>>>> but not an exchange, but other archs may just have an exchange but no
>>>> custom fetch_and).
>>>> So maybe we should add ATOMIC_EXCHANGE_USES_CAS and
>>>> ATOMIC_FETCH_AND_USES_CAS for now, and add further ones on demand?
>>>>
>>>>>> /* Nonzero iff CAS always assumes it will store, and thus has cache
>>>>>> performance effects similar to a store (e.g., it always puts the
>>>>>> respective cacheline into an exclusive state, even if the comparison
>>>>>> against the expected value fails).  */
>>>>>> #define ATOMIC_CAS_ALWAYS_PREPARES_FOR_STORE 1
>>>>>>
>>>>>> I'm not sure whether there are architectures for which the second
>>>>>> macro would not be true.  It would be good to investigate this,
>>>>>> maybe we don't need to add this at all.
>>>>>>
>>>>> We plan that in the future gcc will emit e.g. a load-and-test
>>>>> instruction in front of the CAS instruction if the old value is a
>>>>> constant zero.
>>>>
>>>> That can be useful, but what I outlined would be about a more generic
>>>> case.  If you are going to solve this for your arch in gcc, you might
>>>> not need to address this in this patch.
>>>>
>>>>>>
>>>>>>
>>>>>> For the spin lock specifically, we have the following possibilities:
>>>>>>
>>>>>> 1) If we assume that trylock will most likely succeed in practice:
>>>>>> * We just do an exchange.  The properties above don't matter.
>>>>>>
>>>>>> 2) If we want to bias towards cases where trylock succeeds, but don't
>>>>>> rule out contention:
>>>>>> * If exchange is not a CAS loop, and exchange is faster than CAS, do
>>>>>> an exchange.
>>>>>> * If exchange is a CAS loop, use a weak CAS and not an exchange so we
>>>>>> bail out after the first failed attempt to change the state.
>>>>>>
>>>>>> 3) If we expect contention to be likely:
>>>>>> * If CAS always brings the cache line into an exclusive state, then
>>>>>> load first.  Then do 2).
>>>>>>
>>>>>> Do we have consensus yet on whether we assume 2 or 3 (I don't think 1
>>>>>> buys us anything over 2)?
>>>>>>
>>>>> Yes. Sounds good.
>>>>
>>>> I read this as stating that you agree that 1) doesn't need to be
>>>> considered.  But we still need to choose between 2) and 3)  :)
>>>>
>>>> I'm not quite sure what to favor because some of the code out there
>>>> that uses trylock just uses it to prevent deadlock (e.g., when trying
>>>> to acquire multiple locks at once), whereas other code uses
>>>> spin_trylock to implement its own back-off and bounded spinning.
>>>> I lean towards assuming that lock acquisitions, including the former
>>>> use case for trylock, are supposed to succeed in the common case.
>>>> That is, I would pick 2), but I have no data to back this up.
>>>>
>>>> Either way, whatever we choose, we should document and polish the
>>>> cases 1-3 above and the reasoning behind our choice for one of them,
>>>> and add it as a comment to the spinlock code.
>>>>
>>>
>>> Okay. I've attached an updated patch. It is now using case 2).
>>> This choice applies to pthread_spin_trylock.c and the first attempt to
>>> acquire the lock in pthread_spin_lock.c.
>>> Therefore I've introduced ATOMIC_EXCHANGE_USES_CAS for all architectures
>>> in atomic-machine.h files. There is a check in include/atomic.h which
>>> ensures that it is defined to either 0 or 1. Can you please review the
>>> setting of 0 or 1?
>>>
>>> Bye Stefan
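[Illustrative sketch only, to make case 2) concrete; this is not the
literal patch.  It assumes the C11-like glibc atomic macros and the
ATOMIC_EXCHANGE_USES_CAS macro introduced by the patch; EBUSY comes from
<errno.h>.]

    int
    pthread_spin_trylock (pthread_spinlock_t *lock)
    {
    #if ATOMIC_EXCHANGE_USES_CAS
      /* The exchange would be a CAS loop anyway, so use one weak CAS and
         bail out after the first failed attempt to change the state.  */
      int expected = 0;
      if (atomic_compare_exchange_weak_acquire (lock, &expected, 1))
        return 0;
    #else
      /* A real exchange instruction exists and is the cheaper RMW op.  */
      if (atomic_exchange_acquire (lock, 1) == 0)
        return 0;
    #endif
      return EBUSY;
    }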
>> Ping
>>
>
> Ping
>

PING


