This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH 1/2] Optimize generic spinlock code and use C11 like atomic macros.


On 03/14/2017 04:55 PM, Stefan Liebler wrote:
On 02/20/2017 02:51 PM, Torvald Riegel wrote:
On Mon, 2017-02-20 at 13:15 +0100, Stefan Liebler wrote:
A while ago we tried hard to remove all code that would fail silently
when a macro had a typo in the name, for example.  We still have a few
of them left (e.g., in the atomics), but I think this here doesn't
warrant an exception.

Okay. You're right.

In a comment on patch 2/2, you've mentioned a header where an
architecture shall define those parameters.
Thus I've added the file nptl/pthread_spin_parameters.h.

I think the stub version should be in sysdeps/ but the wiki is missing
this information (default ENOTSUP location):
https://sourceware.org/glibc/wiki/Style_and_Conventions#Proper_sysdeps_Location

In case we need a new header, shall we move it to sysdeps/nptl/ ?

I would guess that this would be the right place for a stub / generic
variant of such a header, but I'm not quite sure.


I'm still a bit undecided about how to deal with other archs'
pthread_spin_lock.c files that just set SPIN_LOCK_READS_BETWEEN_CMPXCHG
and then include the generic pthread_spin_lock.c.  Maybe it's best if we
just remove them in this patch of yours.

I think the archs currently using the generic implementation have just
copied the default value to get rid of the warning "machine-dependent
file should define SPIN_LOCK_READS_BETWEEN_CMPXCHG". But this is only an
assumption.
In general, removing those "wrapper"-pthread_spin_lock.c files and using
information from a header like the proposed pthread_spin_parameters.h or
atomic-machine.h is a good approach.

Okay.  So let's do that.

Done. The wrapper-pthread_spin_lock.c files in sysdeps subfolders are
removed and the option SPIN_LOCK_READS_BETWEEN_CMPXCHG is not available
anymore.
Instead there is only a loop of plain reads until we observe that the
lock is not acquired.
See comment in nptl/pthread_spin_lock.c.
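
To illustrate the approach (a rough sketch in plain C11 atomics, not the
exact code from the patch; the function name is just for this example):

#include <stdatomic.h>

/* Sketch: try once with an atomic exchange; under contention, spin with
   plain (relaxed) loads only and retry the exchange once the lock has
   been observed free.  */
static void
spin_lock_sketch (atomic_int *lock)
{
  while (atomic_exchange_explicit (lock, 1, memory_order_acquire) != 0)
    {
      /* Spin without writing to the cache line while the lock is held.  */
      while (atomic_load_explicit (lock, memory_order_relaxed) != 0)
        ;
    }
}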

I agree that there are architecture-specific properties of atomic
operations that have a significant effect on performance.  For s390, you
list the following points:
* atomic exchange (I suppose all atomic read-modify-write ops?) is
implemented through CAS.
Atomic exchange is implemented by a CAS loop due to the lack of an
exchange instruction.  For exchanging to "0", s390 can use the
load-and-and instruction instead of a CAS loop (see my atomic-machine.h
patch; for gcc, we plan to emit such a load-and-and instruction for the
builtin __atomic_exchange_n in the future).  This also saves one register
compared to the CAS loop.
Exchanging to "0" is used e.g. in the lll_unlock macro or in
malloc_consolidate.
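
For illustration, the "exchange to 0" case written with a generic C11
atomic (just a sketch; whether this actually compiles to load-and-and
depends on the compiler and arch):

#include <stdatomic.h>

/* Sketch: storing 0 and getting the old value back can be expressed as
   a fetch-and-and with 0, which z196+ can do with a single load-and-and
   instruction instead of a CAS loop.  */
static int
exchange_to_zero_sketch (atomic_int *p)
{
  return atomic_fetch_and_explicit (p, 0, memory_order_release);
}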

I guess this may then be yet another property atomic-machine.h could
expose: whether an atomic fetch-and-and exists and is not implemented
through a CAS loop.  Something like lll_unlock would then do an
exchange or fetch-and-and, and if that fails, a CAS loop.
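
A sketch of that idea, keyed off the property macros proposed further
down in this thread (everything else here is illustrative, not the real
lll_unlock):

#include <stdatomic.h>

/* Sketch: release the lock with whichever primitive is cheapest on the
   arch; a real lll_unlock would inspect the old value to decide whether
   a futex wake is needed.  */
static int
unlock_sketch (atomic_int *futex)
{
#if defined ATOMIC_FETCH_AND_USES_CAS && !ATOMIC_FETCH_AND_USES_CAS
  /* Single-instruction "exchange to 0" via fetch-and-and.  */
  return atomic_fetch_and_explicit (futex, 0, memory_order_release);
#else
  return atomic_exchange_explicit (futex, 0, memory_order_release);
#endif
}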

Starting with z196 zarch CPUs, the following instructions are available
and are used by the appropriate __atomic_fetch_xyz builtins instead of a
CAS loop: load-and-add, load-and-and, load-and-exclusive-or, load-and-or.
For information:
I will update my atomic-machine.h patch to use at least some, or all, of
the C11 atomic builtins depending on the gcc version.

Thanks.

These properties are specific to the architecture, but they are not
specific to the synchronization code (e.g., to spinlocks).  Thus, I think
such settings should be in the atomics headers (i.e., in
atomic-machine.h).
This should probably be a separate patch.  I would propose the following
macros (both are bool):
/* Nonzero iff all atomic read-modify-write operations (e.g., atomic
exchange) are implemented using a CAS loop.  */
#define ATOMIC_RMW_USES_CAS 1
Is one macro enough to describe all read-modify-write operations in
include/atomic.h?
Please also see my comment above.

Yes, it may not be enough (e.g., you say that s390 may have fetch_and
but not an exchange, but other archs may just have an exchange but no
custom fetch_and).
So maybe we should add ATOMIC_EXCHANGE_USES_CAS and
ATOMIC_FETCH_AND_USES_CAS for now, and add further ones on demand?
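
An arch's atomic-machine.h would then carry something like this (macro
names as proposed here, values purely illustrative):

/* In sysdeps/.../atomic-machine.h -- illustrative values only.  */

/* Atomic exchange is a real instruction on this arch.  */
#define ATOMIC_EXCHANGE_USES_CAS 0

/* Atomic fetch-and-and is compiled to a CAS loop on this arch.  */
#define ATOMIC_FETCH_AND_USES_CAS 1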

/* Nonzero iff CAS always assumes it will store, and thus has cache
performance effects similar to a store (e.g., it always puts the
respective cacheline into an exclusive state, even if the comparison
against the expected value fails).  */
#define ATOMIC_CAS_ALWAYS_PREPARES_FOR_STORE 1

I'm not sure whether there are architectures for which the second macro
would not be true.  It would be good to investigate this; maybe we don't
need to add this at all.

We plan that in the future gcc will emit e.g. a load-and-test instruction
in front of the CAS instruction if the old value is a constant zero.

That can be useful, but what I outlined would be about a more generic
case.  If you are going to solve this for your arch in gcc, you might
not need to address this in this patch.
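
For concreteness, the generic pattern that the second macro would gate
(a sketch in plain C11 atomics, not glibc code):

#include <errno.h>
#include <stdatomic.h>

/* Sketch: if the CAS always puts the cache line into an exclusive
   state, do a cheap relaxed load first and only issue the CAS when the
   lock has been observed free.  */
static int
trylock_load_first_sketch (atomic_int *lock)
{
#if defined ATOMIC_CAS_ALWAYS_PREPARES_FOR_STORE && ATOMIC_CAS_ALWAYS_PREPARES_FOR_STORE
  /* Avoid firing a CAS at a visibly taken lock.  */
  if (atomic_load_explicit (lock, memory_order_relaxed) != 0)
    return EBUSY;
#endif
  int expected = 0;
  return atomic_compare_exchange_weak_explicit (lock, &expected, 1,
                                                memory_order_acquire,
                                                memory_order_relaxed)
         ? 0 : EBUSY;
}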



For the spin lock specifically, we have the following possibilities:

1) If we assume that trylock will most likely succeed in practice:
* We just do an exchange.  The properties above don't matter.

2) If we want to bias towards cases where trylock succeeds, but don't
rule out contention:
* If exchange is not a CAS loop, and exchange is faster than CAS, do an
exchange.
* If exchange is a CAS loop, use a weak CAS and not an exchange so we
bail out after the first failed attempt to change the state.

3) If we expect contention to be likely:
* If CAS always brings the cache line into an exclusive state, then load
first.  Then do 2).

Do we have consensus yet on whether we assume 2 or 3 (I don't think 1
buys us anything over 2)?

Yes. Sounds good.

I read this as stating that you agree that 1) doesn't need to be
considered.  But we still need to choose between 2) and 3)  :)

I'm not quite sure what to favor because some of the code out there that
uses trylock just uses it to prevent deadlock (e.g., when trying to
acquire multiple locks at once), whereas other code uses spin_trylock to
implement their own back-off and bounded spinning.
I lean towards assuming that lock acquisitions, including the former use
case for trylock, are supposed to succeed in the common case.  That is,
I would pick 2), but I have no data to back this up.

Either way, whatever we choose, we should document and polish the cases 1-3
above and the reasoning behind our choice for one of them, and add it as
a comment to the spinlock code.


Okay. I've attached an updated patch. It is now using case 2).
This choice applies to pthread_spin_trylock.c and the first attempt to
acquire the lock in pthread_spin_lock.c.
Therefore I've introduced ATOMIC_EXCHANGE_USES_CAS for all architectures
in atomic-machine.h files. There is a check in include/atomic.h which
ensures that it is defined to either 0 or 1. Can you please review the
setting of 0 or 1?
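
Roughly, the trylock side of that choice looks like this (a sketch in
plain C11 atomics rather than the glibc-internal macros; not the patch
itself):

#include <errno.h>
#include <stdatomic.h>

/* Sketch of case 2): if exchange is a real instruction, one exchange
   decides the trylock; if exchange itself would be a CAS loop, use a
   single weak CAS and give up after the first failed attempt.  */
static int
spin_trylock_sketch (atomic_int *lock)
{
#if ATOMIC_EXCHANGE_USES_CAS
  int expected = 0;
  return atomic_compare_exchange_weak_explicit (lock, &expected, 1,
                                                memory_order_acquire,
                                                memory_order_relaxed)
         ? 0 : EBUSY;
#else
  return atomic_exchange_explicit (lock, 1, memory_order_acquire) == 0
         ? 0 : EBUSY;
#endif
}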

Bye Stefan


I'm linking Torvald Riegel's post to this spinlock mail thread, so we have all the discussions here.  See
"Re: [PATCH v3 0/6] Add support for ISO C11 threads.h"
(https://www.sourceware.org/ml/libc-alpha/2017-03/msg00615.html):

Current DR log: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2109.htm

470:
C11's requirements on implementations are weaker than what POSIX
requires. Arguably, POSIX should require what C11 requires.
We need to check the lowlevellock implementation, in particular on archs
that use LL/SC (I'm not aware of an arch with a true CAS instruction
that can fail spuriously).
Our generic lowlevellock implementation on master still uses the
old-style atomics (i.e., atomic_compare_and_exchange_bool_acq); if we move
to C11 atomics, C11 (with DR 470's proposed corrigendum applied) would
require our atomic_compare_exchange_weak_acquire, whereas current POSIX
would require the strong variant of the CAS.
Thus, what we do is okay in terms of C11.
I don't recall right now whether we have an open Austin Group bug about
POSIX allowing a spurious failure; I believe we have, or we should,
because this is also related to what memory order is required for lock
acquisitions (generally, not just trylock).
(Stefan, that's why I'm CC'ing you too, this is relevant for
pthread_spin_trylock too, but I can't remember right now whether we
already discussed this or not.)

So you mean we have to use an "atomic_compare_exchange_strong_acquire" in pthread_spin_trylock due to POSIX?
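
(In plain C11 atomics, the difference in question is roughly the
following; just a sketch, not the glibc macros:)

#include <stdatomic.h>
#include <stdbool.h>

/* Weak CAS: may fail spuriously on LL/SC archs even though the lock was
   free, so trylock could return EBUSY for an uncontended lock -- fine
   per C11 with DR 470, questionable under the current POSIX wording.  */
static bool
try_acquire_weak (atomic_int *lock)
{
  int expected = 0;
  return atomic_compare_exchange_weak_explicit (lock, &expected, 1,
                                                memory_order_acquire,
                                                memory_order_relaxed);
}

/* Strong CAS: only fails if the lock was actually observed to be taken.  */
static bool
try_acquire_strong (atomic_int *lock)
{
  int expected = 0;
  return atomic_compare_exchange_strong_explicit (lock, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed);
}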

I've found your "Bug 17428 - lll_trylock barrier semantics when lock was not acquired differ between architectures" (https://sourceware.org/bugzilla/show_bug.cgi?id=17428) and an older post: "Re: [PATCHv2] powerpc: Spinlock optimization and cleanup" (https://sourceware.org/ml/libc-alpha/2015-10/msg00012.html).

Bye
Stefan

