This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH 1/2] S390: Optimize atomic macros.
- From: Torvald Riegel <triegel at redhat dot com>
- To: Stefan Liebler <stli at linux dot vnet dot ibm dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Thu, 24 Nov 2016 13:51:05 +0100
- Subject: Re: [PATCH 1/2] S390: Optimize atomic macros.
- Authentication-results: sourceware.org; auth=none
- References: <1479913753-20506-1-git-send-email-stli@linux.vnet.ibm.com>
On Wed, 2016-11-23 at 16:09 +0100, Stefan Liebler wrote:
> The atomic_compare_and_exchange_val_acq macro is now implemented with
> gcc __sync_val_compare_and_swap instead of an inline assembly with
> compare-and-swap instruction.
Can you use the new GCC atomic builtins, or are they still insufficient
on s390?
Related to that, can you define USE_ATOMIC_COMPILER_BUILTINS to 1?
Generally, this is where we want to get to, so that we have to maintain
as few atomic operations as possible.
Also note that we'll be phasing out the old-style glibc atomics
eventually and use the new C11-like atomics.
> The memory is compared against expected OLDVAL before using compare-and-swap
> instruction in case of OLDVAL is constant at compile time. This is used in
> various locking code. If the lock is already aquired, another cpu has not to
> exclusively lock the memory. If OLDVAL is not constant the compare-and-swap
> instruction is used directly as the usages of this macro usually load the
> current value in front of this macro.
Generally, I think the compiler should do these optimizations (under the
assumption that we drive these using builtin_expect and the like). If
GCC not doing these (with new _atomic* builtins), is there a patch about
that or at least a BZ? I'd be okay with accepting an intermediate
solution that optimizes atomics in glibc, but I'd want to make this
clear in code comments and reference the GCC BZ that makes the
improvement request; this ensures that we can drop the glibc hack once
GCC does what you want it to do.
For this particular case, I'm hoping we can just fix this in the
compiler.
Regarding your patch, the compiler barrier is in the wrong position for
an acquire MO operation; for the new C11-like atomics, it would be
unnecessary because we only guarantee relaxed MO in the failure case.
> The same applies to atomic_compare_and_exchange_bool_acq which wasn't
> defined before. Now it is implemented with gcc __sync_bool_compare_and_swap.
> If the macro is used as condition in an if/while expression, the condition
> code is used to e.g. jump directly to another code sequence. Before this
> change, the old value returned by compare-and-swap instruction was compared
> with given OLDVAL to determine if e.g. a jump is needed.
>
> The atomic_exchange_acq macro is now using the load-and-and instruction for a
> constant zero value instead of a compare-and-swap loop. This instruction is
> available on a z196 zarch and higher cpus. This is e.g. used in unlocking code.
See above. This should be fixed in the compiler.
> The newly defined atomic_exchange_and_add macro is implemented with gcc
> builtin __sync_fetch_and_add which uses load-and-add instruction on z196 zarch
> and higher cpus instead of a loop with compare-and-swap instruction.
> The same applies to atomic_or_val, atomic_and_val, ... macros, which use
> the appropiate z196 instruction.
Can you just use the new _atomic builtins? The compiler should just
give use _atomic* builtins that are optimized, and we should use them.
With that in place, we could just implement the old-style glibc atomic
operations based on the _atomic* builtins, and have much less code to
maintain.
> The macros lll_trylock, lll_cond_trylock are extended by an __glibc_unlikely
> hint. With the hint gcc on s390 emits code in e.g. pthread_mutex_trylock
> which does not use jumps in case the lock is free. Without the hint it had
> to jump if the lock was free.
I think it's debatable whether trylock should expect the lock to be
free. OTOH, if it's not free, I guess that we should be in the slow
path in most use cases anyway. Either way, I think it's a choice we
should make generically for all architectures, and I want to keep
arch-specific code to a minimum. Thus, please split this out and
propose it for the generic lowlevellock.