This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH 1/2] Optimize generic spinlock code and use C11 like atomic macros.


On 03/22/2017 01:56 PM, Szabolcs Nagy wrote:
> On 21/03/17 15:43, Stefan Liebler wrote:
>> On 03/14/2017 04:55 PM, Stefan Liebler wrote:
>>> Okay. I've attached an updated patch. It now uses case 2).
>>> This choice applies to pthread_spin_trylock.c and the first attempt to
>>> acquire the lock in pthread_spin_lock.c.
>>> Therefore I've introduced ATOMIC_EXCHANGE_USES_CAS for all architectures
>>> in the atomic-machine.h files. There is a check in include/atomic.h which
>>> ensures that it is defined to either 0 or 1. Can you please review the
>>> setting of 0 or 1?
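
(For reference, the check described here amounts to roughly the following; this is a sketch, and the exact error messages in include/atomic.h may differ:)

  /* Each architecture's atomic-machine.h states whether atomic_exchange
     is implemented with a CAS loop (1) or a true atomic exchange (0).  */
  #ifndef ATOMIC_EXCHANGE_USES_CAS
  # error ATOMIC_EXCHANGE_USES_CAS is not defined.
  #elif ATOMIC_EXCHANGE_USES_CAS != 0 && ATOMIC_EXCHANGE_USES_CAS != 1
  # error ATOMIC_EXCHANGE_USES_CAS must be defined to either 0 or 1.
  #endif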

>>> Bye Stefan
>> Ping


> the aarch64 changes look ok to me (but this is
> something that ideally would be benchmarked on real
> hw with an interesting workload, and i haven't done
> that because it is non-trivial)
>
> on a trivial benchmark it seems to be a bit better
> than the current code.

This sounds good.
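
For what it's worth, a trivial uncontended benchmark of this kind can be as simple as the following (illustrative only, not necessarily the benchmark used above):

  #include <pthread.h>
  #include <stdio.h>
  #include <time.h>

  int
  main (void)
  {
    pthread_spinlock_t lock;
    struct timespec start, end;
    const int iters = 10000000;

    pthread_spin_init (&lock, PTHREAD_PROCESS_PRIVATE);
    clock_gettime (CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iters; i++)
      {
        /* Uncontended: take and release the lock immediately.  */
        pthread_spin_lock (&lock);
        pthread_spin_unlock (&lock);
      }
    clock_gettime (CLOCK_MONOTONIC, &end);
    printf ("%.1f ns per lock/unlock pair\n",
            ((end.tv_sec - start.tv_sec) * 1e9
             + (end.tv_nsec - start.tv_nsec)) / iters);
    pthread_spin_destroy (&lock);
    return 0;
  }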

> the performance of the uncontended case can be improved
> slightly by reverting the unlock change (the release
> store is stronger than the barrier was: conceptually
> there is a barrier before and after an armv8 release
> store, preventing an independent load-acquire from being
> reordered with it in either direction)

So you mean something like the following?
  atomic_thread_fence_release ();
  atomic_store_relaxed (lock, 0);
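
In the generic pthread_spin_unlock.c this would amount to (a sketch of the whole function):

  int
  pthread_spin_unlock (pthread_spinlock_t *lock)
  {
    atomic_thread_fence_release ();
    atomic_store_relaxed (lock, 0);
    return 0;
  }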
(Info: I've used scripts/build-many-glibcs.py to get the following objdumps. The sysdeps/powerpc/nptl/pthread_spin_unlock.c uses atomic_store_release, too; for the powerpc64-linux-gnu objdumps below, I've removed the powerpc-specific spinlock implementation so the generic code is built and the differences are visible.)
=>aarch64-linux-gnu
0000000000000000 <pthread_spin_unlock>:
   0:   d5033bbf        dmb     ish
   4:   b900001f        str     wzr, [x0]
   8:   52800000        mov     w0, #0x0                        // #0
   c:   d65f03c0        ret
=>powerpc64-linux-gnu:
0000000000000000 <.pthread_spin_unlock>:
   0:	7c 69 1b 78 	mr      r9,r3
   4:	7c 20 04 ac 	lwsync
   8:	39 40 00 00 	li      r10,0
   c:	38 60 00 00 	li      r3,0
  10:	91 49 00 00 	stw     r10,0(r9)
  14:	4e 80 00 20 	blr

Here is the upstream code for comparison:
  atomic_full_barrier ();
  *lock = 0;
=>aarch64-linux-gnu
0000000000000000 <pthread_spin_unlock>:
   0:   aa0003e1        mov     x1, x0
   4:   d5033bbf        dmb     ish
   8:   52800000        mov     w0, #0x0                        // #0
   c:   b900003f        str     wzr, [x1]
  10:   d65f03c0        ret
=>powerpc64-linux-gnu:
0000000000000000 <.pthread_spin_unlock>:
   0:	7c 69 1b 78 	mr      r9,r3
   4:	7c 00 04 ac 	hwsync
   8:	39 40 00 00 	li      r10,0
   c:	38 60 00 00 	li      r3,0
  10:	91 49 00 00 	stw     r10,0(r9)
  14:	4e 80 00 20 	blr

And here is the code from my latest patch:
  atomic_store_release (lock, 0);
=>aarch64-linux-gnu
0000000000000000 <pthread_spin_unlock>:
   0:   889ffc1f        stlr    wzr, [x0]
   4:   52800000        mov     w0, #0x0                        // #0
   8:   d65f03c0        ret
=>powerpc64-linux-gnu:
0000000000000000 <.pthread_spin_unlock>:
   0:	7c 69 1b 78 	mr      r9,r3
   4:	7c 20 04 ac 	lwsync
   8:	39 40 00 00 	li      r10,0
   c:	38 60 00 00 	li      r3,0
  10:	91 49 00 00 	stw     r10,0(r9)
  14:	4e 80 00 20 	blr

> power consumption of a contended spin lock on armv8
> can be improved using a send-event/wait-event mechanism,
> but then the atomic_spin_nop needs to be in a loop with
> an ll/sc pair, not with a relaxed load.
> (i guess we can introduce a target-specific spinlock
> if this turns out to be relevant)
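
For reference, the canonical armv8 wait-event acquire loop pairs wfe with a load-exclusive, roughly like this (a sketch; w2 holds the locked value 1). A plain relaxed load never arms the exclusive monitor, so a wfe next to it would have no guaranteed wake-up:

  	sevl			// set a local event so the first wfe falls through
  1:	wfe			// wait for an event (e.g. monitor cleared by a store)
  2:	ldaxr	w1, [x0]	// load-acquire-exclusive arms the monitor
  	cbnz	w1, 1b		// lock still held: wait again
  	stxr	w1, w2, [x0]	// try to claim the lock (w2 == 1)
  	cbnz	w1, 2b		// exclusive access lost: retry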


> (git apply complained about an extra newline at the
> end of atomic.h)

Oops. I'll fix it in the next version.

Thanks.
Stefan

