This is the mail archive of the libc-alpha mailing list for the glibc project.
Re: [PATCH] correction for PPC __compare_and_swap
- To: dmehaffy at us dot ibm dot com
- Subject: Re: [PATCH] correction for PPC __compare_and_swap
- From: Geoff Keating <geoffk at geoffk dot org>
- Date: Wed, 9 May 2001 17:07:52 -0700
- CC: libc-alpha at sources dot redhat dot com, brianmc1 at us dot ibm dot com
- References: <85256A47.007860B4.00@D51MTA03.pok.ibm.com>
- Reply-to: Geoff Keating <geoffk at redhat dot com>
> From: email@example.com
> cc: firstname.lastname@example.org
> Date: Wed, 9 May 2001 16:53:50 -0500
> While this patch technically does not fix an error in PPC
> __compare_and_swap, it does fix a performance problem with lock/unlock in
> pthreads. This was uncovered when we were fixing bugs in the kernel and
> libraries with Java running on a 4-way SMP Power 3 processor. This change
> in combination with other work did cause Java to function correctly,
> so it may have been a timing issue, but it is easy to argue that this
> should be accepted based on its performance merits alone.
Hmmm. OK, so it's not a correctness change, it's a performance
change. I can go for that.
> It is well known that syncs should be avoided whenever possible on PPC, as
> these result in broadcasts to all processors to make sure memory is coherent
> on all of them. This is very expensive. The proposed change gets
> rid of two syncs as written. The isync is local to the processor, does
> not cause any bus traffic between processors, and is all that is necessary
> in this case.
> Kaz was on the right track suggesting that the compare and swap should not
> use sync and isync at all but be just an atomic primitive and the calling
> routines should be responsible for doing the sync or isync before or after
> as appropriate.
Yes, I agree with that too.
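For a modern reader, the split suggested here (a bare compare-and-swap primitive, with the caller supplying whatever barrier its use requires) maps directly onto C11 atomics and fences. The sketch below is illustrative only; the function names are not from glibc.

```c
#include <stdatomic.h>

/* Bare CAS with no ordering at all -- the raw lwarx/stwcx. loop. */
static int bare_cas(atomic_long *p, long oldval, long newval)
{
  return atomic_compare_exchange_strong_explicit(p, &oldval, newval,
                                                 memory_order_relaxed,
                                                 memory_order_relaxed);
}

/* Acquire use (taking a lock): fence after the CAS, the isync's role. */
static int cas_acquire(atomic_long *p, long oldval, long newval)
{
  int ok = bare_cas(p, oldval, newval);
  if (ok)
    atomic_thread_fence(memory_order_acquire);
  return ok;
}

/* Release use (dropping a lock): fence before the CAS, the sync's role. */
static int cas_release(atomic_long *p, long oldval, long newval)
{
  atomic_thread_fence(memory_order_release);
  return bare_cas(p, oldval, newval);
}
```

On PowerPC a compiler is free to realize the acquire fence as isync (or lwsync) and the release fence as sync/lwsync, which is exactly the asymmetry being argued for.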
> Because the routine is used for both lock and unlock, it is
> actually better to write them as separate routines, inlining the compare
> and swap; this also gets rid of a lwarx/stwcx. pair, which is somewhat
> expensive. The optimum way to write the routines is as follows (in
> pseudo code):
> unlock: sync
>         stw 0, lock_location
>
> In the unlock case the sync is all that is necessary to make all changes
> protected by the lock globally visible. Note that no lwarx or stwcx. is
> needed to release the lock.
>
> lock:
> 1:      lwarx r5, lock_location
>         cmpiw r5, 0
>         bne 2f
>         stwcx. 1, lock_location
>         bne 1b
>         isync
> 2:      need to indicate the lock is already locked (could spin if you want
>         to in this case or put on a sleep queue)
>
> In the lock case no sync is required at the beginning because all changes
> are visible at this point from the unlocker of the lock. The isync is
> necessary to make sure the processor has done no speculative loads before
> the actual acquire of the lock.
Now I see. The isync is being used as a cheaper, local, sync. That makes sense.
> Only the processor that is getting the lock needs to get rid of the
> speculative loads, because it has the lock; the other processors don't
> care, so there is no need to do a sync, which would broadcast the
> results to all processors and make them invalidate their data.
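The lock/unlock recipe quoted above can be expressed portably with C11 atomics, where `memory_order_acquire` plays the role of the isync on the lock side and `memory_order_release` the role of the sync before the unlocking store. A minimal sketch with illustrative names (not glibc code):

```c
#include <stdatomic.h>

static atomic_long lock_word;   /* 0 = free, 1 = held */

static void spin_lock(void)
{
  long expected;
  do {
    expected = 0;
    /* The lwarx/stwcx. retry loop; acquire ordering stands in for
       the isync that stops speculative loads crossing the acquire. */
  } while (!atomic_compare_exchange_weak_explicit(&lock_word, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed));
}

static void spin_unlock(void)
{
  /* The "sync; stw 0, lock_location" unlock: a plain release store,
     no lwarx/stwcx. needed. */
  atomic_store_explicit(&lock_word, 0, memory_order_release);
}
```

As in the AIX scheme described above, this costs one release barrier and one acquire barrier per lock/unlock pair, and the unlock path performs no reservation at all.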
> This is the implementation we have been using in AIX since the first
> introduction of PowerPC SMPs. It has been found to give the optimal
> path length and cycle count for implementing locks. In this form
> there is only one sync and one isync per lock/unlock pair, as compared to
> the four syncs (which are very heavyweight) in the current code. It also
> eliminates a lwarx/stwcx. pair, which causes bus transactions that are
> expensive because of the reservation protocol.
I can't work out how to get the library to stop using compare_and_swap
to release locks... Any ideas? I don't want to touch the
machine-independent part if I can help it.
> I would be happy to supply a formal patch that implements locks this way
> for PPC (though it is very simple to implement from the pseudo code). Also
> to note this is exactly how the sample lock code supplied in the PowerPC
> Architecture book (page 254) and the PowerPC Microprocessor Family: The
> Programming Environments (page E4) suggest that locks be implemented for
> optimal performance.
I'd be very happy if people could test this patch on an SMP system and
see how it does.
- Geoffrey Keating <email@example.com>
2001-05-09  Geoff Keating  <firstname.lastname@example.org>

	* linuxthreads/sysdeps/powerpc/pt-machine.h
	(__compare_and_swap): Remove memory barriers.
	(__compare_and_swap_with_release_semantics): New function.
RCS file: /cvs/glibc/libc/linuxthreads/sysdeps/powerpc/pt-machine.h,v
retrieving revision 1.7
diff -p -u -u -p -r1.7 pt-machine.h
--- pt-machine.h 2000/07/22 02:24:23 1.7
+++ pt-machine.h 2001/05/10 00:06:21
@@ -1,6 +1,6 @@
/* Machine-dependent pthreads configuration and inline functions.
- Copyright (C) 1996, 1997, 1998, 2000 Free Software Foundation, Inc.
+ Copyright (C) 1996, 1997, 1998, 2000, 2001 Free Software Foundation, Inc.
This file is part of the GNU C Library.
The GNU C Library is free software; you can redistribute it and/or
@@ -38,20 +38,14 @@ register char * stack_pointer __asm__ ("
 /* Compare-and-swap for semaphores. */
 /* note that test-and-set(x) is the same as !compare-and-swap(x, 0, 1) */
 PT_EI int
 __compare_and_swap (long int *p, long int oldval, long int newval)
 {
   int ret;
 
-  MEMORY_BARRIER ();
   __asm__ __volatile__ (
    "0:    lwarx %0,0,%1 ;"
    "      xor. %0,%3,%0;"
@@ -62,6 +56,26 @@ __compare_and_swap (long int *p, long in
 	: "r"(p), "r"(newval), "r"(oldval)
 	: "cr0", "memory");
-  MEMORY_BARRIER ();
-  return ret == 0;
+  __asm__ __volatile__ ("isync" : : : "memory");
+  return ret == 0;
 }
+
+PT_EI int
+__compare_and_swap_with_release_semantics (long int *p,
+					   long int oldval, long int newval)
+{
+  int ret;
+
+  MEMORY_BARRIER ();
+  __asm__ __volatile__ (
+   "0:    lwarx %0,0,%1 ;"
+   "      xor. %0,%3,%0;"
+   "      bne 1f;"
+   "      stwcx. %2,0,%1;"
+   "      bne- 0b;"
+   "1:    "
+	: "=&r"(ret)
+	: "r"(p), "r"(newval), "r"(oldval)
+	: "cr0", "memory");
+  return ret == 0;
+}
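On today's toolchains the two variants in the patch correspond directly to the ordering argument of the GCC `__atomic` builtins; the sketch below assumes that mapping and is not glibc code.

```c
/* Acquire flavor: like __compare_and_swap followed by isync. */
static int cas_acquire_semantics(long *p, long oldval, long newval)
{
  return __atomic_compare_exchange_n(p, &oldval, newval, 0 /* strong */,
                                     __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

/* Release flavor: like __compare_and_swap_with_release_semantics,
   where the barrier precedes the store-conditional. */
static int cas_release_semantics(long *p, long oldval, long newval)
{
  return __atomic_compare_exchange_n(p, &oldval, newval, 0 /* strong */,
                                     __ATOMIC_RELEASE, __ATOMIC_RELAXED);
}
```

This keeps the machine-independent callers oblivious to the barrier choice, which is the same goal the patch pursues by offering two named entry points.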