This is the mail archive of the mailing list for the glibc project.


Re: [PATCH] correction for PPC __compare_and_swap

While this patch does not technically fix an error in PPC
__compare_and_swap, it does fix a performance problem with lock/unlock in
pthreads.  This was uncovered while we were fixing bugs in the kernel and
libraries with Java running on a 4-way SMP Power 3 system.  This change,
in combination with other work, made Java function correctly, so the
original failure may have been a timing issue, but it is easy to argue
that the patch should be accepted on its performance merits alone.

It is well known that syncs should be avoided whenever possible on PPC, as
they result in broadcasts to all processors to make sure memory is coherent
everywhere.  This is very expensive.  The proposed change gets rid of two
syncs as written.  The isync is local to the processor, causes no bus
traffic between processors, and is all that is necessary in this case.

Kaz was on the right track in suggesting that the compare and swap should
not use sync and isync at all, but should be just an atomic primitive, with
the calling routines responsible for doing the sync or isync before or
after as appropriate.  Because the routine is used for both lock and
unlock, it is actually better to write them as separate routines, inlining
the compare and swap, which also gets rid of a lwarx/stwcx. pair; these are
also somewhat expensive.  The optimum way to write the routines is as
follows (in pseudo code):

unlock:  sync
         stw 0, lock_location

In the unlock case the sync is all that is necessary to make all changes
protected by the lock globally visible.  Note that no lwarx or stwcx. is
needed on the unlock side.

lock:
1:   lwarx   r5, lock_location
     cmpwi   r5, 0
     bne     2f
     stwcx.  1, lock_location
     bne     1b
     isync
     ...     lock acquired, continue
2:   need to indicate the lock is already locked (could spin if you want to
     in this case or put on a sleep queue)

In the lock case no sync is required at the beginning, because all changes
are already visible at this point from the unlocker of the lock.  The isync
is necessary to make sure the processor has done no speculative loads
before the actual acquisition of the lock.  Only the processor that is
getting the lock needs to discard the speculative loads, because it has the
lock; the other processors don't care, so there is no need to do a sync,
which would broadcast the results to all processors, making them invalidate
their data.

This is the implementation we have used in AIX since the first introduction
of PowerPC SMPs.  It has been found to give the shortest path length and
cycle count for implementing locks.  In this form there is only one sync
and one isync per lock/unlock pair, as compared to the four syncs (which
are very heavyweight) in the current code.  It also eliminates a
lwarx/stwcx. pair, which causes bus transactions that are expensive because
of the reservation protocol.

I would be happy to supply a formal patch that implements locks this way
for PPC (though it is very simple to implement from the pseudo code).  Note
also that this is exactly how the sample lock code in the PowerPC
Architecture book (page 254) and in PowerPC Microprocessor Family: The
Programming Environments (page E4) suggests that locks be implemented for
optimal performance.


Dr. David W. Mehaffy                     Senior Technical Staff Member
AIX System Architecture & Linux Affinity
Voice:  512-838-3476  Tie: 678-3476
B/905-8H006  MS/9586
11400 Burnet Road
Austin, TX 78758
