This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH 3/3] powerpc: Use default st{r,p}cpy optimization for POWER7



On 29-07-2015 11:12, Steven Munroe wrote:
> On Tue, 2015-07-28 at 21:52 -0300, Adhemerval Zanella wrote:
>> Following the discussion with Ondrej and recent changes to the default
>> st{r,p}cpy algorithm, this patch uses it for both powerpc64 and
>> powerpc64/power7 instead of the optimized ones (which will be removed).
>> This is faster in all but a few inputs (mostly with very short sizes)
>> for benchtests.
>>
>> It removes the default powerpc64 st{r,p}cpy and uses the same
>> optimization, since the powerpc64 optimized algorithm only uses a
>> slightly optimized path when both source and destination are doubleword
>> aligned, and resorts to byte-by-byte access for unaligned inputs.
>>
> 
> Hold off for a bit on this. There is some concern that the benchmark used
> to justify this optimization may not be representative. We need time to
> review the code and the benchmark before accepting this change.
> 
> 

Indeed, but without unaligned access (and fast hardware support for it) I
think the memcpy (... strlen) approach is still better.  The current POWER7
st{r,p}cpy algorithm has 3 paths: source and destination both doubleword
aligned, source and destination both word aligned, and unaligned.  The
aligned paths use doubleword/word reads and writes, checking for the null
byte with 'cmpb'.  The unaligned path uses aligned accesses and shifts.
I tried removing the doubleword/word aligned paths, to eliminate some
branches and use only the unaligned path, and although the results look
slightly better, memcpy (... strlen) is still faster.
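
For reference, the memcpy (... strlen) strategy can be sketched in plain C
roughly as below.  This is only an illustration of the idea, not the glibc
source: the names sketch_strcpy and sketch_stpcpy are made up here, and the
real implementation is free to differ in detail.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a glibc symbol: stpcpy built from
   strlen + memcpy.  Returns a pointer to the terminating null
   byte written into dst.  */
static char *sketch_stpcpy(char *dst, const char *src)
{
    size_t len = strlen(src);      /* one pass to find the length */
    memcpy(dst, src, len + 1);     /* bulk copy, including the null byte */
    return dst + len;
}

/* Hypothetical helper, not a glibc symbol: strcpy built the same
   way.  memcpy returns dst, which is also what strcpy returns.  */
static char *sketch_strcpy(char *dst, const char *src)
{
    return memcpy(dst, src, strlen(src) + 1);
}
```

The string is traversed twice, but each pass is a single tight loop, and the
copy no longer has to look for the terminator while moving data.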

This is because the strlen code has only one path, always uses aligned
accesses, and is slightly simpler, requiring fewer cycles per byte.
memcpy is also faster, and for large strings it speeds up further because
it uses VSX instead of plain loads/stores.
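
To illustrate what a strlen loop that only uses aligned accesses looks like,
here is a rough portable C sketch.  It is not the POWER7 assembly: the
names has_zero_byte and sketch_strlen are hypothetical, the bit trick stands
in for 'cmpb', and the byte-wise prologue is a simplification I am assuming
for readability.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if any byte of w is zero.  Portable stand-in for the
   null check that 'cmpb' does in one instruction.  */
static int has_zero_byte(uint64_t w)
{
    return ((w - 0x0101010101010101ULL) & ~w
            & 0x8080808080808080ULL) != 0;
}

/* Hypothetical sketch, not the glibc code.  */
static size_t sketch_strlen(const char *s)
{
    const char *p = s;

    /* Step byte by byte until p is doubleword aligned.  */
    while (((uintptr_t) p & 7) != 0) {
        if (*p == '\0')
            return (size_t) (p - s);
        p++;
    }

    /* Main loop: one aligned doubleword per iteration.  An aligned
       8-byte load never crosses a page boundary, which is why
       assembly versions can read past the terminator; strictly
       speaking ISO C does not permit this over-read.  */
    for (;;) {
        uint64_t w;
        memcpy(&w, p, sizeof w);   /* aligned load without aliasing issues */
        if (has_zero_byte(w))
            break;
        p += 8;
    }

    /* Locate the null byte inside the final doubleword.  */
    while (*p != '\0')
        p++;
    return (size_t) (p - s);
}
```

There is no dependency on the destination alignment here, so a single loop
covers every input.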

The POWER8 code, which uses unaligned accesses, is faster than this strategy
(according to the bench output I got on a POWER8 machine).

