This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] PowerPC: stpcpy optimization for PPC64/POWER7


On 16-09-2013 18:45, Ondřej Bílka wrote:
> On Mon, Sep 16, 2013 at 04:22:27PM -0300, Adhemerval Zanella wrote:
>> On 16-09-2013 15:33, Ondřej Bílka wrote:
> They are simpler to keep in sync than having several implementations
> that duplicate same logic.

I tend to agree, but for POWER there is a lot of scheduling tinkering that may end in
a lot of ifdefs. I may move back to a common code base when I start to analyze strlen
as well. Right now I want to focus on the optimization only, instead of a refactor plus
optimization.
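For illustration only, a hypothetical sketch of how a shared C code base with per-CPU tuning tends to accumulate conditionals. The `_ARCH_PWR7` macro is the real GCC-defined macro for `-mcpu=power7`, but the unrolling factors and the `copy_words` helper are made up here just to show where the ifdef noise comes from (real string routines would diverge on instruction scheduling in assembly, not just unrolling):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical shared inner loop: each CPU variant picks its own
   unrolling factor; in real assembly each would also want its own
   instruction scheduling, which is where the #ifdefs multiply.  */
#ifdef _ARCH_PWR7
# define UNROLL 4   /* deeper unrolling, made-up value for POWER7 */
#else
# define UNROLL 2   /* conservative default, also made up */
#endif

static void
copy_words (uint64_t *dst, const uint64_t *src, size_t nwords)
{
  size_t i = 0;
  /* Unrolled main loop.  */
  for (; i + UNROLL <= nwords; i += UNROLL)
    for (size_t j = 0; j < UNROLL; j++)
      dst[i + j] = src[i + j];
  /* Tail: remaining words one at a time.  */
  for (; i < nwords; i++)
    dst[i] = src[i];
}
```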


>
>>> Also I noted in the implementation that, provided we handle the case of
>>> writing less than 8 bytes separately, the best way to finish on x64 would
>>> be to compute the end and do an overlapping store of the last 8 bytes.
>>>
>>> There are two things I do not know. The first one is computing the end; on
>>> Wikipedia I found that this could be handled by cntlz on the mask and dividing it by 8.
>> I'm not sure if I understood your questioning. By handling '8 bytes separately'
>> you mean what exactly? The patch added the doubleword case for aligned strings,
>> where both the default algorithm and the POWER7 implementation will exit in the
>> first check after extrdi.
>>
> I mean implementations like this one. I wrote this mostly to show the
> unaligned store; there are several additional optimizations possible.
>
> char *stpcpy (char *x, char *y)
> {
>   const uint64_t ones = 0x0101010101010101ULL;
>   const uint64_t mask = 0x80 * ones;
>
>   /* handle strings less than 8 bytes and align x */
>
>   /* 0x80 in every byte of V that is zero; (v - ones) & ~v & mask
>      avoids the false positive on 0x80 bytes.  */
>   #define has_zero(v) (((v) - ones) & ~(v) & mask)
>
>   while (1)
>     {
>       uint64_t v = *(uint64_t *) x;
>       if (has_zero (v))
>         {
>           /* little-endian: first zero byte = lowest set bit / 8 */
>           int first = (ffsll (has_zero (v)) - 1) / 8;
>           /* overlapping 8-byte store ending at the terminator;
>              OK because strings shorter than 8 were handled above */
>           *(uint64_t *) (x + first + 1 - 8) = *(uint64_t *) (y + first + 1 - 8);
>           return x + first;
>         }
>       *(uint64_t *) x = *(uint64_t *) y;
>       x += 8;
>       y += 8;
>     }
> }
>
>
> uint64_t extract = 0x0102040810204080;

Well, besides the fact that the algorithm seems wrong (I tried it and it failed on test-stpcpy,
but I understand you did it as a sketch), the idea of my patch is not to optimize *unaligned*
accesses, but rather the aligned ones. I plan to investigate the unaligned case later, since
on POWER unaligned accesses are costly.
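For reference, the "cntlz on the mask and divide by 8" idea mentioned earlier can be sketched in C. On a little-endian machine the first zero byte in memory order corresponds to the *lowest* set bit of the zero mask (count trailing zeros), while on big-endian POWER it corresponds to the *highest* (cntlzd), which is one reason these bit tricks do not transplant directly between x64 and POWER. The helper names below are made up for illustration:

```c
#include <stdint.h>

/* Mask with 0x80 in every byte of V that is zero: the canonical
   (v - ones) & ~v & highbits test.  Unlike (v | mask) - ones, it has
   no false positive on 0x80 bytes.  */
static uint64_t
zero_mask (uint64_t v)
{
  const uint64_t ones = 0x0101010101010101ULL;
  const uint64_t high = 0x8080808080808080ULL;
  return (v - ones) & ~v & high;
}

/* Index (0-7) of the first zero byte in memory order, assuming the
   word was loaded little-endian: count trailing zeros of the mask and
   divide by 8.  On big-endian the equivalent would be
   __builtin_clzll (m) / 8, i.e. what cntlzd computes.  */
static int
zero_byte_index (uint64_t m)
{
  return __builtin_ctzll (m) >> 3;
}
```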


>
>> For the general implementation, the algorithm used to find '0' bytes is explained in
>> sysdeps/powerpc/powerpc64/strlen.S. I decided to use the first option in the inner
>> loops as the implementation already uses it for words. For POWER7 the algorithm
>> is simpler and the only thing we need to do is shift and check for NULLs after
>> the loop.
>>
> There is a separate issue that the strlen code is a little inefficient. In
> that loop one branch is redundant. A better implementation has the following
> loop.
> I optimized this for asymptotic performance; extracting the length could
> be done more efficiently in particular cases.
>
> size_t strlen (char *x)
> {
>   const uint64_t ones = 0x0101010101010101ULL;
>   const uint64_t mask = 0x80 * ones;
>   const uint64_t extract = 0x0102040810204080ULL;
>
>   /* handle first 16 bytes */
>
>   char *x0 = x;
>
>   /* align x */
>
>   while (1)
>     {
>       uint64_t v0 = *(uint64_t *) x;
>       uint64_t v8 = *(uint64_t *) (x + 8);
>       /* 0x80 in every byte that is zero; one branch per 16 bytes.  */
>       uint64_t r0 = (v0 - ones) & ~v0 & mask;
>       uint64_t r8 = (v8 - ones) & ~v8 & mask;
>       if (r0 | r8)
>         {
>           /* pack the per-byte flags into bits (little-endian order) */
>           int bitmask0 = ((r0 >> 7) * extract) >> 56;
>           int bitmask8 = ((r8 >> 7) * extract) >> 56;
>           int first = ffs (bitmask0 | (bitmask8 << 8)) - 1;
>           return x - x0 + first;
>         }
>       x += 16;
>     }
> }

Again, this assumes unaligned accesses are not costly, which is not the case for POWER. I'll
investigate it later, but if you check the strlen optimization for POWER7, there is a *lot*
of care to *not* do unaligned memory accesses. This is the case for all string optimizations
for POWER7.
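To make the aligned-only approach concrete: the usual technique is to round the pointer down to a word boundary for the very first load and force the bytes that precede the string start to be nonzero, so every access is naturally aligned. This is only a little-endian C sketch of the idea (the real POWER7 code is big-endian assembly and differs in detail; `strlen_aligned` is a name I made up):

```c
#include <stdint.h>
#include <stddef.h>

/* Aligned-only strlen sketch (little-endian).  The first load reads the
   whole aligned word containing s; bytes before the string start are
   forced nonzero so they cannot be mistaken for the terminator.  All
   subsequent loads are naturally aligned.  */
static size_t
strlen_aligned (const char *s)
{
  const uint64_t ones = 0x0101010101010101ULL;
  const uint64_t high = 0x8080808080808080ULL;
  uintptr_t addr = (uintptr_t) s & ~(uintptr_t) 7;
  unsigned skip = (uintptr_t) s & 7;   /* bytes before the string */
  const uint64_t *p = (const uint64_t *) addr;
  uint64_t v = *p;

  /* Little-endian: leading bytes are the low-order bytes; set them
     to 0xff so the zero test below ignores them.  */
  if (skip)
    v |= ~(~0ULL << (8 * skip));

  for (;;)
    {
      uint64_t m = (v - ones) & ~v & high;   /* 0x80 per zero byte */
      if (m)
        return (size_t) ((const char *) p + (__builtin_ctzll (m) >> 3) - s);
      v = *++p;
    }
}
```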


>
>>> Second is how slow overlapping stores are versus branch misprediction.
>>> You need a benchmark that varies sizes to check this; I could supply
>>> one.
>>>
>> You can also add the benchmark to the existing GLIBC infrastructure. For POWER7, the
>> branch predictor is quite smart, and from my experience trying to use branch
>> prediction instructions is usually not a good idea.
> I am not interested in branch prediction instructions but in the cost of
> mispredictions, in particular what the difference in time of the following benchmark
> is when we switch loops with -DRAND
>
For the current implementation, none; for my patch I see an improvement of 60% for the case
without -DRAND (the -DRAND time does not change).

