This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] PowerPC: stpcpy optimization for PPC64/POWER7


On 16-09-2013 15:33, Ondřej Bílka wrote:
> On Mon, Sep 16, 2013 at 11:30:06AM -0300, Adhemerval Zanella wrote:
>> Hi all,
>>
>> Following Alan Modra's suggestion, this is a stpcpy optimization patch for PPC64.
>> This patch optimizes the default PPC64 implementation by adding doubleword
>> stores/loads, increasing aligned throughput for large sizes.
>>
> An obvious question here is why it needs to be kept separate from the strcpy
> implementation. The sysdeps/powerpc/powerpc64/st[pr]cpy.S files are quite
> similar and I do not see a sysdeps/powerpc/powerpc64/power7/strcpy.S
> file. Would the same optimization apply to strcpy?

Thanks for the review, Ondřej. I haven't checked strcpy yet, but I plan to. Regarding
the implementation, stpcpy and strcpy are indeed quite similar, but I think assembly
implementations are already more difficult to maintain, and throwing in a lot of
ifdefs won't make it any easier.

>
> Also, I noted in the implementation that, provided we handle the case of
> writing less than 8 bytes separately, the best way to finish on x64 would
> be to compute the end and do an overlapping store of the last 8 bytes.
>
> There are two things I do not know. The first one is computing the end; on Wikipedia
> I found that this could be handled by cntlz on the mask and dividing it by 8.
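
[Editor's sketch, not part of the patch.] The idea quoted above can be
illustrated in portable C. Everything here is an assumption made for
illustration: the names zero_byte_mask, load_be64, first_zero_index and
stpcpy_overlap_sketch are hypothetical, strlen stands in for the doubleword
scan, and __builtin_clzll (a GCC/Clang builtin) stands in for cntlzd. The
mask used is the exact per-byte form, so that counting leading zeros cannot
land on a false positive.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Exact mask: 0x80 in every byte of x that is 0x00, 0x00 elsewhere. */
static inline uint64_t zero_byte_mask(uint64_t x)
{
    const uint64_t m = 0x7F7F7F7F7F7F7F7FULL;
    return ~(((x & m) + m) | x | m);
}

/* Load 8 bytes so that p[0] becomes the most significant byte
   (a big-endian view, independent of the host byte order). */
static inline uint64_t load_be64(const unsigned char *p)
{
    uint64_t x = 0;
    for (int i = 0; i < 8; i++)
        x = (x << 8) | p[i];
    return x;
}

/* Index (0..7) of the first zero byte in memory order, or 8 if none.
   Count leading zero bits of the mask and divide by 8.
   Assumption: __builtin_clzll plays the role of cntlzd here. */
static inline unsigned first_zero_index(const unsigned char *p)
{
    uint64_t mask = zero_byte_mask(load_be64(p));
    if (mask == 0)
        return 8;
    return (unsigned)__builtin_clzll(mask) / 8;
}

/* Hypothetical stpcpy variant: strings shorter than 8 bytes (including
   the terminator) are handled separately; longer ones are copied in
   8-byte chunks and finished with one 8-byte store that ends exactly
   at the terminator and may overlap the previous chunk. */
static char *stpcpy_overlap_sketch(char *dst, const char *src)
{
    size_t end = strlen(src);          /* stand-in for the doubleword scan */

    if (end < 7) {                     /* fewer than 8 bytes in total */
        memcpy(dst, src, end + 1);
        return dst + end;
    }
    for (size_t i = 0; i + 8 <= end + 1; i += 8)
        memcpy(dst + i, src + i, 8);   /* whole doublewords */

    memcpy(dst + end - 7, src + end - 7, 8);  /* overlapping final store */
    return dst + end;
}
```

The final store replaces a byte-by-byte tail loop with a single unaligned
8-byte write; whether that beats a branchy tail on a given core is the
benchmarking question raised in this thread.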

I'm not sure I understood your question. What exactly do you mean by handling
'8 bytes separately'? The patch added the doubleword case for aligned strings,
where both the default algorithm and the POWER7 implementation will exit at the
first check after the extrdi. instruction.

For the general implementation, the algorithm used to find '0' bytes is explained in
sysdeps/powerpc/powerpc64/strlen.S. I decided to use the first option in the inner
loops, as the implementation already uses it for words. For POWER7 the algorithm
is simpler, and the only thing we need to do is shift and check for NULs after
the loop.
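
[Editor's sketch, not part of the patch.] Assuming "the first option" refers
to the usual subtract-and-mask test widened from words to doublewords, it can
be written in C as below. The function name has_zero_byte is hypothetical,
and this is an illustration of the general technique rather than the code in
the patch.

```c
#include <stdint.h>

/* Nonzero if and only if at least one byte of x is 0x00.
   Subtracting 0x01 from every byte flips the high bit of a byte from
   0 to 1 only if that byte was 0x00 or a borrow came in from a less
   significant zero byte; "& ~x" keeps only bytes whose high bit was
   originally clear. The result answers "is there a zero byte?" but
   individual bits above the least significant zero byte can be set
   spuriously, so it is a test, not an exact position mask. */
static inline uint64_t has_zero_byte(uint64_t x)
{
    return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
}
```

An inner loop would load one aligned doubleword, store it if this test
returns zero, and otherwise drop out to locate the terminator within that
doubleword.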

> Second is how slow are overlapping stores versus branch misprediction.
> You need a benchmark that will vary sizes to check this, I could supply
> one.
>
You can also add the benchmark to the existing GLIBC infrastructure. For POWER7, the
branch prediction is quite smart, and from my experience trying to use branch
prediction instructions is usually not a good idea.

