Re: PowerPC: memset optimization for POWER8/PPC64
- From: Adhemerval Zanella <azanella@linux.vnet.ibm.com>
- To: libc-alpha@sourceware.org
- Date: Mon, 21 Jul 2014 10:17:07 -0300
- Subject: Re: PowerPC: memset optimization for POWER8/PPC64
- References: <53C920CD.8030506@linux.vnet.ibm.com> <53C94952.4010805@twiddle.net>
Hi Richard,
Thanks for the review.
On 18-07-2014 13:20, Richard Henderson wrote:
> On 07/18/2014 06:27 AM, Adhemerval Zanella wrote:
>> + andi. r11,r10,r15 /* Check alignment of DST. */
> s/r15/15/
>
> I had to read that line several times before I noticed the I in ANDI, and that
> this wasn't in fact a read of the uninitialized r15. (Stupid ppc
> non-enforcement of registers vs integers syntax...)
Thanks, I have fixed it.
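For reference, the corrected line masks the low four bits of DST with the
immediate 15 (the dot form andi. also records the result in cr0):

	andi.	r11,r10,15	/* Check alignment of DST, r11 = DST & 0xF.  */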
>
>> + mtocrf 0x01,r0
>> + clrldi r0,r0,60
>> +
>> + /* Get DST aligned to 16 bytes. */
>> +1: bf 31,2f
>> + stb r4,0(r10)
>> + addi r10,r10,1
>> +
>> +2: bf 30,4f
>> + sth r4,0(r10)
>> + addi r10,r10,2
>> +
>> +4: bf 29,8f
>> + stw r4,0(r10)
>> + addi r10,r10,4
>> +
>> +8: bf 28,16f
>> + std r4,0(r10)
>> + addi r10,r10,8
>> +
>> +16: subf r5,r0,r5
> As clever as this is, surely it is less efficient than using the unaligned
> store hardware. You know that there are at least 32 bytes to be written; you
> could just do two unaligned std and then realign.
In fact, in this case it only needs to write 1 to 15 bytes, based on the
'clrldi' result. And although POWER8 generally handles unaligned stores with
performance equivalent to aligned ones, in some cases it will either:

* break an unaligned store into multiple internal operations (misaligned
flushes when crossing a 128-byte cache-line boundary or a 4 KB small-page
boundary); or
* trigger an alignment interrupt on caching-inhibited storage. This is why I
pushed patch 87868c2418fb74357757e3b739ce5b76b17a8929 for memcpy: if you use
memcpy on DMA-mapped memory (from a GPU, for instance), *any* unaligned store
results in an alignment interrupt. And I got reports that the X server does
exactly that (which is why that patch exists).

So I think the performance difference here, avoiding such traps, is worth it.
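For reference, the variant you describe would look roughly like the sketch
below (register usage as in the patch: r10 = DST, r4 = the splatted byte,
r5 = length; r7/r8 assumed free as scratch):

	/* Sketch only: two unaligned doubleword stores cover the first
	   16 bytes, then DST is rounded up to a 16-byte boundary and
	   the length adjusted.  */
	std	r4,0(r10)	/* Unaligned store, bytes 0-7.   */
	std	r4,8(r10)	/* Unaligned store, bytes 8-15.  */
	addi	r8,r10,16
	clrrdi	r8,r8,4		/* r8 = (DST + 16) & ~15.        */
	subf	r7,r10,r8	/* Prologue length (1-16).       */
	subf	r5,r7,r5	/* Adjust remaining length.      */
	mr	r10,r8

It is shorter, but those two std instructions are exactly the stores that can
cross a 128-byte cache line or land on caching-inhibited storage, which is
what I want to avoid here.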
>
>> + /* Write remaining 1~31 bytes. */
>> + .align 4
>> +L(tail_bytes):
>> + beqlr cr6
>> +
>> + srdi r7,r11,4
>> + clrldi r8,r11,60
>> + mtocrf 0x01,r7
> Likewise.
>
>
> r~
>