Re: PowerPC: memset optimization for POWER8/PPC64
- From: Adhemerval Zanella <azanella@linux.vnet.ibm.com>
- To: libc-alpha@sourceware.org
- Date: Mon, 21 Jul 2014 10:17:07 -0300
- Subject: Re: PowerPC: memset optimization for POWER8/PPC64
- References: <53C920CD.8030506@linux.vnet.ibm.com> <53C94952.4010805@twiddle.net>
Hi Richard,
Thanks for the review.
On 18-07-2014 13:20, Richard Henderson wrote:
> On 07/18/2014 06:27 AM, Adhemerval Zanella wrote:
>> + andi. r11,r10,r15 /* Check alignment of DST. */
> s/r15/15/
>
> I had to read that line several times before I noticed the I in ANDI, and that
> this wasn't in fact a read of the uninitialized r15. (Stupid ppc
> non-enforcement of registers vs integers syntax...)
Thanks, I have fixed it.
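For reference, the corrected line masks the low four bits of DST with the
immediate 15 (the dot form andi. also records the result in cr0):

	andi.	r11,r10,15	/* Check alignment of DST, r11 = DST & 0xF.  */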
>
>> + mtocrf 0x01,r0
>> + clrldi r0,r0,60
>> +
>> + /* Get DST aligned to 16 bytes. */
>> +1: bf 31,2f
>> + stb r4,0(r10)
>> + addi r10,r10,1
>> +
>> +2: bf 30,4f
>> + sth r4,0(r10)
>> + addi r10,r10,2
>> +
>> +4: bf 29,8f
>> + stw r4,0(r10)
>> + addi r10,r10,4
>> +
>> +8: bf 28,16f
>> + std r4,0(r10)
>> + addi r10,r10,8
>> +
>> +16: subf r5,r0,r5
> As clever as this is, surely it is less efficient than using the unaligned
> store hardware. You know that there are at least 32 bytes to be written; you
> could just do two unaligned std and then realign.
In fact, in this case it only needs to write 1 to 15 bytes, based on the
'clrldi' result. And although POWER8 generally handles unaligned stores with
performance equivalent to aligned ones, in some cases it will either:

* break an unaligned store into multiple internal operations (misaligned
flushes when crossing a 128-byte cache-line boundary or a 4 KB small-page
boundary); or
* trigger an alignment interrupt on caching-inhibited storage. This is why I
pushed patch 87868c2418fb74357757e3b739ce5b76b17a8929 for memcpy: if you use
memcpy on DMA-mapped memory (from a GPU, for instance), *any* unaligned store
results in an alignment interrupt. And I got reports that the X server does
exactly that (which is why that patch exists).

So I think the performance difference here, avoiding such traps, is worth it.
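For reference, the variant you describe would look roughly like the sketch
below (register usage as in the patch: r10 = DST, r4 = the splatted byte,
r5 = length; r7/r8 assumed free as scratch):

	/* Sketch only: two unaligned doubleword stores cover the first
	   16 bytes, then DST is rounded up to a 16-byte boundary and
	   the length adjusted.  */
	std	r4,0(r10)	/* Unaligned store, bytes 0-7.   */
	std	r4,8(r10)	/* Unaligned store, bytes 8-15.  */
	addi	r8,r10,16
	clrrdi	r8,r8,4		/* r8 = (DST + 16) & ~15.        */
	subf	r7,r10,r8	/* Prologue length (1-16).       */
	subf	r5,r7,r5	/* Adjust remaining length.      */
	mr	r10,r8

It is shorter, but those two std instructions are exactly the stores that can
cross a 128-byte cache line or land on caching-inhibited storage, which is
what I want to avoid here.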
>
>> + /* Write remaining 1~31 bytes. */
>> + .align 4
>> +L(tail_bytes):
>> + beqlr cr6
>> +
>> + srdi r7,r11,4
>> + clrldi r8,r11,60
>> + mtocrf 0x01,r7
> Likewise.
>
>
> r~
>