[PATCH] powerpc64le: add optimized strlen for P9

Paul E Murphy murphyp@linux.ibm.com
Fri May 29 16:26:14 GMT 2020


V3 is attached with changes to formatting and a couple of 
simplifications as noted below.

On 5/27/20 11:45 AM, Paul A. Clarke wrote:
> On Thu, May 21, 2020 at 02:10:48PM -0500, Paul E. Murphy via Libc-alpha wrote:

>> +/* Implements the function
>> +
>> +   int [r3] strlen (void *s [r3])
> 
> const void *s?

Fixed, alongside folding away the mr r3,r4.  Likewise, I addressed the
basic GNU formatting requests and removed some of the more redundant
ones.  Thank you for the suggested changes.
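
For context, the bracketed register annotations in the quoted comment
follow the powerpc64le ELFv2 ABI, where the first argument and the
return value both use r3.  The C-level behaviour being documented is
the usual one (glibc's actual declaration takes a const char * and
returns size_t).  A minimal reference sketch, with a function name of
my own choosing, just to pin down the semantics the assembly has to
match:

size_t
strlen_reference (const char *s)
{
  /* Count bytes up to, but not including, the terminating null.
     The optimized routine computes the same result, scanning the
     string 64 bytes at a time.  */
  const char *p = s;
  while (*p != '\0')
    p++;
  return (size_t) (p - s);
}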

>> +	.p2align 5
>> +L(loop_64b):
>> +	lxv	  v1+32, 0(r4)  /* Load 4 quadwords.  */
>> +	lxv	  v2+32, 16(r4)
>> +	lxv	  v3+32, 32(r4)
>> +	lxv	  v4+32, 48(r4)
>> +	vminub	  v5,v1,v2  /* Compare and merge into one VR for speed.  */
>> +	vminub	  v6,v3,v4
>> +	vminub	  v7,v5,v6
>> +	vcmpequb. v7,v7,v18  /* Check for NULLs.  */
>> +	addi	  r4,r4,64  /* Adjust address for the next iteration.  */
>> +	bne	  cr6,L(vmx_zero)
>> +
>> +	lxv	  v1+32, 0(r4)  /* Load 4 quadwords.  */
>> +	lxv	  v2+32, 16(r4)
>> +	lxv	  v3+32, 32(r4)
>> +	lxv	  v4+32, 48(r4)
>> +	vminub	  v5,v1,v2  /* Compare and merge into one VR for speed.  */
>> +	vminub	  v6,v3,v4
>> +	vminub	  v7,v5,v6
>> +	vcmpequb. v7,v7,v18  /* Check for NULLs.  */
>> +	addi	  r4,r4,64  /* Adjust address for the next iteration.  */
>> +	bne	  cr6,L(vmx_zero)
>> +
>> +	lxv	  v1+32, 0(r4)  /* Load 4 quadwords.  */
>> +	lxv	  v2+32, 16(r4)
>> +	lxv	  v3+32, 32(r4)
>> +	lxv	  v4+32, 48(r4)
>> +	vminub	  v5,v1,v2  /* Compare and merge into one VR for speed.  */
>> +	vminub	  v6,v3,v4
>> +	vminub	  v7,v5,v6
>> +	vcmpequb. v7,v7,v18  /* Check for NULLs.  */
>> +	addi	  r4,r4,64  /* Adjust address for the next iteration.  */
>> +	beq	  cr6,L(loop_64b)
> 
> Curious how much this loop unrolling helps, since it adds a fair bit of
> redundant code?

It does seem to help a little, though that may just be an artifact of
the benchsuite.
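
For anyone following the quoted loop: vminub takes the unsigned
byte-wise minimum, so if any of the four quadwords contains a zero
byte, the merged vector contains one too, and a single
vcmpequb./branch pair covers the whole 64-byte block.  A rough C
sketch of the same reduction using Altivec/VSX intrinsics (check64 is
a made-up name, not part of the patch):

#include <altivec.h>
#include <stdbool.h>

/* Illustrative only: reduce four 16-byte vectors to one with vec_min,
   then do a single compare against zero.  Returns true if any of the
   64 bytes at P is zero.  */
static bool
check64 (const unsigned char *p)
{
  vector unsigned char v1 = vec_xl (0, p);
  vector unsigned char v2 = vec_xl (16, p);
  vector unsigned char v3 = vec_xl (32, p);
  vector unsigned char v4 = vec_xl (48, p);

  /* If any input byte is zero, the byte-wise minimum is zero too.  */
  vector unsigned char m = vec_min (vec_min (v1, v2), vec_min (v3, v4));

  /* The assembly uses vcmpequb. and tests CR6; vec_any_eq is the
     intrinsic-level equivalent of that check.  */
  return vec_any_eq (m, vec_splats ((unsigned char) 0));
}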

> 
>> +
>> +L(vmx_zero):
>> +	/* OK, we found a null byte.  Let's look for it in the current 64-byte
>> +	   block and mark it in its corresponding VR.  */
>> +	vcmpequb  v1,v1,v18
>> +	vcmpequb  v2,v2,v18
>> +	vcmpequb  v3,v3,v18
>> +	vcmpequb  v4,v4,v18
>> +
>> +	/* We will now 'compress' the result into a single doubleword, so it
>> +	   can be moved to a GPR for the final calculation.  First, we
>> +	   generate an appropriate mask for vbpermq, so we can permute bits into
>> +	   the first halfword.  */
> 
> I'm wondering (without having verified) if you can do something here akin to
> what's done in the "tail" sections below, using "vctzlsbb".

It does not help when the content spans more than one VR.  I don't
think there is much to improve for a 64b mask reduction.  However, we
can save a couple of cycles below by using cnttzd (new in ISA 3.0).
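
To spell out the tail calculation: vbpermq compresses the four
vcmpequb results into a 64-bit mask in a GPR, one bit per byte of the
64-byte block, and counting trailing zeros of that mask (cnttzd on
ISA 3.0) gives the position of the first null byte within the block.
Setting aside the bit-ordering details vbpermq has to deal with on
little-endian, the arithmetic amounts to something like the sketch
below (length_from_mask and its parameters are illustrative, not from
the patch):

#include <stdint.h>
#include <stddef.h>

/* BASE is the offset of the 64-byte block within the string;
   ZERO_MASK has bit i set iff byte i of that block is zero.  The
   mask is known to be non-zero here, because this path is only
   reached after the compare/branch found a null byte.  On ISA 3.0,
   __builtin_ctzll compiles down to cnttzd.  */
static size_t
length_from_mask (size_t base, uint64_t zero_mask)
{
  return base + (size_t) __builtin_ctzll (zero_mask);
}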
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-powerpc64le-add-optimized-strlen-for-P9.patch
Type: text/x-patch
Size: 10145 bytes
Desc: not available
URL: <https://sourceware.org/pipermail/libc-alpha/attachments/20200529/a55d3596/attachment.bin>

