[PATCH] powerpc64le: add optimized strlen for P9
Paul E Murphy <murphyp@linux.ibm.com>
Fri May 29 16:26:14 GMT 2020
V3 is attached with changes to formatting and a couple of
simplifications as noted below.
On 5/27/20 11:45 AM, Paul A. Clarke wrote:
> On Thu, May 21, 2020 at 02:10:48PM -0500, Paul E. Murphy via Libc-alpha wrote:
>> +/* Implements the function
>> +
>> + int [r3] strlen (void *s [r3])
>
> const void *s?
Fixed, alongside folding away the mr r3,r4. Likewise, I applied the
basic GNU formatting requests and removed some of the more redundant
comments. Thank you for the suggested changes.
>> + .p2align 5
>> +L(loop_64b):
>> + lxv v1+32, 0(r4) /* Load 4 quadwords. */
>> + lxv v2+32, 16(r4)
>> + lxv v3+32, 32(r4)
>> + lxv v4+32, 48(r4)
>> + vminub v5,v1,v2 /* Compare and merge into one VR for speed. */
>> + vminub v6,v3,v4
>> + vminub v7,v5,v6
>> + vcmpequb. v7,v7,v18 /* Check for NULLs. */
>> + addi r4,r4,64 /* Adjust address for the next iteration. */
>> + bne cr6,L(vmx_zero)
>> +
>> + lxv v1+32, 0(r4) /* Load 4 quadwords. */
>> + lxv v2+32, 16(r4)
>> + lxv v3+32, 32(r4)
>> + lxv v4+32, 48(r4)
>> + vminub v5,v1,v2 /* Compare and merge into one VR for speed. */
>> + vminub v6,v3,v4
>> + vminub v7,v5,v6
>> + vcmpequb. v7,v7,v18 /* Check for NULLs. */
>> + addi r4,r4,64 /* Adjust address for the next iteration. */
>> + bne cr6,L(vmx_zero)
>> +
>> + lxv v1+32, 0(r4) /* Load 4 quadwords. */
>> + lxv v2+32, 16(r4)
>> + lxv v3+32, 32(r4)
>> + lxv v4+32, 48(r4)
>> + vminub v5,v1,v2 /* Compare and merge into one VR for speed. */
>> + vminub v6,v3,v4
>> + vminub v7,v5,v6
>> + vcmpequb. v7,v7,v18 /* Check for NULLs. */
>> + addi r4,r4,64 /* Adjust address for the next iteration. */
>> + beq cr6,L(loop_64b)
>
> Curious how much this loop unrolling helps, since it adds a fair bit of
> redundant code?
It does seem to help a little bit, though it may just be an artifact of
the benchsuite.
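
For anyone following along, here is a scalar C sketch of what the loop
computes (illustrative helper name, not part of the patch; the real
code also relies on the 64-byte alignment established earlier). The key
property is that the byte-wise minimum over the block is zero exactly
when some byte in it is zero, so the three vminub merges let a single
vcmpequb. test all 64 bytes at once:

#include <stddef.h>

/* Scalar sketch of the 64-byte scan (hypothetical helper, not from the
   patch).  The minimum over all 64 bytes is zero iff the block contains
   a null byte, which is what the vminub/vminub/vminub/vcmpequb. sequence
   computes vector-wide.  Assumes s points to a 64-byte-aligned block
   inside the string, as the assembly's setup guarantees.  */
static size_t
skip_null_free_blocks (const unsigned char *s)
{
  size_t off = 0;
  for (;;)
    {
      unsigned char min = 0xff;
      int i;
      for (i = 0; i < 64; i++)	/* The vminub merge tree, scalarized.  */
	if (s[off + i] < min)
	  min = s[off + i];
      if (min == 0)		/* vcmpequb. saw a null byte.  */
	return off;		/* L(vmx_zero) locates it within the block.  */
      off += 64;		/* addi r4,r4,64  */
    }
}

The unrolling just replicates this body three times so the backward
branch to L(loop_64b) is only taken once per 192 bytes.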
>
>> +
>> +L(vmx_zero):
>> + /* OK, we found a null byte. Let's look for it in the current 64-byte
>> + block and mark it in its corresponding VR. */
>> + vcmpequb v1,v1,v18
>> + vcmpequb v2,v2,v18
>> + vcmpequb v3,v3,v18
>> + vcmpequb v4,v4,v18
>> +
>> + /* We will now 'compress' the result into a single doubleword, so it
>> + can be moved to a GPR for the final calculation. First, we
>> + generate an appropriate mask for vbpermq, so we can permute bits into
>> + the first halfword. */
>
> I'm wondering (without having verified) if you can do something here akin to
> what's done in the "tail" sections below, using "vctzlsbb".
It does not help when the content spans more than one VR, and I don't
think there is much to improve on for the 64-bit mask reduction. That
said, we can save a couple of cycles below by using cnttzd (new in
ISA 3.0).
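
To make that concrete, here is a scalar C model of the 'compress' step
(hypothetical helper name; the patch does this with a prepared vbpermq
mask rather than a loop): each vcmpequb result contributes one bit per
byte, the four 16-bit pieces pack into a single 64-bit mask, and since
we only reach this point after finding a null, a count-trailing-zeros
of the mask (cnttzd) yields the byte offset of the first null:

#include <stdint.h>

/* Scalar model of the vbpermq 'compress'.  Bit i of mask is set when
   byte i of the 64-byte block is zero; the mask is known nonzero here,
   so count-trailing-zeros (cnttzd on POWER9) gives the offset of the
   first null within the block on little-endian.  */
static unsigned
first_null_offset (const unsigned char *block)
{
  uint64_t mask = 0;
  int i;
  for (i = 0; i < 64; i++)
    if (block[i] == 0)			/* vcmpequb vN,vN,v18  */
      mask |= (uint64_t) 1 << i;	/* vbpermq bit gather.  */
  return __builtin_ctzll (mask);	/* cnttzd  */
}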
Attachment: 0001-powerpc64le-add-optimized-strlen-for-P9.patch (text/x-patch, 10145 bytes)
URL: <https://sourceware.org/pipermail/libc-alpha/attachments/20200529/a55d3596/attachment.bin>