This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH] aarch64: thunderx2 memmove performance improvements
Thanks, that's much easier to follow. A few more comments:
@@ -103,188 +101,14 @@ ENTRY_ALIGN (MEMMOVE, 6)
+ add srcend, src, count
+ cmp count, 16
+ b.ls L(memcopy16)
sub tmp1, dstin, src
cmp count, 96
ccmp tmp1, count, 2, hi
That's 7 instructions (28 bytes), so the memcpy label ends up at an odd alignment.
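The alignment arithmetic behind this comment can be sketched in C. This is only an illustration: it assumes the entry point itself is aligned (ENTRY_ALIGN with 6 is taken here to mean 64 bytes) and that the label directly follows the counted instructions. The names label_offset and natural_alignment are hypothetical helpers, not part of glibc.

```c
#include <stddef.h>

/* Every A64 instruction is 4 bytes long. */
enum { A64_INSN_BYTES = 4 };

/* Byte offset of a label placed after insn_count instructions,
   measured from an aligned entry point (assumption: no padding). */
static size_t label_offset(size_t insn_count)
{
    return insn_count * A64_INSN_BYTES;
}

/* Largest power of two that divides a nonzero offset, i.e. the
   alignment the label gets "for free" relative to the entry. */
static size_t natural_alignment(size_t offset)
{
    return offset & (~offset + 1);
}
```

Under these assumptions, label_offset(7) is 28, which is only 4-byte aligned, whereas 8 instructions would put the label at offset 32.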
+ str B_q, [dst, 16]
stp C_q, D_q, [dst, 32]
- str A_q, [dst, 64]
+ str G_q, [dst, 64]
+ str A_q, [dstin]
+ str E_q, [dstend, -16]
You can use STP rather than STR for B and G by changing the STP of C/D:
pair B with C at [dst, 16] and D with G at [dst, 48].
+ /* Write the last full set of 64 bytes. The remainder is at most 64
+ bytes, so it is safe to always copy 64 bytes from the start even if
+ there is just 1 byte left. */
+ ldr G_q, [src, 48]
+ str B_q, [dstend, -16]
+ ldr B_q, [src, 32]
+ str A_q, [dstend, -32]
+ ldr A_q, [src, 16]
+ str D_q, [dstend, -48]
+ ldr D_q, [src]
+ str C_q, [dstend, -64]
+ str G_q, [dstin, 48]
+ str B_q, [dstin, 32]
+ str A_q, [dstin, 16]
+ str D_q, [dstin]
3 LDP + 3 STP opportunities here.
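The trick described in the patch comment above (always copying a full 64 bytes from the start, even if only 1 byte is left) can be modelled in C. This is a rough sketch, not the patch's code: copy_backward_blocks is a hypothetical name, the temporary buffers stand in for the vector registers, and it assumes count >= 64 and that dst does not start below src when the buffers overlap.

```c
#include <stddef.h>
#include <string.h>

enum { BLOCK = 64 };

/* Hypothetical model of a backward block copy. Assumes n >= BLOCK and
   that dst is not below src inside an overlapping region. */
static void copy_backward_blocks(unsigned char *dst,
                                 const unsigned char *src, size_t n)
{
    unsigned char head[BLOCK], tmp[BLOCK];
    size_t left = n;

    /* Load the first block before any store can overwrite it. */
    memcpy(head, src, BLOCK);

    /* Copy whole blocks from the end, loading each block fully
       before storing it. */
    while (left > BLOCK) {
        memcpy(tmp, src + left - BLOCK, BLOCK);
        memcpy(dst + left - BLOCK, tmp, BLOCK);
        left -= BLOCK;
    }

    /* Between 1 and 64 bytes remain at the start. Storing the whole
       first block covers them; any overlap with the previous store
       rewrites the same bytes with the same values. */
    memcpy(dst, head, BLOCK);
}
```

The point of the sketch is that the tail needs no length-dependent branches: the final store is always a fixed 64 bytes.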
>> The idea of unrolling 2x is to only have a single loop branch. Using a single branch
>> makes it easier to remove all the writebacks too - you only need 2 rather than 8!
> The idea was to minimize the length of the loop tail. For a branchless loop
> that handles 128 bytes per iteration, the branchless tail needs to read 128
> bytes. For a branched loop like the one above, the tail processes only 64
> bytes. And I don't see how I can avoid writebacks of up to 127 bytes
> in the branchless tails for your version.
Well, you can check whether you need to process more than 64 bytes using separate
code before the loop or in the tail. Jumping into the 2nd half of the loop may be possible
if you want to save code (but then again this memcpy is already quite large...).
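One way to read this suggestion, sketched in C rather than assembly: keep a single-branch loop that handles 128 bytes per iteration, then decide in the tail whether one more 64-byte block is needed before the final fixed-size copy from the end. This is an assumption about the intended shape, not code from the patch. copy_forward_unrolled is a hypothetical name, and the sketch assumes non-overlapping buffers and n >= 64.

```c
#include <stddef.h>
#include <string.h>

enum { CHUNK = 64 };

/* Hypothetical model: forward copy, non-overlapping buffers, n >= CHUNK. */
static void copy_forward_unrolled(unsigned char *dst,
                                  const unsigned char *src, size_t n)
{
    size_t off = 0;

    /* 2x unrolled body: 128 bytes per iteration, one loop branch. */
    while (n - off > 2 * CHUNK) {
        memcpy(dst + off, src + off, CHUNK);
        memcpy(dst + off + CHUNK, src + off + CHUNK, CHUNK);
        off += 2 * CHUNK;
    }

    /* Tail check: between 1 and 128 bytes remain. If more than 64,
       copy one extra block so at most 64 bytes are left. */
    if (n - off > CHUNK)
        memcpy(dst + off, src + off, CHUNK);

    /* Always copy the last 64 bytes from the end; this may overlap
       bytes already written, which is harmless here. */
    memcpy(dst + n - CHUNK, src + n - CHUNK, CHUNK);
}
```

In this form the tail reads at most 128 bytes only when more than 64 remain, at the cost of one extra compare and branch outside the loop.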