This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH] aarch64: thunderx2 memmove performance improvements
Thanks a lot for your comments and suggestions.
On 18.04.2019 18:22, Wilco Dijkstra wrote:
Thanks, that's much easier to follow. A few more comments:
@@ -103,188 +101,14 @@ ENTRY_ALIGN (MEMMOVE, 6)
+ add srcend, src, count
+ cmp count, 16
+ b.ls L(memcopy16)
sub tmp1, dstin, src
cmp count, 96
ccmp tmp1, count, 2, hi
That's 7 instructions, so the memcpy label ends up at an odd alignment.
I added .p2align before the memcpy entry to make it 16-byte aligned.
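For readers following the sub/cmp/ccmp sequence quoted above, here is a rough C model of the condition it appears to compute. It is only a sketch under assumptions: the branch that consumes the flags is not part of the quoted hunk, so the model assumes an unsigned "lower" branch follows, and the helper name takes_overlap_path is hypothetical. The immediate 2 in ccmp sets only the carry flag, so when count is not above 96 the result reads as "not lower".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the patch.  Models:
     sub  tmp1, dstin, src
     cmp  count, 96
     ccmp tmp1, count, 2, hi
   assuming the flags are then tested with an unsigned "lower" branch.  */
static bool takes_overlap_path(const void *dst, const void *src, size_t count)
{
    /* Unsigned difference: wraps to a huge value when dst is below src,
       so a single compare covers both "dst below src" and "far apart".  */
    uintptr_t diff = (uintptr_t)dst - (uintptr_t)src;

    /* Only large copies (count > 96) whose destination starts inside
       the source range [src, src + count) satisfy the condition.  */
    return count > 96 && diff < count;
}
```

With dst 16 bytes above src and count 128 this returns true; with dst below src the difference wraps and it returns false.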
+ str B_q, [dst, 16]
stp C_q, D_q, [dst, 32]
- str A_q, [dst, 64]
+ str G_q, [dst, 64]
+ str A_q, [dstin]
+ str E_q, [dstend, -16]
You can use STP rather than STR for B and G by changing the STP of C/D.
+ /* Write the last full set of 64 bytes. The remainder is at most 64
+ bytes, so it is safe to always copy 64 bytes from the start even if
+ there is just 1 byte left. */
+ ldr G_q, [src, 48]
+ str B_q, [dstend, -16]
+ ldr B_q, [src, 32]
+ str A_q, [dstend, -32]
+ ldr A_q, [src, 16]
+ str D_q, [dstend, -48]
+ ldr D_q, [src]
+ str C_q, [dstend, -64]
+ str G_q, [dstin, 48]
+ str B_q, [dstin, 32]
+ str A_q, [dstin, 16]
+ str D_q, [dstin]
3 LDP + 3 STP opportunities here.
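The comment in the hunk above relies on an overlapping final block: once the loop has copied whole 64-byte blocks from the end downwards, whatever is left (1 to 64 bytes) is covered by copying the first 64 bytes unconditionally. A minimal C sketch of that idea follows. The function name move_backward_blocks and the BLOCK constant are hypothetical, the sketch assumes count >= 64 and dst above src, and it reads the head before the loop to keep the C obviously correct for overlapping buffers, whereas the quoted assembly schedules its loads differently.

```c
#include <stddef.h>
#include <string.h>

enum { BLOCK = 64 };

/* Hypothetical sketch, not the patch itself.
   Preconditions (assumed): count >= BLOCK and dst > src.  */
static void move_backward_blocks(unsigned char *dst, const unsigned char *src,
                                 size_t count)
{
    unsigned char head[BLOCK], tmp[BLOCK];
    size_t left = count;

    /* Save the first block before any store can clobber it.  */
    memcpy(head, src, BLOCK);

    /* Copy whole blocks from the end downwards.  Each block is loaded
       in full before it is stored, and stores always land above the
       source bytes still to be read, because dst > src.  */
    while (left > BLOCK) {
        left -= BLOCK;
        memcpy(tmp, src + left, BLOCK);
        memcpy(dst + left, tmp, BLOCK);
    }

    /* Between 1 and BLOCK bytes remain at the start.  Writing a full
       block is safe even if only 1 byte is left: the extra bytes are
       rewritten with the same values the loop already stored.  */
    memcpy(dst, head, BLOCK);
}
```

For count = 65 the loop copies bytes 1..64 and the final store rewrites bytes 0..63, so 63 bytes are written twice in exchange for having no length-dependent branches in the tail.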
The idea of unrolling 2x is to only have a single loop branch. Using a single branch
makes it easier to remove all the writebacks too - you only need 2 rather than 8!
The idea was to minimize the length of the loop tail. For a branchless
loop that copies 128 bytes per iteration, the branchless tail needs to
read 128 bytes. For a branched loop like the one above, the tail
processes only 64 bytes. And I don't see how I can avoid writebacks of
up to 127 bytes in the branchless tails for your version.
Well you can check whether you need to process more than 64 bytes using separate
code before the loop or in the tail. Jumping into the 2nd half of the loop may be possible
if you want to save code (but then again this memcpy is already quite large...).
The check inside the loop is free, as it
is done while the data is being brought in from memory.
I can move the check to the prologue/epilogue, and while this
can still be free, the code path will look less clear, as I
need to handle 2 "asymmetric" cases in the epilogue (if
the excessive writebacks are to be avoided): one for <64
bytes and the other for >=64 bytes. This also makes the tails
I can probably merge the tails, making the longer case
fall through to the shorter one, but this makes things even
As there is no real performance or clarity benefit from using
a single-branch loop, I am inclined to leave the 128-byte loop
as it is now. Do you think this is reasonable, or am I missing