This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] aarch64: thunderx2 memmove performance improvements


Wilco,

Thanks a lot for your comments and suggestions.

On 18.04.2019 18:22, Wilco Dijkstra wrote:
Hi Anton,

Thanks, that's much easier to follow. A few more comments:

@@ -103,188 +101,14 @@ ENTRY_ALIGN (MEMMOVE, 6)
  	DELOUSE (1)
  	DELOUSE (2)
+	add	srcend, src, count
+	cmp	count, 16
+	b.ls	L(memcopy16)
  	sub	tmp1, dstin, src
  	cmp	count, 96
  	ccmp	tmp1, count, 2, hi
  	b.lo	L(move_long)

That's 7 instructions, so the memcpy label ends up at an odd alignment.
I added .p2align before the memcpy entry to make it 16 bytes
aligned.
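For reference, the sub/ccmp sequence in the hunk above implements the usual unsigned overlap test that picks the copy direction. A minimal C sketch of that logic (hypothetical names, not the glibc code):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: the unsigned difference dst - src is compared against count,
   so one comparison decides whether a forward copy would clobber the
   source (the backward "move_long" path in the assembly).  */
static void sketch_memmove(unsigned char *dst, const unsigned char *src,
                           size_t count)
{
    if ((uintptr_t)(dst - src) >= count) {
        /* No destructive overlap: copy forward.  */
        for (size_t i = 0; i < count; i++)
            dst[i] = src[i];
    } else {
        /* dst lies inside [src, src + count): copy backward.  */
        for (size_t i = count; i-- > 0; )
            dst[i] = src[i];
    }
}
```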

+	str	B_q, [dst, 16]
  	stp	C_q, D_q, [dst, 32]
-	str     A_q, [dst, 64]
+	str	G_q, [dst, 64]
+	str	A_q, [dstin]
+	str	E_q, [dstend, -16]

You can use STP rather than STR for B and G by changing the STP of C/D.
Done.

+	/* Write the last full set of 64 bytes.  The remainder is at most 64
+	   bytes, so it is safe to always copy 64 bytes from the start even if
+	   there is just 1 byte left.  */
+2:
+	ldr	G_q, [src, 48]
+	str	B_q, [dstend, -16]
+	ldr	B_q, [src, 32]
+	str	A_q, [dstend, -32]
+	ldr	A_q, [src, 16]
+	str	D_q, [dstend, -48]
+	ldr	D_q, [src]
+	str	C_q, [dstend, -64]
+	str	G_q, [dstin, 48]
+	str	B_q, [dstin, 32]
+	str	A_q, [dstin, 16]
+	str	D_q, [dstin]
+3:	ret

3 LDP + 3 STP opportunities here.
Done.
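The "safe to always copy 64 bytes" trick in the comment above can be sketched in C (hypothetical helper, not the actual glibc routine): the final block is anchored at the end of the buffer, so a remainder of 1 to 64 bytes is covered without branching on its exact size.

```c
#include <stddef.h>
#include <string.h>

/* Sketch of the overlapping-tail technique for count > 64 and
   non-overlapping buffers: the last 64-byte store ends exactly at
   dst + count, re-writing some already-copied bytes instead of
   branching on the leftover size.  */
static void copy_with_overlapping_tail(unsigned char *dst,
                                       const unsigned char *src,
                                       size_t count)
{
    size_t i = 0;
    /* Full 64-byte blocks while more than 64 bytes remain.  */
    for (; count - i > 64; i += 64)
        memcpy(dst + i, src + i, 64);
    /* Unconditional final block aligned to the end of the buffer.  */
    memcpy(dst + count - 64, src + count - 64, 64);
}
```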


The idea of unrolling 2x is to only have a single loop branch. Using a single branch
makes it easier to remove all the writebacks too - you only need 2 rather than 8!
The idea was to minimize the length of the loop tail. For a
branchless loop copying 128 bytes per iteration, the tail needs to
read 128 bytes; for a branched loop like the one above, the tail
processes only 64 bytes. And I don't see how I can avoid writebacks
of up to 127 bytes in the branchless tails in your version.

Well you can check whether you need to process more than 64 bytes using separate
code before the loop or in the tail. Jumping into the 2nd half of the loop may be possible
if you want to save code (but then again this memcpy is already quite large...).
The check inside the loop is free, as it is done while the data are
being brought in from memory. I can move the check to the
prologue/epilogue, and while this can still be free, the code path
will look less clear: to avoid the excessive writebacks I need to
handle two asymmetric cases in the epilogue, one for <64 bytes and
the other for >=64 bytes. This also makes the tails longer.
I could probably merge the tails so that the longer case falls
through to the shorter one, but that makes things even less clear.
As there is no real performance or clarity benefit to the
single-branch loop, I am inclined to leave the 128-byte loop as it
is now. Do you think this is reasonable, or am I missing something?
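For context, the 2x-unrolled shape being discussed can be sketched in C (hypothetical function; assumes count >= 128 and non-overlapping buffers). Note that with a single loop branch per 128 bytes, a branchless tail then has to cover up to 127 leftover bytes, as discussed above:

```c
#include <stddef.h>
#include <string.h>

/* Sketch of a 2x-unrolled copy loop: the two 64-byte halves are
   unrolled, so there is one loop branch and one index update
   ("writeback") per 128 bytes instead of one per 64-byte block.  */
static void copy_unrolled_2x(unsigned char *dst, const unsigned char *src,
                             size_t count)
{
    size_t i = 0;
    for (; count - i >= 128; i += 128) {
        memcpy(dst + i, src + i, 64);
        memcpy(dst + i + 64, src + i + 64, 64);
    }
    /* Branchless tail: up to 127 bytes may remain, so re-copy the
       last 128 bytes unconditionally, overlapping earlier stores.  */
    memcpy(dst + count - 128, src + count - 128, 128);
}
```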

--
  Thanks,
  Anton

