This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] aarch64: thunderx2 memmove performance improvements


Hi Anton,

Thanks, that's much easier to follow. A few more comments:

@@ -103,188 +101,14 @@ ENTRY_ALIGN (MEMMOVE, 6)
 	DELOUSE (1)
 	DELOUSE (2)
 
+	add	srcend, src, count
+	cmp	count, 16
+	b.ls	L(memcopy16)
 	sub	tmp1, dstin, src
 	cmp	count, 96
 	ccmp	tmp1, count, 2, hi
 	b.lo	L(move_long)

That's 7 instructions (28 bytes), so the memcpy label ends up at an odd alignment.

+	str	B_q, [dst, 16]
 	stp	C_q, D_q, [dst, 32]
-	str     A_q, [dst, 64]
+	str	G_q, [dst, 64]
+	str	A_q, [dstin]
+	str	E_q, [dstend, -16]

You can use STP rather than STR for B and G by changing the STP of C/D.
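
As an untested sketch of what I mean, assuming B/C and D/G hold consecutive 16-byte chunks (which the offsets 16/32/48/64 in the hunk above imply) and using the register aliases from the patch:

```asm
	stp	B_q, C_q, [dst, 16]	/* bytes 16..47: was STR B plus the C half of STP C/D */
	stp	D_q, G_q, [dst, 48]	/* bytes 48..79: was the D half of STP C/D plus STR G */
	str	A_q, [dstin]
	str	E_q, [dstend, -16]
```

That writes the same bytes to the same addresses with one fewer store instruction.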

+	/* Write the last full set of 64 bytes.  The remainder is at most 64
+	   bytes, so it is safe to always copy 64 bytes from the start even if
+	   there is just 1 byte left.  */
+2:
+	ldr	G_q, [src, 48]
+	str	B_q, [dstend, -16]
+	ldr	B_q, [src, 32]
+	str	A_q, [dstend, -32]
+	ldr	A_q, [src, 16]
+	str	D_q, [dstend, -48]
+	ldr	D_q, [src]
+	str	C_q, [dstend, -64]
+	str	G_q, [dstin, 48]
+	str	B_q, [dstin, 32]
+	str	A_q, [dstin, 16]
+	str	D_q, [dstin]
+3:	ret

There are 3 LDP + 3 STP opportunities here.

>> The idea of unrolling 2x is to only have a single loop branch. Using a single branch
>> makes it easier to remove all the writebacks too - you only need 2 rather than 8!
> The idea was to minimize the length of the loop tail. For branchless 128
> bytes per iteration loop the branchless tail needs to read 128 bytes.
> For the branched loop as the one above the tail processes only 64
> bytes. And I don't see how I can avoid writebacks of up to 127 bytes
> in the branchless tails for your version.

Well, you can check whether you need to process more than 64 bytes using separate
code, either before the loop or in the tail. Jumping into the 2nd half of the loop may
also be possible if you want to save code (but then again this memcpy is already quite large...).

Wilco
