This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
On 07/02/2017 11:01, Siddhesh Poyarekar wrote:
> On Tuesday 07 February 2017 06:12 PM, Wilco Dijkstra wrote:
>> I agree we want to avoid using conditional compilation as much as possible.
>> On the other hand duplication is a bad idea too, I've seen too many cases where
>> bugs were only fixed in one of the N duplicates.
>
> Sure, but then in that case the de-duplication must be done by
> identifying a logical code block and make that into a macro to override
> and not just arbitrarily inject hunks of code.  So in this case it could
> be alternate implementations of copy_long that is sufficient so #define
> COPY_LONG in both memcpy_generic and memcpy_thunderx and have the parent
> (memcpy.S) use that macro.  In fact, that might even end up making the
> code a bit nicer to read.
>
>> However I'm actually wondering whether we need an ifunc for this case.
>> For large copies from L2 I think adding a prefetch should be benign even on
>> cores that don't need it, so if the benchmarks confirm this we should consider
>> updating the generic memcpy.
>
> That is a call that ARM maintainers can take and is also another reason
> to separate the IFUNC infrastructure code from the thunderx change.

I checked only the memcpy change on an APM X-Gene 1 and the results seem to
show improvements for aligned input, at least for sizes shorter than 4 MB.
I would like to check on more ARMv8 chips, but it does seem to be a nice
improvement over the generic implementation.
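As a rough, standalone illustration of the IFUNC mechanism being discussed
(this is not glibc's internal multiarch machinery; the function and resolver
names below are made up), the idea is that the dynamic loader runs a resolver
once and binds all later calls to whichever implementation it returned:

/* Minimal ifunc sketch, NOT the glibc-internal IFUNC infrastructure.
   Names are hypothetical; needs GCC/binutils on an ELF target such as
   aarch64-linux-gnu.  */
#include <stddef.h>
#include <string.h>

/* Two candidate implementations.  */
static void *my_memcpy_generic (void *dst, const void *src, size_t n)
{
  return memcpy (dst, src, n);
}

static void *my_memcpy_prefetch (void *dst, const void *src, size_t n)
{
  /* Imagine a variant tuned with software prefetch here.  */
  return memcpy (dst, src, n);
}

/* Resolver: run once by the dynamic loader; its return value becomes the
   target of every later call to my_memcpy.  A real resolver would consult
   CPU identification instead of a constant.  */
static void *(*resolve_my_memcpy (void)) (void *, const void *, size_t)
{
  int cpu_wants_prefetch = 0;	/* placeholder for a real CPU check */
  return cpu_wants_prefetch ? my_memcpy_prefetch : my_memcpy_generic;
}

/* The ifunc symbol itself: no per-call branch, the choice is made at
   load time.  */
void *my_memcpy (void *dst, const void *src, size_t n)
  __attribute__ ((ifunc ("resolve_my_memcpy")));

glibc's own multiarch code wires this up through its internal macros and CPU
feature checks rather than the plain attribute, but the load-time selection
idea is the same.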
Attachment:
bench-memcpy-large.out
Description: Text document
Attachment:
bench-memcpy-large.patched
Description: Text document
diff --git a/sysdeps/aarch64/memcpy.S b/sysdeps/aarch64/memcpy.S
index 29af8b1..4742a01 100644
--- a/sysdeps/aarch64/memcpy.S
+++ b/sysdeps/aarch64/memcpy.S
@@ -158,10 +158,13 @@ L(copy96):
 
 	.p2align 4
 L(copy_long):
+	cmp	count, #32768
+	b.lo	L(copy_long_without_prefetch)
 	and	tmp1, dstin, 15
 	bic	dst, dstin, 15
 	ldp	D_l, D_h, [src]
 	sub	src, src, tmp1
+	prfm	pldl1strm, [src, 384]
 	add	count, count, tmp1	/* Count is now 16 too large.  */
 	ldp	A_l, A_h, [src, 16]
 	stp	D_l, D_h, [dstin]
@@ -169,7 +172,10 @@ L(copy_long):
 	ldp	C_l, C_h, [src, 48]
 	ldp	D_l, D_h, [src, 64]!
 	subs	count, count, 128 + 16	/* Test and readjust count.  */
-	b.ls	2f
+
+L(prefetch_loop64):
+	tbz	src, #6, 1f
+	prfm	pldl1strm, [src, 512]
 1:
 	stp	A_l, A_h, [dst, 16]
 	ldp	A_l, A_h, [src, 16]
@@ -180,12 +186,39 @@ L(copy_long):
 	stp	D_l, D_h, [dst, 64]!
 	ldp	D_l, D_h, [src, 64]!
 	subs	count, count, 64
-	b.hi	1b
+	b.hi	L(prefetch_loop64)
+	b	L(last64)
+
+L(copy_long_without_prefetch):
+
+	and	tmp1, dstin, 15
+	bic	dst, dstin, 15
+	ldp	D_l, D_h, [src]
+	sub	src, src, tmp1
+	add	count, count, tmp1	/* Count is now 16 too large.  */
+	ldp	A_l, A_h, [src, 16]
+	stp	D_l, D_h, [dstin]
+	ldp	B_l, B_h, [src, 32]
+	ldp	C_l, C_h, [src, 48]
+	ldp	D_l, D_h, [src, 64]!
+	subs	count, count, 128 + 16	/* Test and readjust count.  */
+	b.ls	L(last64)
+L(loop64):
+	stp	A_l, A_h, [dst, 16]
+	ldp	A_l, A_h, [src, 16]
+	stp	B_l, B_h, [dst, 32]
+	ldp	B_l, B_h, [src, 32]
+	stp	C_l, C_h, [dst, 48]
+	ldp	C_l, C_h, [src, 48]
+	stp	D_l, D_h, [dst, 64]!
+	ldp	D_l, D_h, [src, 64]!
+	subs	count, count, 64
+	b.hi	L(loop64)
 
 	/* Write the last full set of 64 bytes.  The remainder is at most 64
 	   bytes, so it is safe to always copy 64 bytes from the end even if
 	   there is just 1 byte left.  */
-2:
+L(last64):
 	ldp	E_l, E_h, [srcend, -64]
 	stp	A_l, A_h, [dst, 16]
 	ldp	A_l, A_h, [srcend, -48]
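For readers not fluent in AArch64 assembly, a rough C paraphrase of what the
patched L(copy_long) path does (hypothetical names, not the actual glibc
code) looks something like this:

/* Sketch only: copies of at least 32 KiB prefetch the source stream a few
   hundred bytes ahead, roughly once per 128-byte chunk; smaller "long"
   copies take the plain 64-byte-per-iteration loop.  */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PREFETCH_MIN	32768	/* only prefetch for copies >= 32 KiB */
#define PREFETCH_DIST	512	/* stay ~512 bytes ahead of the loads */

void
copy_long_sketch (void *dst, const void *src, size_t n)
{
  const unsigned char *s = src;
  unsigned char *d = dst;
  int prefetch = n >= PREFETCH_MIN;

  while (n >= 64)
    {
      /* Issue a streaming prefetch every other 64-byte iteration,
	 mirroring the "tbz src, #6" test in the assembly.  */
      if (prefetch && (((uintptr_t) s >> 6) & 1))
	__builtin_prefetch (s + PREFETCH_DIST, /* rw */ 0, /* locality */ 0);

      memcpy (d, s, 64);	/* stands in for the ldp/stp block */
      s += 64;
      d += 64;
      n -= 64;
    }
  if (n > 0)
    memcpy (d, s, n);		/* copy the tail */
}

The 32 KiB threshold and the 384/512-byte prefetch distances are the values
from the patch above; the once-per-128-bytes pacing corresponds to the tbz
test on bit 6 of the source pointer.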