This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [PATCH] aarch64: thunderx2 memmove performance improvements

From: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
To: Anton Youdkevitch <anton dot youdkevitch at bell-sw dot com>
Cc: "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>, nd <nd at arm dot com>
Date: Fri, 12 Apr 2019 18:03:09 +0000
Subject: Re: [PATCH] aarch64: thunderx2 memmove performance improvements

Hi Anton,

This looks like a good cleanup! A few comments and suggestions for
improvements:

0. The diff is quite large due to tab/space changes. Would it be possible to split
those whitespace changes into a separate patch?

1. There are a few cases where ldp or stp could be used but isn't, e.g.:

+	str	B_q, [dst], #16
+	ldp	H_q, I_q, [src], #32
+	str	C_q, [dst], #16

Why not use stp B_q, C_q, [dst], #32 instead of the two str instructions?

2. There are a lot of writeback instructions used in cases where this isn't
strictly required. Have you noticed this actually improves performance? Even
if required for the main loop, it is best to reduce them where possible:

+L(loop128_exit0):
+	ldp	F_q, G_q, [srcend, -64]
+	ldp	H_q, I_q, [srcend, -32]
+	stp	B_q, C_q, [dst], #32
+	stp	D_q, E_q, [dst], #32
+	stp	F_q, G_q, [dstend, -64]
+	stp	H_q, I_q, [dstend, -32]
+	ret

 L(loop128_exit1):
+	ldp	B_q, C_q, [srcend, -64]
+	ldp	D_q, E_q, [srcend, -32]
+	stp	F_q, G_q, [dst], #32
+	stp	H_q, I_q, [dst], #32
+	stp	B_q, C_q, [dstend, -64]
+	stp	D_q, E_q, [dstend, -32]
+	ret

Here dst is incremented twice but never used afterwards.
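
For example, an untested sketch of loop128_exit0 that uses immediate offsets
instead of post-index writeback. It assumes dst is dead after this sequence,
as in the code quoted above; loop128_exit1 would change the same way:

```asm
L(loop128_exit0):
	ldp	F_q, G_q, [srcend, -64]
	ldp	H_q, I_q, [srcend, -32]
	stp	B_q, C_q, [dst]		/* was [dst], #32 */
	stp	D_q, E_q, [dst, 32]	/* same address as before, no writeback */
	stp	F_q, G_q, [dstend, -64]
	stp	H_q, I_q, [dstend, -32]
	ret
```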

3. Missed optimization:

+L(dst_unaligned_tail):
+	ldp	C_q, D_q, [srcend, -64]
+	ldp	E_q, F_q, [srcend, -32]
+	stp	A_q, B_q, [dst], #32
+	stp	H_q, I_q, [dst], #32
+	add	dst, dst, tmp1
+	str	G_q, [dst, -16]
+	stp	C_q, D_q, [dstend, -64]
+	stp	E_q, F_q, [dstend, -32]
 	ret

Surely this could use str G_q, [dst, tmp1] if we change the writeback on the stp?
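
One untested way to do that, assuming tmp1 is a 64-bit register and dst is not
needed afterwards: let the first stp advance dst by 48, so the second stp and
the str can both address relative to it and the add goes away. The store
addresses and their order stay the same as in the quoted code:

```asm
L(dst_unaligned_tail):
	ldp	C_q, D_q, [srcend, -64]
	ldp	E_q, F_q, [srcend, -32]
	stp	A_q, B_q, [dst], #48	/* store at dst, then dst += 48 */
	stp	H_q, I_q, [dst, -16]	/* old dst + 32 */
	str	G_q, [dst, tmp1]	/* old dst + 48 + tmp1 */
	stp	C_q, D_q, [dstend, -64]
	stp	E_q, F_q, [dstend, -32]
	ret
```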

4. Unrolling can be more efficient:

 L(loop128):
+	ldp	F_q, G_q, [src], #32
+	ldp	H_q, I_q, [src], #32
+	stp	B_q, C_q, [dst], #32
+	stp	D_q, E_q, [dst], #32
+	subs	count, count, 64
+	b.lt	L(loop128_exit1)
+	ldp	B_q, C_q, [src], #32
+	ldp	D_q, E_q, [src], #32
+	stp	F_q, G_q, [dst], #32
+	stp	H_q, I_q, [dst], #32
+	subs	count, count, 64
+	b.ge	L(loop128)
+L(loop128_exit0):

The idea of unrolling 2x is to have only a single loop branch. Using a single
branch also makes it easier to remove most of the writebacks: you only need 2
rather than 8!
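
As a rough, untested sketch of what I mean: one writeback on src, one on dst,
and a single branch per 128 bytes. The loads and stores hit the same addresses
in the same order as the quoted loop. It assumes the loop entry condition, the
bias on count, and the exit code are adjusted for a 128-byte step, none of
which is shown here:

```asm
L(loop128):
	ldp	F_q, G_q, [src], #128	/* writeback 1: src += 128 */
	ldp	H_q, I_q, [src, -96]
	stp	B_q, C_q, [dst], #128	/* writeback 2: dst += 128 */
	stp	D_q, E_q, [dst, -96]
	ldp	B_q, C_q, [src, -64]
	ldp	D_q, E_q, [src, -32]
	stp	F_q, G_q, [dst, -64]
	stp	H_q, I_q, [dst, -32]
	subs	count, count, 128
	b.ge	L(loop128)
```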

Wilco

Follow-Ups:
- Re: [PATCH] aarch64: thunderx2 memmove performance improvements
  - From: Anton Youdkevitch
