This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] aarch64: thunderx2 memmove performance improvements
- From: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
- To: Anton Youdkevitch <anton dot youdkevitch at bell-sw dot com>
- Cc: "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>, nd <nd at arm dot com>
- Date: Fri, 12 Apr 2019 18:03:09 +0000
- Subject: Re: [PATCH] aarch64: thunderx2 memmove performance improvements
Hi Anton,
This looks like a good cleanup! A few comments and suggestions for
improvements:
0. The diff is quite large due to tab/space changes. Would it be possible to split
the whitespace changes into a separate patch?
1. There are a few cases where ldp or stp could be used but isn't, e.g.:
+ str B_q, [dst], #16
+ ldp H_q, I_q, [src], #32
+ str C_q, [dst], #16
Why not use a single stp B_q, C_q, [dst], #32 instead of the two str instructions?
2. There are a lot of writeback instructions used in cases where this isn't
strictly required. Have you noticed this actually improves performance? Even
if required for the main loop, it is best to reduce them where possible:
+L(loop128_exit0):
+ ldp F_q, G_q, [srcend, -64]
+ ldp H_q, I_q, [srcend, -32]
+ stp B_q, C_q, [dst], #32
+ stp D_q, E_q, [dst], #32
+ stp F_q, G_q, [dstend, -64]
+ stp H_q, I_q, [dstend, -32]
+ ret
L(loop128_exit1):
+ ldp B_q, C_q, [srcend, -64]
+ ldp D_q, E_q, [srcend, -32]
+ stp F_q, G_q, [dst], #32
+ stp H_q, I_q, [dst], #32
+ stp B_q, C_q, [dstend, -64]
+ stp D_q, E_q, [dstend, -32]
+ ret
Here dst is incremented twice, but its value is never used afterwards.
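For example, the first exit path could use plain immediate offsets instead of
post-index writeback. This is an untested sketch; it assumes nothing after the
label reads dst (the path ends in ret):

    L(loop128_exit0):
        ldp F_q, G_q, [srcend, -64]
        ldp H_q, I_q, [srcend, -32]
        stp B_q, C_q, [dst]          /* same address as before, no writeback */
        stp D_q, E_q, [dst, 32]      /* old second store address: dst + 32 */
        stp F_q, G_q, [dstend, -64]
        stp H_q, I_q, [dstend, -32]
        ret

The same change would apply to L(loop128_exit1) with the register pairs swapped.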
3. Missed optimization:
+L(dst_unaligned_tail):
+ ldp C_q, D_q, [srcend, -64]
+ ldp E_q, F_q, [srcend, -32]
+ stp A_q, B_q, [dst], #32
+ stp H_q, I_q, [dst], #32
+ add dst, dst, tmp1
+ str G_q, [dst, -16]
+ stp C_q, D_q, [dstend, -64]
+ stp E_q, F_q, [dstend, -32]
ret
Surely this could use str G_q, [dst, tmp1] if we change the writeback on the stp?
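One way to do that, as an untested sketch: let the first stp advance dst by 48
and address the second stp relative to the new value, so the add disappears.
This assumes tmp1 is a 64-bit register and that dst is not needed after the
tail:

    L(dst_unaligned_tail):
        ldp C_q, D_q, [srcend, -64]
        ldp E_q, F_q, [srcend, -32]
        stp A_q, B_q, [dst], #48     /* store at old dst, then dst += 48 */
        stp H_q, I_q, [dst, -16]     /* store at old dst + 32 */
        str G_q, [dst, tmp1]         /* old dst + 48 + tmp1, as before */
        stp C_q, D_q, [dstend, -64]
        stp E_q, F_q, [dstend, -32]
        ret

The store addresses and their order are the same as in the patch; only the
final value of dst differs.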
4. Unrolling can be more efficient:
L(loop128):
+ ldp F_q, G_q, [src], #32
+ ldp H_q, I_q, [src], #32
+ stp B_q, C_q, [dst], #32
+ stp D_q, E_q, [dst], #32
+ subs count, count, 64
+ b.lt L(loop128_exit1)
+ ldp B_q, C_q, [src], #32
+ ldp D_q, E_q, [src], #32
+ stp F_q, G_q, [dst], #32
+ stp H_q, I_q, [dst], #32
+ subs count, count, 64
+ b.ge L(loop128)
+L(loop128_exit0):
The idea of unrolling 2x is to only have a single loop branch. Using a single branch
makes it easier to remove all the writebacks too - you only need 2 rather than 8!
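A rough, untested sketch of what I mean, with one post-index access per pointer
and plain offsets everywhere else. It assumes B_q..E_q are preloaded before the
loop as in the patch, and that count is biased so that a full 128 bytes can be
loaded on every iteration; the setup before the loop and the tail after it would
need adjusting to match, which is not shown:

    L(loop128):
        ldp F_q, G_q, [src], #128    /* load at old src, then src += 128 */
        ldp H_q, I_q, [src, -96]
        stp B_q, C_q, [dst], #128    /* store at old dst, then dst += 128 */
        stp D_q, E_q, [dst, -96]
        ldp B_q, C_q, [src, -64]
        ldp D_q, E_q, [src, -32]
        stp F_q, G_q, [dst, -64]
        stp H_q, I_q, [dst, -32]
        subs count, count, 128
        b.ge L(loop128)

With this shape B_q..E_q always hold the pending data on exit, so only one exit
path is needed.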
Wilco