This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH v2] aarch64: thunderx2 memcpy branches reordering
- From: Anton Youdkevitch <anton dot youdkevitch at bell-sw dot com>
- To: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>
- Cc: nd <nd at arm dot com>
- Date: Fri, 22 Mar 2019 22:38:14 +0300
- Subject: Re: [PATCH v2] aarch64: thunderx2 memcpy branches reordering
- References: <AM6PR08MB50786EC639E01EBAFEB5CB3483430@AM6PR08MB5078.eurprd08.prod.outlook.com>
Wilco,
Thanks a lot for your comments.
On 3/22/2019 19:18, Wilco Dijkstra wrote:
Hi Anton,
Rewrote the branches in load and merge chunk
so that the order is more in line with the
most probable case.
Why not make the code more efficient as well? The loop has 2 redundant
moves and the code for each alignment is twice as large as it could be.
It's not a large amount of effort to do better, see below.
@@ -557,17 +558,9 @@ L(ext_size_ ## shft):;\
ext A_v.16b, C_v.16b, D_v.16b, 16-shft;\
ext B_v.16b, D_v.16b, E_v.16b, 16-shft;\
These instructions are already part of the main loop, so why not directly
branch into it rather than repeat them?
Mostly for clarity - branching inside a loop is not a very
obvious thing.
subs count, count, 32;\
- b.ge 2f;\
+ b.lt 2f;\
This branch is completely redundant since count is always larger than 32.
Yep, you are right. Thanks.
1:;\
stp A_q, B_q, [dst], #32;\
- ext H_v.16b, E_v.16b, F_v.16b, 16-shft;\
- ext I_v.16b, F_v.16b, G_v.16b, 16-shft;\
- stp H_q, I_q, [dst], #16;\
- add dst, dst, tmp1;\
- str G_q, [dst], #16;\
- b L(copy_long_check32);\
-2:;\
- stp A_q, B_q, [dst], #32;\
prfm pldl1strm, [src, MEMCPY_PREFETCH_LDR];\
ldp D_q, J_q, [src], #32;\
ext H_v.16b, E_v.16b, F_v.16b, 16-shft;\
@@ -579,8 +572,15 @@ L(ext_size_ ## shft):;\
ext B_v.16b, D_v.16b, J_v.16b, 16-shft;\
mov E_v.16b, J_v.16b;\
Redundant move in loop (2x). You might as well execute the next ext!
Yes, I might if I wasn't in the loop. As I constrained by the finite
number of register names I need a value in a particular
register. The fact that it is already in some other register does
not help me here. We have the window spanning 5 registers
and we load only 4 registers each iteration. So, how do I update
the fifth one?
I agree that 2 moves are redundant in this case, of course.
subs count, count, 64;\
- b.ge 2b;\
- b 1b;\
+ b.ge 1b;\
+2:;\
+ stp A_q, B_q, [dst], #32;\
+ ext H_v.16b, E_v.16b, F_v.16b, 16-shft;\
+ ext I_v.16b, F_v.16b, G_v.16b, 16-shft;\
These instructions appear in the loop already, why not break out of
the loop after executing them?
Again, this is only for clarity reasons.
+ stp H_q, I_q, [dst], #16;\
+ add dst, dst, tmp1;\
+ str G_q, [dst], #16;\
+ b L(copy_long_check32);\
These instructions are invariant with respect to alignment, so why repeat
these many times?
You are right, invariants should be hoisted. And they will.