This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH v2] aarch64: thunderx2 memcpy branches reordering


Wilco,

Thanks a lot for your comments.

On 3/22/2019 19:18, Wilco Dijkstra wrote:
Hi Anton,

Rewrote the branches in load and merge chunk
so that the order is more in line with the
most probable case.

Why not make the code more efficient as well? The loop has 2 redundant
moves and the code for each alignment is twice as large as it could be.
It's not a large amount of effort to do better, see below.

@@ -557,17 +558,9 @@ L(ext_size_ ## shft):;\
  	ext     A_v.16b, C_v.16b, D_v.16b, 16-shft;\
  	ext     B_v.16b, D_v.16b, E_v.16b, 16-shft;\

These instructions are already part of the main loop, so why not directly
branch into it rather than repeat them?
Mostly for clarity - branching inside a loop is not a very
obvious thing.

  	subs    count, count, 32;\
-	b.ge    2f;\
+	b.lt    2f;\

This branch is completely redundant since count is always larger than 32.
Yep, you are right. Thanks.

  1:;\
  	stp     A_q, B_q, [dst], #32;\
-	ext     H_v.16b, E_v.16b, F_v.16b, 16-shft;\
-	ext     I_v.16b, F_v.16b, G_v.16b, 16-shft;\
-	stp     H_q, I_q, [dst], #16;\
-	add     dst, dst, tmp1;\
-	str     G_q, [dst], #16;\
-	b       L(copy_long_check32);\
-2:;\
-	stp     A_q, B_q, [dst], #32;\
  	prfm    pldl1strm, [src, MEMCPY_PREFETCH_LDR];\
  	ldp     D_q, J_q, [src], #32;\
  	ext     H_v.16b, E_v.16b, F_v.16b, 16-shft;\
@@ -579,8 +572,15 @@ L(ext_size_ ## shft):;\
  	ext     B_v.16b, D_v.16b, J_v.16b, 16-shft;\
  	mov     E_v.16b, J_v.16b;\

Redundant move in loop (2x). You might as well execute the next ext!
Yes, I might if I wasn't in the loop. As I constrained by the finite
number of register names I need a value in a particular
register. The fact that it is already in some other register does
not help me here. We have the window spanning 5 registers
and we load only 4 registers each iteration. So, how do I update
the fifth one?

I agree that 2 moves are redundant in this case, of course.

  	subs    count, count, 64;\
-	b.ge    2b;\
-	b	1b;\
+	b.ge    1b;\
+2:;\
+	stp     A_q, B_q, [dst], #32;\
+	ext     H_v.16b, E_v.16b, F_v.16b, 16-shft;\
+	ext     I_v.16b, F_v.16b, G_v.16b, 16-shft;\

These instructions appear in the loop already, why not break out of
the loop after executing them?
Again, this is only for clarity reasons.

+	stp     H_q, I_q, [dst], #16;\
+	add     dst, dst, tmp1;\
+	str     G_q, [dst], #16;\
+	b       L(copy_long_check32);\

These instructions are invariant with respect to alignment, so why repeat
these many times?
You are right, invariants should be hoisted. And they will.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]