This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] aarch64: thunderx2 memmove performance improvements


Hi Anton,

>> The loop is used both for small copies and for the tail of
>> very large copies.  The small copies might well be in the
>> cache, while the large copies are prefetched.
>
> Technically you are right; perhaps I was not clear enough.
> What I meant was that loads/stores have latency several times
> higher than branches do. So even if the data are in the cache,
> the branch can be processed while the loads/stores are still
> in the pipeline.

OK, but if you're assuming the latency is high, then what benefit
does unrolling give here? If the loop is 16-byte aligned, instruction
fetch should be optimal.
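
To make that concrete, here is a rough C sketch (not the actual
assembly; the 16-byte memcpy chunks are only a stand-in for the
ldp/stp pairs, and any final tail under 16 bytes is ignored) of a
plain loop versus a 2x unrolled one:

#include <stddef.h>
#include <string.h>

/* One 16-byte chunk per iteration: one branch per chunk, but that
   branch is cheap next to the load/store latency described above.  */
static void copy_loop (unsigned char *dst, const unsigned char *src, size_t n)
{
  for (; n >= 16; n -= 16, src += 16, dst += 16)
    memcpy (dst, src, 16);
}

/* 2x unrolled: half as many branches, but the loads and stores still
   dominate if their latency is high, which is the point of the question.  */
static void copy_loop_unrolled (unsigned char *dst, const unsigned char *src,
                                size_t n)
{
  for (; n >= 32; n -= 32, src += 32, dst += 32)
    {
      memcpy (dst, src, 16);
      memcpy (dst + 16, src + 16, 16);
    }
  if (n >= 16)
    memcpy (dst, src, 16);
}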

>> It's reasonable for now (and Szabolcs already approved your
>> latest version). But it is feasible to improve further given that the
>> memmove loop does 64 bytes per iteration, so if that is fast enough
>> then that may be a simpler way to handle this loop too.
> OK, I will see what I can do.
>
> BTW, the loop in question does 128 bytes per iteration, doesn't it?

The memmove loop always does 64 bytes per iteration for the
overlap case:

+L(move_long):
...
+	.p2align 4
+1:
+	subs	count, count, 64
+	stp	A_q, B_q, [dstend, -32]
+	ldp	A_q, B_q, [srcend, -32]
+	stp	C_q, D_q, [dstend, -64]!
+	ldp	C_q, D_q, [srcend, -64]!
+	b.hi	1b
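
In C terms it is roughly the following (a sketch only: the function
name, the temporary buffer and the omitted head/tail handling are
mine, and the real loop also software-pipelines the loads and stores,
storing the q-registers filled on the previous iteration):

#include <stddef.h>
#include <string.h>

/* Backwards copy for the overlapping case: walk down from the ends of
   both buffers, moving 64 bytes per iteration as the loop above does.  */
static void move_long_tail (unsigned char *dstend, const unsigned char *srcend,
                            size_t count)
{
  unsigned char tmp[64];

  while (count > 64)
    {
      srcend -= 64;
      memcpy (tmp, srcend, 64);     /* stands in for the ldp q-register pairs */
      dstend -= 64;
      memcpy (dstend, tmp, 64);     /* stands in for the stp q-register pairs */
      count -= 64;
    }
}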

Wilco
    
