This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] aarch64: thunderx2 memmove performance improvements
- From: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
- To: Anton Youdkevitch <anton dot youdkevitch at bell-sw dot com>
- Cc: "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>, nd <nd at arm dot com>
- Date: Tue, 7 May 2019 15:07:37 +0000
- Subject: Re: [PATCH] aarch64: thunderx2 memmove performance improvements
- References: <AM6PR08MB5078AC4DC3EB3DDD09A5960583280@AM6PR08MB5078.eurprd08.prod.outlook.com> <20190415145311.GA14156@bell-sw.com> <AM6PR08MB5078F0BC2FA04013A2C5AE9283240@AM6PR08MB5078.eurprd08.prod.outlook.com> <eb76ffed-92db-59c0-015d-aea36530fc8c@bell-sw.com> <DB6PR0801MB21189BE4DA5751E84563DE09833B0@DB6PR0801MB2118.eurprd08.prod.outlook.com> <5CCBFDA0.8020405@bell-sw.com>
Hi Anton,
>> The loop is used both for small copies and for the tail of
>> very large copies. The small copies might well be in the
>> cache, while the large copies are prefetched.
>
> Technically you are right; perhaps I was not clear enough.
> What I meant was that loads/stores have several times the
> latency of branches. So, even if the data are in the cache,
> the branches can be processed while the loads/stores are
> still in the pipeline.
OK, but if you're assuming the latency is high, then what benefit
does unrolling give here? If the loop is 16-byte aligned, instruction
fetch should already be optimal.
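
For concreteness, the two shapes under discussion look roughly like
this in C (purely illustrative, not the actual code, which is
hand-written assembly using the 128-bit q registers; the function
names are made up):

#include <stddef.h>
#include <string.h>

/* Plain loop: one 16-byte block per iteration.  The point above is
   that the compare-and-branch can be handled while the load/store
   are still in flight, so the branch is not the bottleneck.  */
static void
copy_fwd_16 (unsigned char *dst, const unsigned char *src, size_t count)
{
  while (count >= 16)
    {
      memcpy (dst, src, 16);
      dst += 16;
      src += 16;
      count -= 16;
    }
}

/* 2x unrolled: 32 bytes per iteration and half as many branches,
   which only pays off if the loop overhead actually matters.  */
static void
copy_fwd_32 (unsigned char *dst, const unsigned char *src, size_t count)
{
  while (count >= 32)
    {
      memcpy (dst, src, 16);
      memcpy (dst + 16, src + 16, 16);
      dst += 32;
      src += 32;
      count -= 32;
    }
}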
>> It's reasonable for now (and Szabolcs already approved your
>> latest version). But it is feasible to improve further given that the
>> memmove loop does 64 bytes per iteration, so if that is fast enough
>> then that may be a simpler way to handle this loop too.
> OK, I will see what I can do.
>
> BTW, the loop in question does 128 bytes per iteration, doesn't it?
The memmove loop always does 64 bytes per iteration for the
overlap case:
+L(move_long):
...
+ .p2align 4
+1:
+ subs count, count, 64
+ stp A_q, B_q, [dstend, -32]
+ ldp A_q, B_q, [srcend, -32]
+ stp C_q, D_q, [dstend, -64]!
+ ldp C_q, D_q, [srcend, -64]!
+ b.hi 1b
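
In C terms that loop is roughly the following (just a sketch: it
ignores the software pipelining, i.e. that each stp stores the quads
loaded on the previous iteration, and it ignores the handling of the
final <= 64 bytes after the loop; the function name is made up):

#include <stddef.h>
#include <string.h>

static void
move_long_tail (unsigned char *dstend, const unsigned char *srcend,
                size_t count)
{
  unsigned char q[64];               /* stand-in for A_q..D_q        */

  while (count > 64)
    {
      memcpy (q, srcend - 64, 64);   /* ldp A_q..D_q, [srcend, -64]! */
      memcpy (dstend - 64, q, 64);   /* stp A_q..D_q, [dstend, -64]! */
      srcend -= 64;
      dstend -= 64;
      count -= 64;
    }
  /* The remaining (at most 64) bytes are stored after the loop.  */
}

Reading each 64-byte chunk completely before writing it is what makes
this safe for the overlapping (dst above src) case.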
Wilco