This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] aarch64: thunderx2 memmove performance improvements
- From: Anton Youdkevitch <anton dot youdkevitch at bell-sw dot com>
- To: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
- Cc: "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>, nd <nd at arm dot com>
- Date: Fri, 3 May 2019 11:36:48 +0300
- Subject: Re: [PATCH] aarch64: thunderx2 memmove performance improvements
- References: <AM6PR08MB5078AC4DC3EB3DDD09A5960583280@AM6PR08MB5078.eurprd08.prod.outlook.com> <20190415145311.GA14156@bell-sw.com> <AM6PR08MB5078F0BC2FA04013A2C5AE9283240@AM6PR08MB5078.eurprd08.prod.outlook.com> <eb76ffed-92db-59c0-015d-aea36530fc8c@bell-sw.com> <DB6PR0801MB21189BE4DA5751E84563DE09833B0@DB6PR0801MB2118.eurprd08.prod.outlook.com>
Wilco,
On 5/1/2019 17:59, Wilco Dijkstra wrote:
> Hi Anton,
>
>>>> The check inside the loop is free as it
>>>> is done while the data are being brought in from memory.
>>>
>>> The loop is used both for small copies and for the tail of
>>> very large copies. The small copies might well be in the
>>> cache, while the large copies are prefetched.
>>
>> Technically you are right; perhaps I was not clear enough.
>> What I meant was that loads/stores have a latency several times
>> higher than branches do. So, even if the data are in the cache,
>> the branch can be processed while the loads/stores are still
>> in the pipeline.
>>
>> I can move the check to the prologue/epilogue, and while it can
>> still be free there, the code path will look less clear, as I
>> would need to handle two "asymmetric" cases in the epilogue (if
>> excessive writebacks are to be avoided): one for <64 bytes and
>> the other for >=64 bytes. This also makes the tails longer.
>>
>> I could probably merge the tails by making the longer case fall
>> through to the shorter one, but that makes things even less clear.
>>
>> As there is no real performance or clarity benefit to using a
>> single-branch loop, I am inclined to leave the 128-byte loop as
>> it is now. Do you think this is reasonable, or am I missing
>> something?
>
> It's reasonable for now (and Szabolcs already approved your
> latest version). But it is feasible to improve this further, given
> that the memmove loop does 64 bytes per iteration; if that is fast
> enough, then that may be a simpler way to handle this loop too.
OK, I will see what I can do.
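
To be concrete about the shape I was describing above (the check
kept inside the loop), here is a minimal C sketch of how I think of
it. This is illustrative only, not the actual assembly from the
patch: the function name, the memcpy-based chunk copy and the
128-byte chunking are assumptions for the example. The point is just
that the per-iteration branch depends only on the remaining count,
so it can be resolved while the chunk's loads and stores are still
in flight.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch only: copy 128 bytes per iteration and keep the
   remaining-size check inside the loop.  The exit branch depends only
   on N, not on the loaded data, so an out-of-order core can resolve
   it while the loads/stores of the current chunk are still pending.  */
static void
copy_fwd_sketch (uint8_t *dst, const uint8_t *src, size_t n)
{
  while (n >= 128)              /* the check inside the loop */
    {
      uint8_t chunk[128];
      memcpy (chunk, src, 128); /* stands in for the load group  */
      memcpy (dst, chunk, 128); /* stands in for the store group */
      src += 128;
      dst += 128;
      n -= 128;
    }

  if (n != 0)                   /* remaining tail of 1..127 bytes */
    {
      uint8_t tail[128];
      memcpy (tail, src, n);
      memcpy (dst, tail, n);
    }
}
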
BTW, the loop in question does 128 bytes per iteration, doesn't it?
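
If the check were instead moved out of the loop, my reading of the
epilogue (the two "asymmetric" cases I mentioned above) would be
roughly the sketch below. Again, this is hypothetical C rather than
the patch code; it only shows the 0..127-byte remainder left over by
a 128-bytes-per-iteration loop, with the >=64 case falling through
into the short tail.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical epilogue for the variant where the <64 / >=64 split is
   made after the loop.  N is the 0..127-byte remainder; the first
   case copies one 64-byte block and then falls through into the
   short tail.  */
static void
copy_tail_sketch (uint8_t *dst, const uint8_t *src, size_t n)
{
  if (n >= 64)                  /* case 1: 64..127 bytes remain */
    {
      uint8_t block[64];
      memcpy (block, src, 64);
      memcpy (dst, block, 64);
      src += 64;
      dst += 64;
      n -= 64;
    }

  /* ...falls through into the short case.  */
  if (n != 0)                   /* case 2: 1..63 bytes remain */
    {
      uint8_t tail[64];
      memcpy (tail, src, n);
      memcpy (dst, tail, n);
    }
}

Even merged like this, the epilogue is longer than the single tail
after the current loop, which is the clarity cost I was referring to.
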
--
Thanks,
Anton