[PATCH v3 5/5] AArch64: Improve A64FX memset
Wilco Dijkstra
Wilco.Dijkstra@arm.com
Mon Aug 9 14:52:37 GMT 2021
Hi Naohiro,
> Reverting unroll8 logic to V3 Part 4 fixed 16KB dip [4].
> See the comparison graphs between the master and V3 Part 5 fixed [4][5][6].
I don't see an improvement from the old unroll8 loop: there is about a 2%
benefit at 16KB, but all other sizes become slower, and at 1KB it is 50%
slower. I tried some other variations, and moving the SUBS to the end
of the loop appears slightly better overall, so I've done that for the v4 patch.
Cheers,
Wilco