[PATCH v3 5/5] AArch64: Improve A64FX memset

Wilco Dijkstra Wilco.Dijkstra@arm.com
Mon Aug 9 14:52:37 GMT 2021


Hi Naohiro,

> Reverting unroll8 logic to V3 Part 4 fixed 16KB dip [4].
> See the comparison graphs between the master and V3 Part 5 fixed [4][5][6].

I don't see an improvement from the old unroll8 loop: there is about a 2%
benefit at 16KB, but all other sizes become slower, and at 1KB it is 50%
slower. I tried some other variations, and moving the SUBS to the end
of the loop appears slightly better overall, so I've done that for the v4 patch.
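
To illustrate the loop shape only, here is a rough C sketch of an 8x unrolled
fill loop with the counter update placed last, directly before the branch.
This is not the A64FX assembly from the patch: the function name
memset_unroll8, the 64-byte BLOCK size, and the use of memset() as a stand-in
for a single vector store are all assumptions made for the example, and a C
compiler is free to schedule the decrement differently.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK  64   /* assumed bytes per vector store (hypothetical) */
#define UNROLL 8    /* stores per loop iteration */

/* Hypothetical helper: fill n bytes at dst with the byte value c. */
static void *memset_unroll8(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    const size_t chunk = (size_t)BLOCK * UNROLL;

    if (n >= chunk) {
        do {
            /* Eight stores; each memset stands in for one vector store. */
            for (int i = 0; i < UNROLL; i++)
                memset(p + (size_t)i * BLOCK, c, BLOCK);
            p += chunk;
            /* Counter update last, analogous to placing SUBS at the end
               of the loop so the branch tests the result directly. */
            n -= chunk;
        } while (n >= chunk);
    }

    /* Tail: fewer than one full chunk remains. */
    memset(p, c, n);
    return dst;
}
```

Whether this placement helps is something only benchmarking on the target
can show, since it depends on how the core schedules the stores against the
flag-setting instruction.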

Cheers,
Wilco

