[PATCH 3/3] aarch64: Remove non-temporal load/stores from oryon-1's memset
Adhemerval Zanella
adhemerval.zanella@linaro.org
Mon Nov 18 20:45:40 GMT 2024
On Fri, Nov 15, 2024 at 12:05 AM Andrew Pinski <quic_apinski@quicinc.com> wrote:
>
> The hardware architects have a new recommendation not to use
> non-temporal load/stores for memset. This patch removes this path.
> I found there was no difference in the memset speed with/without
> non-temporal load/stores either.
>
> Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
LGTM, thanks.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
> ---
> sysdeps/aarch64/multiarch/memset_oryon1.S | 26 -----------------------
> 1 file changed, 26 deletions(-)
>
> diff --git a/sysdeps/aarch64/multiarch/memset_oryon1.S b/sysdeps/aarch64/multiarch/memset_oryon1.S
> index 6fa28a9bd0..b63c16ec51 100644
> --- a/sysdeps/aarch64/multiarch/memset_oryon1.S
> +++ b/sysdeps/aarch64/multiarch/memset_oryon1.S
> @@ -93,8 +93,6 @@ L(set_long):
> cmp count, 256
> ccmp valw, 0, 0, cs
> b.eq L(try_zva)
> - cmp count, #32768
> - b.hi L(set_long_with_nontemp)
> /* Small-size or non-zero memset does not use DC ZVA. */
> sub count, dstend, dst
>
> @@ -117,30 +115,6 @@ L(set_long):
> stp val, val, [dstend, -16]
> ret
>
> -L(set_long_with_nontemp):
> - /* Small-size or non-zero memset does not use DC ZVA. */
> - sub count, dstend, dst
> -
> - /* Adjust count and bias for loop. By subtracting extra 1 from count,
> - it is easy to use tbz instruction to check whether loop tailing
> - count is less than 33 bytes, so as to bypass 2 unnecessary stps. */
> - sub count, count, 64+16+1
> -
> -1: stnp val, val, [dst, 16]
> - stnp val, val, [dst, 32]
> - stnp val, val, [dst, 48]
> - stnp val, val, [dst, 64]
> - add dst, dst, #64
> - subs count, count, 64
> - b.hs 1b
> -
> - tbz count, 5, 1f /* Remaining count is less than 33 bytes? */
> - stnp val, val, [dst, 16]
> - stnp val, val, [dst, 32]
> -1: stnp val, val, [dstend, -32]
> - stnp val, val, [dstend, -16]
> - ret
> -
> L(try_zva):
> /* Write the first and last 64 byte aligned block using stp rather
> than using DC ZVA as it is faster. */
> --
> 2.43.0
>
More information about the Libc-alpha
mailing list