[PATCH v3 5/5] AArch64: Improve A64FX memset
naohirot@fujitsu.com
Thu Aug 26 01:44:02 GMT 2021
Hi Wilco,
> It's odd the behaviour with the same CPU isn't identical. If there is a way to make them
> behave more similarly, I would love to hear it! In any case it would be good to know
> how the blt workaround works on your system.
You can see the differences between FX1000 and FX700 (== Apollo 80) in [1].
At least the number of cores and the clock frequency differ.
The blt workaround worked only on FX700, not on FX1000, as explained below.
[1] https://www.fujitsu.com/global/products/computing/servers/supercomputer/specifications/
> > I'm not sure if I understood what the workaround code looks like, is it like this?
>
> It just injects a single blt at the top of the loop and changes the sub before the
> loop to subs, so you get something like this:
>
> subs count, count, tmp1
> .p2align 4
> 1: b.lt last
>
> I can propose a patch for this workaround if it isn't clear.
If you agree with the cmp-and-branch workaround below (2 instructions at the
beginning of the loop), I'll submit a patch.
1) Result of the blt workaround (1 instruction at the beginning of the loop)
I tried two patterns:

        subs    count, count, tmp1
        .p2align 4
1:      b.lt    L(last)

and

        sub     count, count, tmp1
        .p2align 4
1:      cbnz    xzr, L(last)

Both patterns worked only on FX700, not on FX1000.
FX700 master vs v4fix 1 instruction [2]
FX700 v4 vs v4fix 1 instruction [3]
FX1000 master vs v4fix 1 instruction [4]
FX1000 v4 vs v4fix 1 instruction [5]
[2] https://drive.google.com/file/d/1IBsPYg2ia2t1YyMmaYVb7tFG89njO2aq/view?usp=sharing
[3] https://drive.google.com/file/d/1q44gqOWZvFhzKAe2di5y8EQrRwWkgxoU/view?usp=sharing
[4] https://drive.google.com/file/d/1P10oD0-WO8J5t7QiP7QwgqOqlAZ2I5hn/view?usp=sharing
[5] https://drive.google.com/file/d/1wKv-bPx20LgJyWl761gXiKLsPwhukEzx/view?usp=sharing
2) Result of the cmp and branch workaround (2 instructions at the beginning of the loop)
I tried two patterns:

        sub     count, count, tmp1
        .p2align 4
1:      subs    tmp2, xzr, xzr
        b.lt    1b

and

        sub     count, count, tmp1
        .p2align 4
1:      cmp     xzr, xzr
        b.ne    1b

Both patterns worked on both FX700 and FX1000.
FX700 master vs v4fix 2 instructions [6]
FX700 v4 vs v4fix 2 instructions [7]
FX1000 master vs v4fix 2 instructions [8]
FX1000 v4 vs v4fix 2 instructions [9]
[6] https://drive.google.com/file/d/1B-CsRGT1rJFQCMHja78DEflQ-JHxSkGf/view?usp=sharing
[7] https://drive.google.com/file/d/1KCriikc1jIKEKLFoaTV0jYTqhtvmbblh/view?usp=sharing
[8] https://drive.google.com/file/d/1sunelmZ30jpd_aeWKXu65XNkS9X_akWb/view?usp=sharing
[9] https://drive.google.com/file/d/1JaJG0I79VMSTGy2PqaZf1SILujE69Gi2/view?usp=sharing
Thanks.
Naohiro