[PATCH v3 5/5] AArch64: Improve A64FX memset

Thu Aug 26 01:44:02 GMT 2021

Hi Wilco,

> It's odd the behaviour with the same CPU isn't identical. If there is a way to make them
> behave more similarly, I would love to hear it! In any case it would be good to know
> how the blt workaround works on your system.

The specifications of FX1000 and FX700 (== Apollo 80) are compared in [1].
At a minimum, the number of cores and the clock frequency differ.

The blt workaround worked only on FX700, not on FX1000, as explained below.

[1] https://www.fujitsu.com/global/products/computing/servers/supercomputer/specifications/

> > I'm not sure if I understood what the workaround code looks like, is it like this?
> 
> It just injects a single blt at the top of the loop and changes the sub before the
> loop to subs, so you get something like this:
> 
>         subs    count, count, tmp1
>         .p2align 4
> 1:      b.lt    last
> 
> I can propose a patch for this workaround if it isn't clear.

If you agree with the cmp and branch workaround described below (2 instructions at the
beginning of the loop), I'll submit a patch.

1) Result of the blt workaround (1 instruction at the beginning of the loop)

I tried two patterns:

        subs    count, count, tmp1
        .p2align 4
1:      b.lt    L(last)

and

        sub     count, count, tmp1
        .p2align 4
1:      cbnz    xzr, L(last)

Both patterns worked only on FX700, not on FX1000.

FX700 master vs v4fix 1 instruction [2]
FX700 v4 vs v4fix 1 instruction [3]
FX1000 master vs v4fix 1 instruction [4]
FX1000 v4 vs v4fix 1 instruction [5]

[2] https://drive.google.com/file/d/1IBsPYg2ia2t1YyMmaYVb7tFG89njO2aq/view?usp=sharing
[3] https://drive.google.com/file/d/1q44gqOWZvFhzKAe2di5y8EQrRwWkgxoU/view?usp=sharing
[4] https://drive.google.com/file/d/1P10oD0-WO8J5t7QiP7QwgqOqlAZ2I5hn/view?usp=sharing
[5] https://drive.google.com/file/d/1wKv-bPx20LgJyWl761gXiKLsPwhukEzx/view?usp=sharing

2) Result of the cmp and branch workaround (2 instructions at the beginning of the loop)

I tried two patterns:

        sub     count, count, tmp1
        .p2align 4
1:      subs    tmp2, xzr, xzr
        b.lt    1b

and

        sub     count, count, tmp1
        .p2align 4
1:      cmp     xzr, xzr
        b.ne    1b

Both patterns worked on both FX700 and FX1000.

FX700 master vs v4fix 2 instructions [6]
FX700 v4 vs v4fix 2 instructions [7]
FX1000 master vs v4fix 2 instructions [8]
FX1000 v4 vs v4fix 2 instructions [9]

[6] https://drive.google.com/file/d/1B-CsRGT1rJFQCMHja78DEflQ-JHxSkGf/view?usp=sharing
[7] https://drive.google.com/file/d/1KCriikc1jIKEKLFoaTV0jYTqhtvmbblh/view?usp=sharing
[8] https://drive.google.com/file/d/1sunelmZ30jpd_aeWKXu65XNkS9X_akWb/view?usp=sharing
[9] https://drive.google.com/file/d/1JaJG0I79VMSTGy2PqaZf1SILujE69Gi2/view?usp=sharing
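
For reference, the branch in each 2-instruction pattern is never taken: xzr reads as
zero, so "subs tmp2, xzr, xzr" and "cmp xzr, xzr" both compute 0 - 0 and set N=0, Z=1,
C=1, V=0, while b.lt needs N != V and b.ne needs Z == 0. The sketch below is only a small
Python model of those flag rules to double-check that reasoning. The function names are
hypothetical and it is not part of glibc or of the patch. It says nothing about why the
extra instructions help on A64FX.

```python
MASK64 = (1 << 64) - 1  # model 64-bit X registers


def subs_flags(a, b):
    """Return (N, Z, C, V) as set by AArch64 `subs`/`cmp` for 64-bit a - b."""
    a &= MASK64
    b &= MASK64
    result = (a - b) & MASK64
    n = result >> 63                 # sign bit of the result
    z = int(result == 0)             # result is zero
    c = int(a >= b)                  # carry set means "no borrow"
    # signed overflow: operands differ in sign and the result's sign differs from a
    v = int(((a ^ b) & (a ^ result)) >> 63)
    return n, z, c, v


def taken_lt(flags):
    """b.lt is taken when N != V."""
    n, _, _, v = flags
    return n != v


def taken_ne(flags):
    """b.ne is taken when Z == 0."""
    return flags[1] == 0


# xzr always reads as zero, so both workaround patterns compute 0 - 0.
flags = subs_flags(0, 0)
print(flags, taken_lt(flags), taken_ne(flags))
```

With these rules, 0 - 0 gives (N, Z, C, V) = (0, 1, 1, 0), so neither "b.lt 1b" nor
"b.ne 1b" branches back and the loop body always runs next.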

Thanks.
Naohiro