[PATCH v3 5/5] AArch64: Improve A64FX memset

Tue Aug 24 07:56:15 GMT 2021

Hi Wilco,

> > In my environment, I don't have any performance degradation by reverting unroll8,
> > but 16KB performance improvement as shown in the graphs.
> 
> I still see a major regression at 1KB in the graph (it is larger relatively than the gain at 16KB),
> plus many smaller regressions between 2KB-8KB.

Are you talking about the regression between V4 and V4 fixed?
If so, I also observe it in my environment, as shown in graph [2].
But V4 fixed is not slower than the master, as shown in graph [1].

I think we are getting almost the same results, though not exactly the same, right?

> > The first graph [1] shows comparison the master with V4 fixed.
> > The second graph [2] shows comparison V4 with V4 fixed.
> > 
> > [1] https://drive.google.com/file/d/19og4ZhU9itzFAVXX8TIzlpgiiukiXQbp/view?usp=sharing
> > [2] https://drive.google.com/file/d/1wQgPU6GyRQ_Z8ibsGja-NfdKhN5bz7I9/view?usp=sharing 

> > In your environment, do you have any performance degradation by reverting unroll8?
> > If there is no disadvantage by reverting unroll8, why don't we revert it?
> 
> For me bench-memset shows a 50% regression with the unroll8 loop reverted plus
> many smaller regressions. So I don't think reverting is a good idea.

If the 50% regression in your environment is at 1KB, a regression at 1KB also happens
in my environment, as shown in graph [4], but it appears to be smaller than 50%.

Both your results and mine are real.
I don't think it is reasonable to make a decision based on results from only one environment.

As I explained at the bottom of this mail, the V4 code is tuned for Apollo 80 and FX700.
So we need to take FX1000 into account too.

> I tried "perf stat" and oddly enough this loop causes a lot of branch mispredictions.
> However if you add a branch at the top of the loop that is never taken (eg. blt and
> ensuring the sub above it sets the flags), it becomes faster than the best results so far.
> If you can reproduce that, it is probably the best workaround.

Does "it becomes faster than the best results so far" mean faster than the master?
I think we should use the master's performance as the baseline.
If the workaround is not at least as fast as the master at 16KB, where performance
peaks, reverting unroll8 is preferable.

I'm not sure I understand what the workaround code looks like. Is it like this?

L(unroll8):
        sub     count, count, tmp1
        .p2align 4
1:      subs    tmp2, xzr, xzr
        b.lt    1b
        st1b_unroll 0, 7
        add     dst, dst, tmp1
        subs    count, count, tmp1
        b.hi    1b
        add     count, count, tmp1

> > Is it HPE Apollo 80 System?
> > Or does ARM Company have an account to Fujitsu FX1000 or FX700?
> 
> It has 48 cores, that's all I know...

I think your environment must be Apollo 80 or FX700, which have 48 cores and 4 NUMA nodes.
An FX1000 master node has 52 cores and an FX1000 compute node has 50 cores.
The OS sees an FX1000 as if it had 8 NUMA nodes.
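
If it helps to confirm which machine you are on, the two numbers above can be read
from Linux directly. The following is only a minimal C sketch, not part of the patch.
The function names are hypothetical, and it assumes a Linux system that exposes NUMA
nodes as nodeN directories under /sys/devices/system/node and a libc that supports
_SC_NPROCESSORS_ONLN. The online CPU count it prints may not match the core counts
quoted above if some cores are reserved or offline.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Return 1 if name looks like a sysfs NUMA node directory ("node0", "node12"). */
int is_numa_node_name(const char *name)
{
    if (strncmp(name, "node", 4) != 0 || name[4] == '\0')
        return 0;
    for (const char *p = name + 4; *p != '\0'; p++)
        if (!isdigit((unsigned char)*p))
            return 0;
    return 1;
}

/* Count nodeN entries under dir; return -1 if dir cannot be opened. */
int count_numa_nodes(const char *dir)
{
    assert(dir != NULL);
    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;
    int n = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL)
        if (is_numa_node_name(e->d_name))
            n++;
    closedir(d);
    return n;
}

/* Print the online CPU count and the NUMA node count (assumed sysfs path). */
void print_topology(void)
{
    long cpus = sysconf(_SC_NPROCESSORS_ONLN);
    int nodes = count_numa_nodes("/sys/devices/system/node");
    printf("online CPUs: %ld, NUMA nodes: %d\n", cpus, nodes);
}
```

Calling print_topology() from main should be enough. If the description above is
right, Apollo 80 or FX700 would report 4 NUMA nodes and FX1000 would report 8.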

Thanks.
Naohiro