[PATCH v3 5/5] AArch64: Improve A64FX memset

Tue Aug 24 08:07:14 GMT 2021

Fixed a typo inline

> -----Original Message-----
> From: Libc-alpha <libc-alpha-bounces+naohirot=fujitsu.com@sourceware.org> On Behalf Of naohirot--- via Libc-alpha
> Sent: Tuesday, August 24, 2021 4:56 PM
> To: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> Cc: 'GNU C Library' <libc-alpha@sourceware.org>
> Subject: RE: [PATCH v3 5/5] AArch64: Improve A64FX memset
> 
> Hi Wilco,
> 
> > > In my environment, reverting unroll8 causes no performance degradation;
> > > it only improves 16KB performance, as shown in the graphs.
> >
> > I still see a major regression at 1KB in the graph (it is relatively larger than the gain at 16KB),
> > plus many smaller regressions between 2KB-8KB.
> 
> Are you talking about the regression between V4 and V4 fixed?
> If so, that is also observed in my environment as shown in the graph [2].
> But V4 fixed is not degraded compared with the master, as shown in the graph [1].
> 
> I think we are getting almost the same results as each other, but not exactly the same, right?
> 
> > > The first graph [1] compares the master with V4 fixed.
> > > The second graph [2] compares V4 with V4 fixed.
> > >
> > > [1] https://drive.google.com/file/d/19og4ZhU9itzFAVXX8TIzlpgiiukiXQbp/view?usp=sharing
> > > [2] https://drive.google.com/file/d/1wQgPU6GyRQ_Z8ibsGja-NfdKhN5bz7I9/view?usp=sharing
> 
> > > In your environment, do you see any performance degradation from reverting unroll8?
> > > If there is no disadvantage to reverting unroll8, why don't we revert it?
> >
> > For me bench-memset shows a 50% regression with the unroll8 loop reverted plus
> > many smaller regressions. So I don't think reverting is a good idea.
> 
> If the 50% regression in your environment is at 1KB, the regression at 1KB happens
> in my environment too as shown in the graph [4], but the rate seems less than 50%.
> 
"the graph [4]" should be "the graph [2]".

Thanks.
Naohiro

> Both your result and my result are true and real.
> I don't think it's rational to make a decision by looking at the result from only one environment.
> 
> As I explained at the bottom of this mail, V4 code is tuned for Apollo 80 and FX700.
> So we need to take FX1000 into account too.
> 
> > I tried "perf stat" and oddly enough this loop causes a lot of branch mispredictions.
> > However if you add a branch at the top of the loop that is never taken (eg. blt and
> > ensuring the sub above it sets the flags), it becomes faster than the best results so far.
> > If you can reproduce that, it is probably the best workaround.
> 
> Does "it becomes faster than the best results so far" mean faster than the master?
> I think we should use the master performance as the baseline.
> If the workaround is not at least as fast as the master at 16KB, where performance
> peaks, reverting unroll8 is preferable.
> 
> I'm not sure I understood what the workaround code looks like. Is it like this?
> 
> L(unroll8):
>         sub     count, count, tmp1
>         .p2align 4
> 1:      subs    tmp2, xzr, xzr
>         b.lt    1b
>         st1b_unroll 0, 7
>         add     dst, dst, tmp1
>         subs    count, count, tmp1
>         b.hi    1b
>         add     count, count, tmp1
> 
> > > Is it HPE Apollo 80 System?
> > > Or does Arm have an account on a Fujitsu FX1000 or FX700?
> >
> > It has 48 cores, that's all I know...
> 
> I think your environment must be Apollo 80 or FX700, which have 48 cores and 4 NUMA nodes.
> An FX1000 master node has 52 cores and an FX1000 compute node has 50 cores.
> The OS sees FX1000 as if it has 8 NUMA nodes.
> 
> Thanks.
> Naohiro
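
To tell which kind of A64FX system a benchmark ran on, the core and NUMA node counts quoted above can be checked from Linux sysfs. The following is only a minimal Python sketch and not part of the patch: the helper names parse_cpulist and numa_topology are made up for illustration, and it assumes the usual /sys/devices/system/node/nodeN/cpulist layout is present.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a sysfs cpulist string such as '0-11,24' into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty cpulist (a node with no CPUs)
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_topology(root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]} read from sysfs, or {} if it is unavailable."""
    topo = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*")):
        match = re.fullmatch(r"node(\d+)", os.path.basename(path))
        if not match:
            continue
        try:
            with open(os.path.join(path, "cpulist")) as f:
                topo[int(match.group(1))] = parse_cpulist(f.read())
        except OSError:
            continue  # node directory without a readable cpulist
    return topo


if __name__ == "__main__":
    topo = numa_topology()
    total = sum(len(cpus) for cpus in topo.values())
    print(f"{len(topo)} NUMA nodes, {total} CPUs")
```

Going by the numbers in this thread, 4 nodes with 48 CPUs would suggest Apollo 80 or FX700, while 8 nodes with 50 or 52 CPUs would suggest an FX1000 node.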