This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH] aarch64: Optimized memset for Kunpeng processor.
- From: "Zhangxuelei (Derek)" <zhangxuelei4 at huawei dot com>
- To: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>, "siddhesh at gotplt dot org" <siddhesh at gotplt dot org>, Szabolcs Nagy <Szabolcs dot Nagy at arm dot com>, jiangyikun <jiangyikun at huawei dot com>, "yikunkero at gmail dot com" <yikunkero at gmail dot com>
- Cc: nd <nd at arm dot com>
- Date: Thu, 31 Oct 2019 15:55:13 +0000
- Subject: Re: [PATCH] aarch64: Optimized memset for Kunpeng processor.
> Would it not make more sense to use traditional unrolling? Eg. process
> 128 or 256 bytes per iteration instead of 3x 64?
Well, I think it is not unrolling in the real sense, because after each 64-byte store there is a check for whether the tail has been reached. So there is no difference between 3x and 4x, and unrolling to 4x is fine.
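To illustrate the point, here is a minimal C sketch of a loop that stores 64 bytes at a time and checks for the tail after every store. It is not the code from the patch: the names sketch_set_long and store64 are hypothetical, and the tail handling (one overlapping 64-byte store ending at the end of the buffer) is an assumption made only to keep the example complete.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: stands in for a pair of "stp q0, q0" stores (64 bytes). */
static void store64(unsigned char *p, unsigned char c)
{
    for (size_t i = 0; i < 64; i++)
        p[i] = c;
}

/* Sketch only: "3x 64" loop body with a tail check after each 64-byte store.
   Assumes n >= 64; smaller sizes would be handled by a separate path. */
void *sketch_set_long(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    unsigned char *end = p + n;
    unsigned char v = (unsigned char)c;

    assert(n >= 64);

    for (;;) {
        /* Each block is followed by its own exit check, so the number of
           blocks per iteration does not change how often we branch. */
        store64(p, v); p += 64; if ((size_t)(end - p) < 64) break;
        store64(p, v); p += 64; if ((size_t)(end - p) < 64) break;
        store64(p, v); p += 64; if ((size_t)(end - p) < 64) break;
    }

    /* Assumed tail handling: one store that ends exactly at 'end' and may
       overlap bytes that were already written. */
    store64(end - 64, v);
    return dst;
}
```

Because every 64-byte store has its own exit check, adding a fourth block to the loop body leaves the branch count per 64 bytes unchanged, which is why 3x and 4x behave the same in this structure.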
> You mean DC_ZVA does not work (ie. disabled in OS) or it doesn't give a speedup?
> That sounds odd...
Well, I mean that in general it did not give such an obvious speedup, so using set_long with stp to set zero is better on the Kunpeng processor. And it seems that DC_ZVA in the latest version, memset_base64, has similar performance to our version.
> There is little point in branching over just 1 instruction - it's
> cheaper just to execute it than risk the misprediction (it would
> need to use dstend rather than dstin).
Here the interval being set is 64..127 bytes rather than 64..96 bytes, so without the branch, a store relative to dstend would go beyond the border for sizes of 64..80 bytes. The longer interval only benefits sizes of 96..127 bytes.
The unused valw is removed.