This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] aarch64: Optimized memset for Kunpeng processor.


Hi Wilco,

> Would it not make more sense to use traditional unrolling? Eg. process
> 128 or 256 bytes per iteration instead of 3x 64?

Well, I think it is not unrolling in the real sense, because after each 64-byte store there is a check for whether the tail has been reached. So there is no difference between 3x and 4x, and unrolling to 4x is fine.
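
To make the loop shape concrete, here is a rough C sketch rather than the patch's assembly. The names set64 and set_long_sketch are made up for illustration, and the overlapping final store is an assumption about how the tail is handled. It only shows that with a tail check after every 64-byte store, writing the body 3 times or 4 times gives the same stores as a plain loop:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for two "stp q0, q0" pairs (64 bytes). */
static void set64(unsigned char *p, int c) { memset(p, c, 64); }

/* Sketch only: 3x "unrolled" body with a tail check after every
   64-byte store.  Assumes count >= 64.  */
static void set_long_sketch(unsigned char *dst, int c, size_t count)
{
    assert(count >= 64);
    unsigned char *dstend = dst + count;
    unsigned char *p = dst;

    for (;;) {
        set64(p, c); p += 64;
        if ((size_t)(dstend - p) <= 64) break;   /* tail reached? */
        set64(p, c); p += 64;
        if ((size_t)(dstend - p) <= 64) break;   /* tail reached? */
        set64(p, c); p += 64;
        if ((size_t)(dstend - p) <= 64) break;   /* tail reached? */
    }
    /* Assumed tail handling: one final store that may overlap
       bytes already written, so nothing past dstend is touched.  */
    set64(dstend - 64, c);
}
```

Adding a fourth store-and-check pair to the loop body does not change which bytes are written, only how often the backward branch is taken.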

> You mean DC_ZVA does not work (ie. disabled in OS) or it doesn't give a speedup? 
> That sounds odd...

Well, I mean that DC_ZVA in the generic memset did not give an obvious speedup on the Kunpeng processor; using set_long with stp to store zeros is better there. And it seems the DC_ZVA path in the latest version, memset_base64, has similar performance to our version.

> There is little point in branching over just 1 instruction - it's
> cheaper just to execute it than risk the misprediction (it would 
> need to use dstend rather than dstin).

Here the interval being set is 64..127 bytes rather than 64..96 bytes, so without the branch, a 64..80 byte set would go beyond the border when using dstend. And the longer interval only benefits 96..127 bytes.
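
As a rough illustration of why a 64..127 byte range wants a branch at all, here is a C sketch. It is not the patch's actual instruction sequence, which is not quoted in this mail; the store layout and the names set32 and set_64_127_sketch are assumptions made for the example:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one "stp q0, q0" (32 bytes). */
static void set32(unsigned char *p, int c) { memset(p, c, 32); }

/* Assumed layout for 64..127 bytes, for illustration only.  */
static void set_64_127_sketch(unsigned char *dstin, int c, size_t count)
{
    assert(count >= 64 && count <= 127);
    unsigned char *dstend = dstin + count;

    set32(dstin, c);            /* bytes 0..31 */
    set32(dstin + 32, c);       /* bytes 32..63 */
    if (count >= 96)            /* a branch like the one discussed */
        set32(dstin + 64, c);   /* bytes 64..95, only valid if count >= 96 */
    set32(dstend - 32, c);      /* last 32 bytes, may overlap earlier stores */
}
```

In this sketch the store at dstin + 64 would write past the end of the buffer if it were executed unconditionally for counts below 96, which is the kind of border problem a wider interval introduces.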

The unused valw is removed.

Cheers,
Xuelei

