This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.



RE: [PATCH, AARCH64] Optimized memset


> pinskia@gmail.com wrote:
> > On Jul 8, 2015, at 8:05 AM, Wilco Dijkstra <wdijkstr@arm.com> wrote:
> >
> > This is an optimized memset for AArch64. Memset is split into 4 main cases: small sets of up to 16
> > bytes, medium of 16..96 bytes which are fully unrolled. Large memsets of more than 96 bytes align
> > the destination and use an unrolled loop processing 64 bytes per iteration. Memsets of zero of more
> > than 256 use the dc zva instruction, and there are faster versions for the common ZVA sizes 64 or
> > 128. STP of Q registers is used to reduce codesize without loss of performance.
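
The size classes described above can be sketched in C as a simple classifier. This is only an illustration, not the code from the patch: the names `memset_path` and `classify` are hypothetical, and the handling of the exact boundaries (16 bytes counted as medium, ZVA only for counts strictly above 256) is an assumption based on the wording of the description.

```c
#include <stddef.h>

/* Hypothetical labels for the four cases named in the description. */
enum memset_path { PATH_SMALL, PATH_MEDIUM, PATH_LARGE, PATH_ZVA };

/* Pick a path from the byte count and the fill value.
 * Boundary choices are assumptions, not taken from the patch. */
static enum memset_path classify(size_t count, int value)
{
    if (count < 16)
        return PATH_SMALL;      /* small sets */
    if (count <= 96)
        return PATH_MEDIUM;     /* 16..96 bytes, fully unrolled */
    if ((unsigned char)value == 0 && count > 256)
        return PATH_ZVA;        /* zeroing large blocks with dc zva */
    return PATH_LARGE;          /* aligned loop, 64 bytes per iteration */
}
```

For example, `classify(1024, 0)` selects the ZVA path, while `classify(1024, 1)` falls back to the unrolled 64-byte loop because dc zva can only write zeros.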
> 
> Since you are using the vector register already, why not avoid using the multiply and just do:
> dup followed by a fmov?  At least for ThunderX, this will be faster than doing the dup and the
> mult. That is, ThunderX has a cheap fmov between vector and GPRs (2 cycles).
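
For readers unfamiliar with the trick under discussion, here is a portable C sketch of replicating the fill byte across a 64-bit word. It assumes the multiply in question is the usual multiplication by 0x0101010101010101, which the thread does not spell out. `splat_mul` and `splat_shift` are hypothetical names; `splat_shift` is only a plain-C stand-in for the result that dup followed by fmov would leave in a general-purpose register, not the AArch64 instructions themselves.

```c
#include <stdint.h>

/* Replicate the low byte of c into all 8 bytes using one multiply. */
static uint64_t splat_mul(int c)
{
    return (uint64_t)(unsigned char)c * 0x0101010101010101ULL;
}

/* Same result without a multiply: double the filled width each step.
 * Portable stand-in for copying the low 64 bits of a byte-wise dup. */
static uint64_t splat_shift(int c)
{
    uint64_t v = (unsigned char)c;
    v |= v << 8;    /* 2 bytes filled */
    v |= v << 16;   /* 4 bytes filled */
    v |= v << 32;   /* 8 bytes filled */
    return v;
}
```

Both functions return the same value for any `c`; which sequence is faster in the real assembly depends on the core, as the ThunderX and A53 observations in this thread suggest.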

It turns out to be slightly faster on A53 too, so I've updated my patch.

Wilco

ChangeLog:
2015-07-10  Wilco Dijkstra  <wdijkstr@arm.com>

	* newlib/libc/machine/aarch64/memset.S (memset):
	Rewrite of optimized memset.

OK for commit?

Attachment: 0003-Optimized-memset.txt
Description: Text document

