This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.
> pinskia@gmail.com wrote:
> > On Jul 8, 2015, at 8:05 AM, Wilco Dijkstra <wdijkstr@arm.com> wrote:
> >
> > This is an optimized memset for AArch64. Memset is split into 4 main cases: small sets of up to 16
> > bytes, medium of 16..96 bytes which are fully unrolled. Large memsets of more than 96 bytes align
> > the destination and use an unrolled loop processing 64 bytes per iteration. Memsets of zero of more
> > than 256 use the dc zva instruction, and there are faster versions for the common ZVA sizes 64 or
> > 128. STP of Q registers is used to reduce codesize without loss of performance.
>
> Since you are using the vector register already, why not avoid using the multiply and just do:
> dup followed by a fmov? At least for thunderX, this will be faster than doing the dup and the
> mult. That is thunderX has a cheap fmov between vector and gprs (2 cycles).

It turns out to be slightly faster on A53 too, so I've updated my patch.

Wilco

ChangeLog:
2015-07-10  Wilco Dijkstra  <wdijkstr@arm.com>

	* newlib/libc/machine/aarch64/memset.S (memset):
	Rewrite of optimized memset.

OK for commit?

Attachment:
0003-Optimized-memset.txt
Description: Text document
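
The size classes and the multiply-based byte replication discussed in the message can be sketched in C as below. This is only an illustration and not the code in the attached patch: the names broadcast_byte, classify and the memset_path enum are hypothetical, and the handling of the exact boundaries (16, 96 and 256 bytes) is an assumption read off the description above. The patch itself is AArch64 assembly and, per the reply, replicates the byte with dup followed by fmov rather than a multiply.

```c
#include <stddef.h>
#include <stdint.h>

/* Replicate one byte into all eight bytes of a 64-bit word.
   This is the "multiply" approach mentioned in the thread; the
   suggested alternative is a vector dup followed by an fmov. */
static uint64_t broadcast_byte(unsigned char c)
{
    return (uint64_t)c * 0x0101010101010101ULL;
}

/* Hypothetical labels for the four cases described in the message. */
enum memset_path {
    PATH_SMALL,   /* up to 16 bytes */
    PATH_MEDIUM,  /* 16..96 bytes, fully unrolled */
    PATH_LARGE,   /* more than 96 bytes: align, then 64 bytes per iteration */
    PATH_ZVA      /* zero fill of more than 256 bytes: dc zva */
};

/* Pick a path from the fill value and length. Boundary handling is an
   assumption: a length of exactly 16 is treated as small here. */
static enum memset_path classify(int c, size_t n)
{
    if (n <= 16)
        return PATH_SMALL;
    if (n <= 96)
        return PATH_MEDIUM;
    if ((unsigned char)c == 0 && n > 256)
        return PATH_ZVA;
    return PATH_LARGE;
}
```

For example, under these assumptions classify(0, 300) selects the dc zva path, while classify(1, 300) falls back to the 64-byte loop because the fill value is nonzero.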