This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.
Re: [PATCH, AARCH64] Optimized memset
- From: pinskia at gmail dot com
- To: Wilco Dijkstra <wdijkstr at arm dot com>
- Cc: "<newlib at sourceware dot org>" <newlib at sourceware dot org>
- Date: Wed, 8 Jul 2015 11:01:01 -0700
- Subject: Re: [PATCH, AARCH64] Optimized memset
- References: <000801d0b98f$96ad0e10$c4072a30$ at com>
> On Jul 8, 2015, at 8:05 AM, Wilco Dijkstra <wdijkstr@arm.com> wrote:
>
> This is an optimized memset for AArch64. Memset is split into 4 main cases: small sets of up to 16
> bytes, medium of 16..96 bytes which are fully unrolled. Large memsets of more than 96 bytes align
> the destination and use an unrolled loop processing 64 bytes per iteration. Memsets of zero of more
> than 256 use the dc zva instruction, and there are faster versions for the common ZVA sizes 64 or
> 128. STP of Q registers is used to reduce codesize without loss of performance.
Since you are already using the vector registers, why not avoid the multiply and just do a dup followed by an fmov? At least on ThunderX, this will be faster than doing the dup and the multiply, because ThunderX has a cheap fmov between the vector registers and the GPRs (2 cycles).
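
To make the two options concrete, here is a rough C sketch of both ways of replicating the fill byte into a 64-bit general-purpose value. It is only an illustration and not code from the patch: the function names are made up, and the NEON intrinsics are my assumption of how the dup/fmov sequence would be written in C (the compiler is free to pick a different move instruction). On non-AArch64 targets a plain memset/memcpy fallback stands in for the vector path so the sketch still builds.

```c
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Multiply approach: replicate the byte across a 64-bit GPR value
   by multiplying with a constant that has 0x01 in every byte. */
static uint64_t broadcast_via_multiply(uint8_t c)
{
    return (uint64_t)c * 0x0101010101010101ULL;
}

/* dup + move approach: broadcast the byte into all 16 vector lanes,
   then copy the low 64 bits of the vector back to a GPR. */
static uint64_t broadcast_via_dup(uint8_t c)
{
#if defined(__aarch64__)
    uint8x16_t v = vdupq_n_u8(c);                      /* roughly: dup v0.16b, w1 */
    return vgetq_lane_u64(vreinterpretq_u64_u8(v), 0); /* vector-to-GPR move */
#else
    /* Portable stand-in (assumption): emulate the 16 lanes in memory. */
    uint8_t lanes[16];
    uint64_t low;
    memset(lanes, c, sizeof lanes);
    memcpy(&low, lanes, sizeof low);
    return low;
#endif
}
```

Both functions return the same value for any byte; which one is cheaper depends on the relative cost of the integer multiply and the vector-to-GPR move on the target core.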
Thanks,
Andrew
>
> ChangeLog:
> 2015-07-08 Wilco Dijkstra <wdijkstr@arm.com>
>
> * newlib/libc/machine/aarch64/memset.S (memset):
> Rewrite of optimized memset.
>
> OK for commit?
> <0003-Optimized-memset.txt>