This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [RFC PATCH] aarch64: improve memset
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Richard Henderson <rth at twiddle dot net>
- Cc: libc-alpha <libc-alpha at sourceware dot org>, Marcus Shawcroft <marcus dot shawcroft at arm dot com>
- Date: Fri, 20 Jun 2014 13:05:23 +0200
- Subject: Re: [RFC PATCH] aarch64: improve memset
- Authentication-results: sourceware.org; auth=none
- References: <539BF47F dot 3030907 at twiddle dot net>
On Sat, Jun 14, 2014 at 12:06:39AM -0700, Richard Henderson wrote:
> The major idea here is to use IFUNC to check the zva line size once, and use
> that to select different entry points. This saves 3 branches during startup,
> and allows significantly more flexibility.
>
> Also, I've cribbed several of the unaligned store ideas that Ondrej has done
> with the x86 versions.
>
> I've done some performance testing using cachebench, which suggests that the
> unrolled memset_zva_64 path is 1.5x faster than the current memset at 1024
> bytes and above. The non-zva path appears to be largely unchanged.
>
> I'd like to use some of Ondrej's benchmarks+data, but I couldn't locate them in
> a quick search of the mailing list. Pointers?
>
> Comments?
>
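For reference, the entry-point selection described above can be sketched in plain C. This is only a minimal illustration, not the actual glibc code: the variant bodies are stubs delegating to libc memset, and a real AArch64 resolver would read the DC ZVA block size from the DCZID_EL1 register rather than the stubbed zva_line_size() below. With GCC one would attach such a resolver via __attribute__((ifunc("resolve_memset"))) so the choice is made once at relocation time.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical variants; real code would provide tuned assembly for each. */
static void *memset_zva_64(void *s, int c, size_t n) { return memset(s, c, n); }
static void *memset_generic(void *s, int c, size_t n) { return memset(s, c, n); }

/* Stubbed for portability; on AArch64 this would decode DCZID_EL1. */
static unsigned zva_line_size(void) { return 64; }

/* Resolver: consulted once at startup, picks an entry point, so no
   per-call branches on the line size remain in the hot path. */
static void *(*resolve_memset(void))(void *, int, size_t)
{
    switch (zva_line_size()) {
    case 64:
        return memset_zva_64;
    default:
        return memset_generic;
    }
}
```

The point is that the three startup branches collapse into a single indirect call through the resolved pointer.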
A benchmark that I currently use is here; it simply measures the running
time of a given command with different implementations. You need to
generate a .so with memset for each variant, then run ./benchmark
and wait. I am not sure about the performance impact of unrolling, as
these sizes tend to be relatively rare in the apps that I measured.
http://kam.mff.cuni.cz/~ondra/memset_consistency_benchmark.tar.bz2
What I got from that is a bit chaotic; for example, on AMD, gcc runs
fastest with a simple rep stosq loop, but other benchmarks say
otherwise. Updating memset based on that is on my priority list.
Then I have a profiler; however, it is currently x86-specific, and it
would take some work to make it cross-platform. It also has the
limitation that it does not measure the effects of memset on caches,
which could skew the results.
An important part here is the characteristics of the data, which are here:
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memset_profile/results_gcc/result.html
It shows, among other things, that the data are almost always 8-byte
aligned. The latest source is here:
http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile130813.tar.bz2
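The alignment statistic mentioned above can be gathered with a trivial helper (a sketch I am assuming for illustration, not code from the profile tarball): bucket each destination pointer by how many of its low bits are zero, so bucket 3 and above means 8-byte aligned.

```c
#include <stdint.h>

/* Bucket a pointer by alignment: number of trailing zero address bits,
   capped at 6 (64-byte alignment). Bucket 3 or higher = 8-byte aligned. */
static unsigned alignment_bucket(const void *p)
{
    uintptr_t a = (uintptr_t)p;
    unsigned b = 0;
    while (b < 6 && (a & 1) == 0) {
        a >>= 1;
        b++;
    }
    return b;
}
```

A profiler interposing memset would increment a histogram indexed by this bucket on every call; the results page linked above reports exactly this kind of distribution.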