This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [RFC PATCH] aarch64: improve memset
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Richard Henderson <rth at twiddle dot net>
- Cc: libc-alpha <libc-alpha at sourceware dot org>, Marcus Shawcroft <marcus dot shawcroft at arm dot com>
- Date: Fri, 20 Jun 2014 13:05:23 +0200
- Subject: Re: [RFC PATCH] aarch64: improve memset
- Authentication-results: sourceware.org; auth=none
- References: <539BF47F dot 3030907 at twiddle dot net>
On Sat, Jun 14, 2014 at 12:06:39AM -0700, Richard Henderson wrote:
> The major idea here is to use IFUNC to check the zva line size once, and use
> that to select different entry points. This saves 3 branches during startup,
> and allows significantly more flexibility.
>
> Also, I've cribbed several of the unaligned store ideas that Ondrej has done
> with the x86 versions.
>
> I've done some performance testing using cachebench, which suggests that the
> unrolled memset_zva_64 path is 1.5x faster than the current memset at 1024
> bytes and above. The non-zva path appears to be largely unchanged.
>
> I'd like to use some of Ondrej's benchmarks+data, but I couldn't locate them in
> a quick search of the mailing list. Pointers?
>
> Comments?
>
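For reference, the entry-point selection described above can be sketched in plain C. This is only a minimal illustration, not the actual glibc code: the variant bodies are stubs delegating to libc memset, and a real AArch64 resolver would read the DC ZVA block size from the DCZID_EL1 register rather than the stubbed zva_line_size() below. With GCC one would attach such a resolver via __attribute__((ifunc("resolve_memset"))) so the choice is made once at relocation time.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical variants; real code would provide tuned assembly for each. */
static void *memset_zva_64(void *s, int c, size_t n) { return memset(s, c, n); }
static void *memset_generic(void *s, int c, size_t n) { return memset(s, c, n); }

/* Stubbed for portability; on AArch64 this would decode DCZID_EL1. */
static unsigned zva_line_size(void) { return 64; }

/* Resolver: consulted once at startup, picks an entry point, so no
   per-call branches on the line size remain in the hot path. */
static void *(*resolve_memset(void))(void *, int, size_t)
{
    switch (zva_line_size()) {
    case 64:
        return memset_zva_64;
    default:
        return memset_generic;
    }
}
```

The point is that the three startup branches collapse into a single indirect call through the resolved pointer.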
A benchmark that I currently use is here; it simply measures the running
time of a given command with different implementations. You need to
generate a .so with memset for each variant, then run ./benchmark
and wait. I am not sure about the performance impact of unrolling, as
these sizes tend to be relatively rare in the apps that I measured.
http://kam.mff.cuni.cz/~ondra/memset_consistency_benchmark.tar.bz2
What I got from that is a bit chaotic; for example, on AMD, gcc runs
fastest with a simple rep stosq loop, but other benchmarks say
otherwise. Updating memset based on that is on my priority list.
Then I have a profiler; however, it is currently x86-specific, and it
would take some work to make it cross-platform. It also has the
limitation that it does not measure the effects of memset on caches,
which could skew the results.
An important part here is the characteristics of the data, which are here:
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memset_profile/results_gcc/result.html
It shows, among other things, that the data are almost always 8-byte
aligned. The latest source is here:
http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile130813.tar.bz2
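The alignment statistic mentioned above can be gathered with a trivial helper (a sketch I am assuming for illustration, not code from the profile tarball): bucket each destination pointer by how many of its low bits are zero, so bucket 3 and above means 8-byte aligned.

```c
#include <stdint.h>

/* Bucket a pointer by alignment: number of trailing zero address bits,
   capped at 6 (64-byte alignment). Bucket 3 or higher = 8-byte aligned. */
static unsigned alignment_bucket(const void *p)
{
    uintptr_t a = (uintptr_t)p;
    unsigned b = 0;
    while (b < 6 && (a & 1) == 0) {
        a >>= 1;
        b++;
    }
    return b;
}
```

A profiler interposing memset would increment a histogram indexed by this bucket on every call; the results page linked above reports exactly this kind of distribution.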