This is the mail archive of the
libc-ports@sources.redhat.com
mailing list for the libc-ports project.
Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: "Ryan S. Arnold" <ryan dot arnold at gmail dot com>
- Cc: Siddhesh Poyarekar <siddhesh at redhat dot com>, Carlos O'Donell <carlos at redhat dot com>, Will Newton <will dot newton at linaro dot org>, "libc-ports at sourceware dot org" <libc-ports at sourceware dot org>, Patch Tracking <patches at linaro dot org>
- Date: Thu, 5 Sep 2013 13:06:57 +0200
- Subject: Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.
- Authentication-results: sourceware.org; auth=none
- References: <CANu=DmiXLL9v1Z1KS0sBOs-pL8csEUGc9YE829_-tidKd-GruQ at mail dot gmail dot com> <5220F1F0 dot 80501 at redhat dot com> <CANu=DmhA9QvSe6RS72Db2P=yyjC72fsE8d4QZKHEcNiwqxNMvw at mail dot gmail dot com> <52260BD0 dot 6090805 at redhat dot com> <20130903173710 dot GA2028 at domone dot kolej dot mff dot cuni dot cz> <522621E2 dot 6020903 at redhat dot com> <20130903185721 dot GA3876 at domone dot kolej dot mff dot cuni dot cz> <5226354D dot 8000006 at redhat dot com> <20130904073008 dot GA4306 at spoyarek dot pnq dot redhat dot com> <CAAKybw87cyx67bpX=qjedrfjKxDmtgOfi_zCiaCfHGgx328Bsw at mail dot gmail dot com>
On Wed, Sep 04, 2013 at 12:35:46PM -0500, Ryan S. Arnold wrote:
> On Wed, Sep 4, 2013 at 2:30 AM, Siddhesh Poyarekar <siddhesh@redhat.com> wrote:
> > 3. Provide acceptable performance for unaligned sizes without
> > penalizing the aligned case
>
> There are cases where the user can't control the alignment of the data
> being fed into string functions, and we shouldn't penalize them for
> these situations if possible, but in reality if a string routine shows
> up hot in a profile this is a likely culprit and there's not much that
> can be done once the unaligned case is made as stream-lined as
> possible.
>
> Simply testing for alignment (not presuming aligned data) itself slows
> down the processing of aligned-data, but that's an unavoidable
> reality.
How expensive are unaligned loads on powerpc? On x64 the penalty for
using them is smaller than that of the alternatives (increased branch
misprediction...)
> I've chatted with some compiler folks about the possibility
> of branching directly to aligned case labels in string routines if the
> compiler is able to detect aligned data.. but was informed that this
> suggestion might get me burned at the stake.
>
You would need to improve gcc's detection of alignment first. Currently
gcc misses most of the opportunities; even in the following code gcc
issues redundant alignment checks:
#include <stdint.h>

char *foo(long *x)
{
  if (((uintptr_t)x) % 16)
    return (char *)(x + 4);
  else {
    __builtin_memset(x, 0, 512);
    return (char *)x;
  }
}
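For completeness, gcc does already offer a way for the caller to assert alignment explicitly rather than have it inferred: __builtin_assume_aligned (available since GCC 4.7). A minimal sketch (zero_block is a made-up helper, not part of any patch here):

```c
#include <string.h>

/* The caller promises that x is 16-byte aligned; gcc propagates
   that promise and can skip alignment prologues when expanding
   the memset. */
void zero_block(long *x)
{
    long *p = __builtin_assume_aligned(x, 16);
    memset(p, 0, 512);
}
```

This shifts the burden to the caller, though, which is exactly what better alignment tracking in gcc would avoid.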
If the gcc folks fix that then we would not need to ask them for
anything else. We could just change the headers to recognize the aligned
case, like:

#define strchr(x,c) ({ char *__x = (x);                      \
  (__builtin_constant_p(((uintptr_t)__x) % 16)               \
   && !(((uintptr_t)__x) % 16))                              \
      ? strchr_aligned(__x, c)                                \
      : strchr(__x, c); })
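The dispatch idea can be exercised today with a stub (strchr_aligned here is a stand-in, since no such variant exists in glibc; the macro uses a private name so nothing depends on redefining strchr itself):

```c
#include <stdint.h>
#include <string.h>

/* Stub for a hypothetical aligned-only variant; counts calls so
   one can observe which branch the compiler selected. */
static int aligned_calls;
static char *strchr_aligned(const char *s, int c)
{
    aligned_calls++;
    return strchr(s, c);
}

/* Dispatch to the aligned variant only when the compiler can
   prove 16-byte alignment at compile time; otherwise fall back
   to the generic strchr. */
#define my_strchr(x, c) (__extension__ ({                    \
    const char *__x = (x);                                   \
    (__builtin_constant_p(((uintptr_t)__x) % 16)             \
     && !(((uintptr_t)__x) % 16))                            \
        ? strchr_aligned(__x, (c))                           \
        : strchr(__x, (c)); }))
```

Either branch returns the same result; the macro only changes which implementation runs, and only when gcc can actually fold the alignment test, which today it mostly cannot.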
> > 4. Measure the effect of dcache pressure on function performance
> > 5. Measure effect of icache pressure on function performance.
> >
> > Depending on the actual cost of cache misses on different processors,
> > the icache/dcache miss cost would either have higher or lower weight
> > but for 1-3, I'd go in that order of priorities with little concern
> > for unaligned cases.
>
> I know that icache and dcache miss penalty/costs are known for most
> architectures but not whether they're "published". I suppose we can,
> at least, encourage developers for the CPU manufacturers to indicate
> in the documentation of preconditions which is more expensive,
> relative to the other if they're unable to indicate the exact costs of
> these misses.
>
These costs are relatively difficult to describe; take strlen on main
memory as an example.
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strlen_profile/results_rand_nocache/result.html
Here we see the hardware prefetcher in action. Time grows linearly with
size until 512 bytes and remains constant until 4096 bytes (switch to
block view), where it starts increasing again at a slower rate.
For core2 the shape is similar, except that the plateau starts at 256
bytes and ends at 1024 bytes.
http://kam.mff.cuni.cz/~ondra/benchmark_string/core2/strlen_profile/results_rand_nocache/result.html
AMD processors are different: phenomII performance is linear, and for
fx10 there is even an area where time decreases with size.
http://kam.mff.cuni.cz/~ondra/benchmark_string/phenomII/strlen_profile/results_rand_nocache/result.html
http://kam.mff.cuni.cz/~ondra/benchmark_string/fx10/strlen_profile/results_rand_nocache/result.html