This is the mail archive of the
mailing list for the libc-ports project.
Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.
- From: "Ryan S. Arnold" <ryan dot arnold at gmail dot com>
- To: Siddhesh Poyarekar <siddhesh at redhat dot com>
- Cc: "Carlos O'Donell" <carlos at redhat dot com>, Ondřej Bílka <neleai at seznam dot cz>, Will Newton <will dot newton at linaro dot org>, "libc-ports at sourceware dot org" <libc-ports at sourceware dot org>, Patch Tracking <patches at linaro dot org>
- Date: Wed, 4 Sep 2013 12:35:46 -0500
- Subject: Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.
- Authentication-results: sourceware.org; auth=none
- References: <CANu=DmiBHoymFKTvaW_VsdhWZEYwkfViz1tTeRgj7H80f0FntA at mail dot gmail dot com> <5220D30B dot 9080306 at redhat dot com> <CANu=DmiXLL9v1Z1KS0sBOs-pL8csEUGc9YE829_-tidKd-GruQ at mail dot gmail dot com> <5220F1F0 dot 80501 at redhat dot com> <CANu=DmhA9QvSe6RS72Db2P=yyjC72fsE8d4QZKHEcNiwqxNMvw at mail dot gmail dot com> <52260BD0 dot 6090805 at redhat dot com> <20130903173710 dot GA2028 at domone dot kolej dot mff dot cuni dot cz> <522621E2 dot 6020903 at redhat dot com> <20130903185721 dot GA3876 at domone dot kolej dot mff dot cuni dot cz> <5226354D dot 8000006 at redhat dot com> <20130904073008 dot GA4306 at spoyarek dot pnq dot redhat dot com>
On Wed, Sep 4, 2013 at 2:30 AM, Siddhesh Poyarekar <email@example.com> wrote:
> 1. Assume aligned input. Nothing should take (any noticeable)
> performance away from aligned copies/moves
> 2. Scale with size
In my experience scaling with data-size isn't really possible beyond a
certain point. We pick a target range of sizes to optimize for based
upon customer feedback and we try to use pre-fetching in that range as
efficiently as possible. But I get your point. We don't want any
particular size to be severely penalized.
Each architecture and specific platform needs to know/decide what the
optimal range is and document it. Even for Power we have different
expectations on server hardware like POWER7, vs. embedded hardware
like ppc 476.
> 3. Provide acceptable performance for unaligned sizes without
> penalizing the aligned case
There are cases where the user can't control the alignment of the data
being fed into string functions, and we shouldn't penalize them for
these situations if possible, but in reality if a string routine shows
up hot in a profile this is a likely culprit and there's not much that
can be done once the unaligned case is made as stream-lined as
possible.
Simply testing for alignment (not presuming aligned data) itself slows
down the processing of aligned-data, but that's an unavoidable
reality. I've chatted with some compiler folks about the possibility
of branching directly to aligned case labels in string routines if the
compiler is able to detect aligned data, but was informed that this
suggestion might get me burned at the stake.
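
To make the cost of the alignment test concrete, here is a minimal C
sketch of a copy routine that checks alignment up front and then
dispatches to a word-at-a-time path. This is an illustration only, not
the glibc or ARM implementation under review; the names sketch_copy
and both_word_aligned are hypothetical. The test at the top is the
overhead that aligned callers pay even though they never need the
fallback.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero when both pointers are aligned to the
   size of an unsigned long. */
static int both_word_aligned(const void *dst, const void *src)
{
    uintptr_t bits = (uintptr_t)dst | (uintptr_t)src;
    return (bits & (sizeof(unsigned long) - 1)) == 0;
}

/* Illustrative copy routine (an assumption for this example, not a
   real libc entry point). */
void *sketch_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    const size_t word = sizeof(unsigned long);

    /* This branch is the cost every caller pays, aligned or not. */
    if (both_word_aligned(dst, src)) {
        /* Aligned path: move one word per iteration.  The fixed-size
           memcpy calls avoid aliasing problems and are normally
           compiled to a single load and store. */
        while (n >= word) {
            unsigned long w;
            memcpy(&w, s, word);
            memcpy(d, &w, word);
            s += word;
            d += word;
            n -= word;
        }
    }

    /* Unaligned fallback and tail: one byte per iteration. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

A compiler that could prove alignment would, in the scheme described
above, jump straight to the aligned loop and skip the test.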
As previously discussed, we might be able to use tunables in the
future to mitigate this. But of course, this would be 'use at your
own risk'.
> 4. Measure the effect of dcache pressure on function performance
> 5. Measure effect of icache pressure on function performance.
> Depending on the actual cost of cache misses on different processors,
> the icache/dcache miss cost would either have higher or lower weight
> but for 1-3, I'd go in that order of priorities with little concern
> for unaligned cases.
I know that icache and dcache miss penalty/costs are known for most
architectures but not whether they're "published". I suppose we can,
at least, encourage developers for the CPU manufacturers to indicate
in the documentation of preconditions which is more expensive,
relative to the other if they're unable to indicate the exact costs of
Some further thoughts (just to get this stuff documented):
Some performance regressions I'm familiar with (on Power), which CAN
be measured with a baseline micro-benchmark regardless of use-case:
1. Hazard/Penalties - I'm thinking things like load-hit-store in the
tail of a loop, e.g., label: load a value from an address, do work,
store to the same address, branch to loop. You take a stall when the
stored value isn't ready to load at the top of the loop.
2. Dispatch grouping - Some instructions need to be first-in-group,
etc. Grouping is also based on instruction alignment. At least on
Power I believe some instructions benefit from specific alignment.
3. Instruction Grouping - Depending on topology of the pipeline,
specific groupings of instructions might incur pipeline stalls due
to unavailability of the load/store unit (for instance).
4. Facility usage costs - Sometimes using certain facilities for
certain sizes of data is more costly than not using the facility.
For instance, I believe that using the DFPU on Power requires that the
floating-point pipeline be flushed, so BFP and DFP really shouldn't be
used together. I believe there is a powerpc32 string function which
uses FPRs because they're 64-bits wide even on ppc32. But we measured
the cost/benefit ratio of using this vs. not.
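
As an illustration of the load-hit-store pattern in item 1, here is a
small C sketch with two loops that compute the same sum. The first
keeps its running total in memory, so each iteration loads a value
that the previous iteration just stored; the second keeps the total in
a local that the compiler can hold in a register. The function names
are hypothetical, and whether the first loop actually stalls depends
on the core (store forwarding may hide some or all of the penalty), so
treat this as a shape to measure rather than a guaranteed regression.

```c
#include <stddef.h>

/* Running total lives in memory: the volatile qualifier forces a real
   load and a real store on every iteration, so each load hits the
   store issued by the previous iteration. */
long sum_through_memory(const long *v, size_t n)
{
    volatile long total = 0;
    for (size_t i = 0; i < n; i++)
        total = total + v[i];
    return total;
}

/* Running total can stay in a register: no store-to-load dependency
   through memory inside the loop. */
long sum_in_register(const long *v, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}
```

Timing the two functions over the same array gives a rough idea of
what the hazard costs on a given machine.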
On Power, micro benchmarks are run in-house with these (and many
other) factors in mind.
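
For readers who want a starting point, here is a minimal C sketch of a
baseline micro-benchmark that times memcpy for one size and one pair
of source/destination misalignments. It is not the in-house Power
harness mentioned above and not glibc's benchtests; the function name
time_memcpy_ns, the 64-byte base alignment, and the use of POSIX
clock_gettime are all assumptions made for this example. It does not
model icache or dcache pressure, because the buffers stay warm across
iterations.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: average nanoseconds per memcpy call.  src_off
   and dst_off are byte offsets from a 64-byte-aligned base, so 0 means
   aligned.  Returns a negative value on bad input, allocation failure,
   or a wrong copy. */
double time_memcpy_ns(size_t size, size_t src_off, size_t dst_off,
                      unsigned iters)
{
    /* Calling through a volatile pointer keeps the compiler from
       inlining or dropping the copy. */
    void *(*volatile copy)(void *, const void *, size_t) = memcpy;
    struct timespec t0, t1;
    double result = -1.0;

    if (iters == 0)
        return result;

    unsigned char *src_raw = malloc(size + src_off + 64);
    unsigned char *dst_raw = malloc(size + dst_off + 64);
    if (src_raw != NULL && dst_raw != NULL) {
        memset(src_raw, 0xA5, size + src_off + 64);
        memset(dst_raw, 0x00, size + dst_off + 64);

        /* Round each base up to 64 bytes, then apply the offset. */
        unsigned char *src = (unsigned char *)
            (((uintptr_t)src_raw + 63) & ~(uintptr_t)63) + src_off;
        unsigned char *dst = (unsigned char *)
            (((uintptr_t)dst_raw + 63) & ~(uintptr_t)63) + dst_off;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (unsigned i = 0; i < iters; i++)
            copy(dst, src, size);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        /* Only report a time if the copy was actually correct. */
        if (memcmp(dst, src, size) == 0) {
            double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
                      + (double)(t1.tv_nsec - t0.tv_nsec);
            result = ns / (double)iters;
        }
    }
    free(src_raw);
    free(dst_raw);
    return result;
}
```

Sweeping size over the range a platform cares about, with offsets
(0, 0) for the aligned case and something like (1, 3) for the unaligned
case, gives a first look at points 1 to 3 from the quoted list.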