This is the mail archive of the libc-ports at sourceware dot org
mailing list for the libc-ports project.
Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.
- From: "Carlos O'Donell" <carlos at redhat dot com>
- To: "Ryan S. Arnold" <ryan dot arnold at gmail dot com>
- Cc: Will Newton <will dot newton at linaro dot org>, "libc-ports at sourceware dot org" <libc-ports at sourceware dot org>, Patch Tracking <patches at linaro dot org>, Ondřej Bílka <neleai at seznam dot cz>, Siddhesh Poyarekar <siddhesh at redhat dot com>
- Date: Tue, 03 Sep 2013 19:31:32 -0400
- Subject: Re: [PATCH] sysdeps/arm/armv7/multiarch/memcpy_impl.S: Improve performance.
- References: <520894D5 dot 7060207 at linaro dot org> <CANu=DmiBHoymFKTvaW_VsdhWZEYwkfViz1tTeRgj7H80f0FntA at mail dot gmail dot com> <5220D30B dot 9080306 at redhat dot com> <CANu=DmiXLL9v1Z1KS0sBOs-pL8csEUGc9YE829_-tidKd-GruQ at mail dot gmail dot com> <5220F1F0 dot 80501 at redhat dot com> <CANu=DmhA9QvSe6RS72Db2P=yyjC72fsE8d4QZKHEcNiwqxNMvw at mail dot gmail dot com> <52260BD0 dot 6090805 at redhat dot com> <CAAKybw99YcSoyU58w2iqHGRTQpajAtKX6JZp=r57bT37fjvQ2Q at mail dot gmail dot com> <52263E63 dot 2080301 at redhat dot com> <CAAKybw_7VE3zYM1Vb4sfE-HRMMdCx2E9Obf45_11=bGjVZXeJQ at mail dot gmail dot com>
On 09/03/2013 04:56 PM, Ryan S. Arnold wrote:
> On Tue, Sep 3, 2013 at 2:54 PM, Carlos O'Donell <carlos at redhat dot com> wrote:
>> The current set of performance preconditions is baked into the experience
>> of the core developers reviewing patches. I want the experts out of the
> This is the crux.
> Developers working for the CPU manufacturers are privy to a lot of
> unpublished timing, penalty/hazard information, as well as proprietary
> pipeline analysis tools.
> Will "J. Random Hacker" working for MegaCorp tell you that the reason
> he's chosen a particular instruction sequence is because the system
> he's working on has a tiny branch cache (the size of which might be
That's an interesting point. I've seen similar things happen in *.md
file generation in gcc and had forgotten about it entirely. However,
in all such instances the developer can say "I am bound by confidentiality
not to reveal the reasons why I made my choices." We can document that.
The opposite point is also true in that we might actually tune an
implementation based on real-world data and have no idea why it behaves
optimally from an architectural perspective. In which case we have to
document "This implementation was tuned using data set X."
How are these two situations any different really?
I would accept patches in both cases, but I would like to see that
MegaCorp's patches don't change anything for the consumer level CPUs
we routinely employ.
>>> PowerPC has had the luxury of not having their performance
>>> pre-conditions contested. PowerPC string performance is optimized
>>> based upon customer data-set analysis. So PowerPC's preconditions are
>>> pretty concrete... Optimize for aligned data in excess of 128-bytes
>>> (I believe).
>> We should be documenting this somewhere, preferably in a Power-specific
>> test that looks at just this kind of issue.
> I might be mistaken, but I think you'll find these preconditions
> explicitly defined in the string function implementation source files
> for PowerPC.
Excellent. We should probably have some more central location for this
information in a developer's guide or internals guide.
>> Documenting this statically is the first, in my opinion, stepping stone
>> to having something like dynamic feedback.
Glad we agree.
>>> Unless technology evolves that you can statistically analyze data in
>>> real time and adjust the implementation based on what you find (an
>>> implementation with a different set of preconditions) to account for
>>> this you're going to end up with a lot of in-fighting over
>> Why do you assume we'll have a lot of in-fighting over performance?
> I'm projecting here. If someone proposed to adjust the PowerPC
> optimized string functions to their own preconditions and it
> drastically changed the performance of existing customers, or future
> customers you'd see me panic.
At least with a microbenchmark you'd know the situation was about to
go sideways immediately after running the benchmark. Otherwise it might
be quite a while until you next do profiling ahead of a big tools release.
This is exactly the situation we are in right now for x86 and x86-64,
and will likely see with ARM/AArch64 and the plethora of vendor
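To make the microbenchmark point concrete, here is a minimal sketch of the kind of check that would flag a regression right away. It is not the glibc benchmark harness. The helper name `bench_memcpy_ns` and its parameters are hypothetical, and a real harness would need warm-up runs, many sizes and alignments, and statistical treatment of the timings.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Volatile sink so the compiler cannot discard the copies.  */
static volatile unsigned char sink;

/* Hypothetical helper: average nanoseconds per memcpy call for a copy
   of LEN bytes whose source starts SRC_OFF bytes past a malloc-aligned
   base.  Returns -1.0 if memory cannot be allocated.  */
double
bench_memcpy_ns (size_t len, size_t src_off, unsigned iters)
{
  assert (len > 0 && iters > 0);

  unsigned char *src = malloc (len + src_off);
  unsigned char *dst = malloc (len);
  if (src == NULL || dst == NULL)
    {
      free (src);
      free (dst);
      return -1.0;
    }
  memset (src, 0x5a, len + src_off);

  struct timespec t0, t1;
  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (unsigned i = 0; i < iters; i++)
    {
      memcpy (dst, src + src_off, len);
      sink = dst[len - 1];	/* Force the result to be observed.  */
    }
  clock_gettime (CLOCK_MONOTONIC, &t1);

  double ns = (double) (t1.tv_sec - t0.tv_sec) * 1e9
	      + (double) (t1.tv_nsec - t0.tv_nsec);
  free (src);
  free (dst);
  return ns / iters;
}
```

Comparing `bench_memcpy_ns (256, 0, 100000)` against `bench_memcpy_ns (256, 1, 100000)` before and after a patch gives a rough aligned versus unaligned picture for one size. The absolute numbers mean little outside the machine they were measured on.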
>> At present we've split the performance intensive (or so we believe)
>> routines on a per-machine basis. The arguments are then going to be
>> had only on a per-machine basis, and even then each hardware
>> variant can have an IFUNC resolver select the right routine at
> Right, selecting the right variant with IFUNC has certainly helped
> platforms that didn't use optimized libraries. This is the low
> hanging fruit. So now our concern is the proliferation of micro-tuned
> variants and a lack of qualified eyes to objectively review the
Yes, that's a concern.
>> Then we come upon the tunables that should allow some dynamic adjustment
>> of an algorithm based on realtime data.
> Yes, you can do this with tunables if the developer knows something
> about the data (more about that later).
We can't always assume ignorance, but I agree that there are problems
>>> I've run into situations where I recommended that a customer code
>>> their own string function implementation because they continually
>>> encountered unaligned-data when copying-by-value in C++ functions and
>>> PowerPC's string function implementations penalized unaligned copies
>>> in preference for aligned copies.
>> Provide both in glibc and expose a tunable?
> So do we (the glibc community) no longer consider the proliferation of
> tunables to be a mortal sin? Or was that only with regard to
> configuration options? Regardless, it still burdens the Linux
> distributions and developers who have to provide QA.
We need to have a broader conversation about this very issue.
Pragmatically, configuration and tunables are similar problems.
I think tunables are a mortal sin and would rather see algorithms
that can select the best implementation for the user automatically.
However, we need tunables to bootstrap that. I would hope that as
tunables appear they disappear into smart algorithms that do a better
job of tuning internals.
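As an illustration of "provide both and expose a tunable", the sketch below selects between two copy variants from a string setting. Everything here is hypothetical: the environment variable `EXAMPLE_COPY_TUNE`, the function names, and the two stand-in variants are made up for illustration and are not an existing glibc interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef void *(*copy_fn) (void *, const void *, size_t);

/* Stand-in for a variant tuned for aligned data.  */
static void *
copy_aligned_tuned (void *dst, const void *src, size_t n)
{
  assert (dst != NULL && src != NULL);
  return memcpy (dst, src, n);
}

/* Stand-in for a variant that does not penalize unaligned data.  */
static void *
copy_unaligned_tuned (void *dst, const void *src, size_t n)
{
  assert (dst != NULL && src != NULL);
  unsigned char *d = dst;
  const unsigned char *s = src;
  for (size_t i = 0; i < n; i++)
    d[i] = s[i];
  return dst;
}

/* Map a tunable value to a variant; anything other than "unaligned"
   (including an unset tunable) keeps the aligned-tuned default.  */
copy_fn
select_copy (const char *tunable)
{
  if (tunable != NULL && strcmp (tunable, "unaligned") == 0)
    return copy_unaligned_tuned;
  return copy_aligned_tuned;
}

/* Read the hypothetical tunable once, for example at start-up.  */
copy_fn
select_copy_from_env (void)
{
  return select_copy (getenv ("EXAMPLE_COPY_TUNE"));
}
```

The selection happens once, outside the hot path, so the copy routines themselves carry no extra test. The QA cost remains: every value of the tunable is another configuration that has to be tested.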
> If tunables are available, then trial-and-error would help where a
> user doesn't know the particulars of his application's data usage.
> Using tunables is potentially problematic as well. Often testing a
> condition in highly optimized code is enough to obviate the
> performance benefit you're attempting to provide. Checking for feature
> availability might consume enough cycles to make it senseless to use
> the facility itself. I believe this is what happened in the early
> days trying to use VMX in string routines.
Agreed. You have to make it configurable then, and that's costly from
a complexity perspective, and makes testing all build configurations
harder and harder.
> Additionally, while dynamically linked applications won't suffer from
> using IFUNC resolved functions (because of mandatory PLT usage), glibc
> internal usage of IFUNC resolved functions very likely will if/when
> forced to go through the PLT, especially on systems like PowerPC where
> indirect branching is more expensive than direct branching. When
> Adhemerval's PowerPC IFUNC patches go in I'll probably argue for
> keeping a 'generic' optimized version for internal libc usage. We'll
> see how it all works together.
> So using tunables alone isn't necessarily a win unless it's coupled
> with IFUNC. But using IFUNC also isn't a guaranteed win in all cases.
Unless we try to come up with something to make it faster on Power?
> For external usage, using IFUNC in combination with a tunable should
> be beneficial. For instance, on systems that don't have a concrete
> cacheline size (e.g., the A2 processor), at process initialization we
> query the system cacheline size, populate a static with the size, and
> then the string routines will query that size at runtime. It'd be
> nice to do that query at initialization and then pre-select an
> implementation based on cacheline size so we don't have to test for
> the cacheline size each time through the string function.
> This of course increases the cost of maintaining the string routines
> by having a myriad of combinations.
Which we already have and need to test via the IFUNC testing functionality
that HJ added.
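The "query once at initialization, then pre-select" idea discussed above can be sketched with a plain function pointer. This is an illustration only, not how glibc or its IFUNC resolvers are implemented. All names are hypothetical, and the three variants are stand-ins that simply call memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn) (void *, const void *, size_t);

struct copy_variant
{
  const char *name;
  copy_fn fn;
};

/* Stand-ins for routines tuned for a given cache-line size.  */
static void *
copy_line64 (void *d, const void *s, size_t n)
{
  return memcpy (d, s, n);
}

static void *
copy_line128 (void *d, const void *s, size_t n)
{
  return memcpy (d, s, n);
}

static void *
copy_generic (void *d, const void *s, size_t n)
{
  return memcpy (d, s, n);
}

static const struct copy_variant variant_line64 = { "line64", copy_line64 };
static const struct copy_variant variant_line128 = { "line128", copy_line128 };
static const struct copy_variant variant_generic = { "generic", copy_generic };

static const struct copy_variant *selected = &variant_generic;

/* Called once at start-up with the cache-line size the system reports.
   Unknown sizes fall back to the generic variant.  */
void
init_copy_dispatch (long cacheline_bytes)
{
  switch (cacheline_bytes)
    {
    case 64:
      selected = &variant_line64;
      break;
    case 128:
      selected = &variant_line128;
      break;
    default:
      selected = &variant_generic;
      break;
    }
}

const char *
selected_copy_name (void)
{
  return selected->name;
}

/* Hot path: no cache-line test, only one indirect call.  */
void *
dispatch_copy (void *dst, const void *src, size_t n)
{
  assert (selected != NULL);
  return selected->fn (dst, src, n);
}
```

On glibc, `sysconf (_SC_LEVEL1_DCACHE_LINESIZE)` is one possible source for the argument to `init_copy_dispatch`, though it may not return a usable value on every system. Note that the indirect call remains, which is the cost raised above for platforms where indirect branches are more expensive than direct ones.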
> These are all the trade-offs we weigh.
... and more.
I see your concerns, and raise you a couple more.
If we leave the situation as is we will have a continued and difficult
time accepting performance patches for functional units of the library.
We need to drive objectivity into the evaluation of the patches. It won't
be completely objective at first, but it had better move in that direction.