This is the mail archive of the libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [RFC] A method for forcing IFUNC selector
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Paul Pluzhnikov <ppluzhnikov at gmail dot com>
- Cc: GLIBC Devel <libc-alpha at sourceware dot org>, Brooks Moses <bmoses at google dot com>
- Date: Fri, 7 Nov 2014 18:37:18 +0100
- Subject: Re: [RFC] A method for forcing IFUNC selector
- Authentication-results: sourceware.org; auth=none
- References: <CALoOobMYNLsv6NSmXqwj7j4kCx1XaQU9m0VExFMrtb3SVKpNxg at mail dot gmail dot com>
On Thu, Nov 06, 2014 at 12:12:04PM -0800, Paul Pluzhnikov wrote:
> This commit:
> commit 2d48b41c8fa610067c4d664ac2339ae6ca43e78c
> Author: Ondrej Bilka <email@example.com>
> Date: Mon May 20 08:20:00 2013 +0200
> Faster memcpy on x64.
> We add new memcpy version that uses unaligned loads which are fast
> on modern processors. This allows second improvement which is avoiding
> computed jump which is relatively expensive operation.
> Tests available here:
> changed the default memcpy selected on all of our processors from
> __memcpy_ssse3_back to __memcpy_sse2_unaligned.
> That caused a nice 2-3% improvement on some of our benchmarks (thanks!),
> but also 10-15% degradation on others (boo!).
Where is the source of the benchmarks? Without that it's hard to say
whether they measure something meaningful or are just garbage in, garbage out.
> It appears that for certain sizes and alignments, the new memcpy could be
> 50% slower than the old one.
You must take the whole function into account, along with the effects of
branch misprediction. That is a reason why the new implementation is better
when sizes vary: it is 2.5 times faster on the following loop.
for (i = 0; i < 10000000; i++)
memcpy (x + 16 * (i % 1024),y + 128 * (i % 1024), i % 64);
Also, some sizes may be intentionally slower: the implementation uses a
decision tree to determine which size range to handle, and putting rare
ranges at the bottom speeds up the more common sizes. I also sent a patch
that optimizes memcpy a bit more, so the situation could easily change.
> While we figure out how to re-tune our applications to get rid of the
> "slow" size/alignment memcpy()s, we'd like to keep the applications that
> suffer degradation on the old memcpy.
As I do not know whether this is a bad workload or just chasing ghosts:
did you add a counter to show that these sizes/alignments happen
sufficiently often?
Also try LD_PRELOAD-ing your
application with the memcpy .so files here to see which is best. I was
surprised that quite often a simple rep stosq is the fastest, but
sometimes it is not.
If your workload is deterministic and fast, write it into a ./benchmark_action
script and run ./benchmark, which will measure the relative performance of
each implementation.
> Unfortunately, glibc currently provides no way to do that.
> Proposal: a new environment variable, say LD_IFUNC_SELECTOR, that will
> contain semi-colon separated list of ifunc->implementation mappings that
> the end-user desires to force. E.g. for our degraded applications, we
> would set LD_IFUNC_SELECTOR to "memcpy=__memcpy_ssse3_back", while someone
> who also wanted to force strcmp to __strcmp_sse42 would set it to
That is bad, as it breaks with a new release when a function gets
renamed or deleted, or a new superior implementation is introduced. A better
alternative would be to LD_PRELOAD a custom implementation. Until I add
function selection based on profile feedback, one could gain a substantial
speedup that way, as the best implementation for a given workload is often
more than 30% better than a compromise that needs to work for all workloads.