This is the mail archive of the mailing list for the glibc project.
Re: [PATCH 0/2] Multiarch hooks for memcpy variants
Apologies for the multiple versions of this message (again).
I've got to learn to edit my complete message outside email since I can't
seem to fix my habit of typing "^S" (save text) every time
I complete a paragraph. Unfortunately, Thunderbird treats
^S as a send command.
On 8/15/2017 3:10 PM, Patrick McGehearty wrote:
On 8/14/2017 8:22 AM, Wilco Dijkstra wrote:
Siddhesh Poyarekar wrote:
The first part is not true for falkor since its implementation is a good
10-15% faster on the falkor chip due to its design differences. glibc
makes pretty extensive use of memcpy throughout, but I don't have data
on how much difference a core-specific memcpy will make there, so I
don't have enough grounds for a generic change.
66% of memcpy calls are <=16 bytes. Assuming you can even get a 15% gain
for these small sizes (there is very little you can do differently),
that's at most 1 cycle faster, so the PLT indirection is going to be
more expensive.
It is important to be careful about overemphasizing the frequency of
short memcpy calls. Even though a high percentage of memcpy calls are
short, my experience is that a high percentage of time spent in memcpy
is on longer copies.
The following example is just that, an example, not an expression of any
specific real application's behavior: if 66% of calls are <=16 bytes
(average length 8, say) but the average length of the remaining 1/3 of
calls is 1K bytes (i.e. more than 100 times as long), then the vast
majority of time in memcpy would be spent in the longer copies.
My experience with tuning libc memcpy off and on on multiple platforms
is that copies of length > 256 bytes are the ones that affect overall
application performance. Really short copies, where the length and/or
alignment might be known at compile time, are best handled by inlining
the copy.
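As a concrete illustration of the inlining point (the function name here is mine, not from the thread): when the size is a compile-time constant, compilers lower the memcpy to plain loads and stores, so no call, PLT or otherwise, is ever made:

```c
#include <stdint.h>
#include <string.h>

/* The length is known at compile time, so the compiler replaces this
   memcpy with a single 8-byte load; no library call is emitted at all.
   (This is also the idiomatic portable way to do an unaligned load.) */
uint64_t load_u64(const void *p) {
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```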
I've produced platform-specific optimizations for memcpy many times over
the years. By platform-specific, I mean different code for different
generations/platforms of the same architecture. These show improvements
from as little as 10% to as much as 250%, depending on how close the
memory architecture of the latest platform is to that of the prior
platform.
Typical factors that can influence the best memcpy performance on a
specific platform for a given architecture include:
- ideal prefetch distance ... depends on processor speed, cache/memory
  latency, depth of memory subsystem queues, details of memory subsystem
  priorities for prefetch vs demand fetch, and more
- number of alu operations that can be issued per cycle
- number of memory operations that can be issued per cycle
- number of total instructions that can be issued per cycle
- branch misprediction latency, branch predictor behavior, and other
  branch handling details
- and many other architectural features
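To illustrate the first factor, here is a sketch of how prefetch distance appears as a tuning knob in a copy loop. PF_DIST, the loop shape, and the 256-byte value are all illustrative placeholders, not glibc code:

```c
#include <stddef.h>

/* PF_DIST stands in for the platform-dependent "ideal prefetch distance"
   above; 256 bytes is an arbitrary placeholder, not a recommendation. */
#define PF_DIST 256

void copy_longs(long *dst, const long *src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        /* Issue a read prefetch PF_DIST bytes ahead, once per cache-ish
           chunk. Prefetching past the end is harmless: prefetch never
           faults. (__builtin_prefetch is a GCC/Clang builtin.) */
        if (i % 8 == 0)
            __builtin_prefetch(src + i + PF_DIST / sizeof *src, 0, 0);
        dst[i] = src[i];
    }
}
```

Tuning consists of re-timing the loop with different PF_DIST values on each target; the best value shifts with cache latency and memory-queue depth, which is exactly why one binary rarely suits all generations.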
I find it hard to imagine a single generic memcpy library routine that
can match the performance of a platform-specific tuned routine over a
typical range of copy lengths, assuming the architecture has been around
long enough to go through several semiconductor generations. With
dynamic linking, the overhead of using platform-specific code for
commonly called routines should be relatively minimal.
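The dynamic-linking mechanism alluded to here is what glibc's multiarch support builds on: a GNU indirect function (ifunc), whose resolver runs once at relocation time to choose an implementation. A minimal sketch, with invented names; a real resolver would inspect hwcaps or CPU identification rather than hard-coding a choice:

```c
/* Requires GCC/Clang targeting an ELF platform, as glibc does. */
static int impl_generic(void) { return 1; }
static int impl_tuned(void)   { return 2; }  /* unused in this sketch */

/* The resolver runs once, when the dynamic linker processes the ifunc
   relocation; afterwards, calls to pick_impl go directly to the chosen
   function with no per-call dispatch. */
static int (*resolve_pick_impl(void))(void) {
    /* A real resolver would test hwcaps / CPU id here. */
    return impl_generic;
}

int pick_impl(void) __attribute__((ifunc("resolve_pick_impl")));
```

This is how glibc selects, say, a falkor memcpy at load time without paying a branch on every call; the PLT-indirection cost debated above is about calls that cross the shared-library boundary, which ifuncs inside libc itself avoid.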
I do agree a good generic version should be available, as the effort of
finding the best tuning for a particular platform can take weeks, and
not all platforms will get that intense attention.
- patrick mcgehearty
Your last point about hurting everything else is very valid though; it's
very likely that adding an extra indirection in cases where
__memcpy_generic is going to be called anyway is going to be expensive,
given that the bulk of memcpy calls will be for small sizes of less than
16 bytes.
Note that the falkor version does quite well in memcpy-random across several
micro architectures so I think parts of it could be moved into the generic code.
Allowing a PLT only for __memcpy_chk and mempcpy would need a test case
waiver in check_localplt and that would become a blanket OK for PLT
usage for memcpy, which we don't want. Hence my patch is probably the
best compromise, especially since there is precedent for the approach.
I still can't see any reason to even support these entry points in GLIBC, let
alone optimize them using ifuncs. The _chk functions should obviously be
inlined to avoid all the target specific complexity for no benefit. I think this
could trivially be done via the GLIBC headers already. (That's assuming they
are in any way performance critical.)
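For reference, the header-based inlining of the _chk functions described here rests on __builtin_object_size; a rough sketch (the function and buffer names are mine) of the information a _chk variant consumes:

```c
#include <string.h>

static char buf[16];

/* With -D_FORTIFY_SOURCE=2 -O2, glibc's headers rewrite the memcpy below
   into __builtin___memcpy_chk(buf, src, 8, __builtin_object_size(buf, 0)).
   Because 8 <= 16 is provable at compile time, the compiler folds that
   back to an ordinary (often inlined) memcpy and no run-time check or
   __memcpy_chk call remains. */
size_t copy8(const char *src) {
    memcpy(buf, src, 8);
    return __builtin_object_size(buf, 0);  /* bound the _chk call would get */
}
```

Only when the destination size cannot be proven at compile time does a real call to the __memcpy_chk entry point survive, which is why it is arguably not worth target-specific ifunc treatment.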