This is the mail archive of the mailing list for the glibc project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Re: Intel's new rte_memcpy()

On Fri, Jan 30, 2015 at 09:03:50AM -0800, H.J. Lu wrote:
> On Fri, Jan 30, 2015 at 5:52 AM, Luke Gorrie <> wrote:
> > Howdy!
> >
> > I am hoping for some feedback and advice for me as an application developer.
> >
> > Intel have recently posted a couple of memcpy() implementations and
> > suggested that these have significant advantages for networking
> > applications. There is one for Sandy Bridge and one for Haswell. The
> > proposal is that networking application developers would statically
> > link one or both of these into their applications instead of
> > dynamically linking with glibc. The proposal is part of their Data
> > Plane Development Kit (
> >
> > They explain it much better than I do:
> >
> >
> > and their code is here:
> >
> > My question to the list is this:
> >
> > Should networking application developers adopt Intel's custom
> > implementation if (like me) they are absolutely dependent on good and
> > consistent performance of memcpy on all recent hardware (>= Sandy
> > Bridge) and Linux distributions? (and then -- what to do about
> > memmove?)
> >
> > I have done some cursory benchmarks with cachebench:
> >
> >
> > ... with a correction to the rte_memcpy on Haswell results:
> >
Definitely not. You would need a much more sophisticated memcpy that does
runtime profiling per call site to get a consistent speedup. There are
several alternatives; one of them could be 50% faster than the others, but
you need runtime data to know which one.

As stated in the original post, cachebench is a pretty bad benchmark. If
you randomize sizes and alignments, as my profiler does, rte_memcpy is
around 10% slower in the 1-1000 byte range.

For bigger sizes a benchmarked speedup is questionable for a similar
reason. It assumes that the data is in the L1 cache, which in reality does
not happen that often for larger sizes: you can fit only four 8kb buffers
in a 32kb L1 cache.

While the new avx2 implementation with an 8kb block is around 10% faster
when the data is in the L1 cache, it is also around 10% slower when the
memory is in the L2 cache and beyond; see in the graph where it switches
to 16 block mode. It looks like rep movsb is best for copying L2+ data,
so you need to know where in your application that threshold lies.
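For reference, `rep movsb` is a single x86 string instruction, so a bare-bones memcpy built on it is tiny (this is my own sketch in GCC/Clang extended-asm syntax, x86-64 only, not code from rte_memcpy or glibc):

```c
#include <stddef.h>

/* Minimal memcpy via the x86 `rep movsb` string instruction.
   On CPUs with enhanced rep movsb (ERMSB) it is competitive for
   copies whose data misses L1.  x86-64 with GCC/Clang only.  */
static void *memcpy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *d = dst;                /* rdi/rsi/rcx are clobbered by the insn */
    __asm__ volatile ("rep movsb"
                      : "+D" (d), "+S" (src), "+c" (n)
                      :
                      : "memory");
    return dst;
}
```

An application that has measured its own threshold could dispatch to something like this above it and to a vector loop below it; the point of the thread is that the threshold is workload-dependent.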

While my benchmarks are more accurate, they are still flawed in several
ways, as they do not measure real workloads. I looked at the code and it
is badly optimized for small sizes. It would cause a performance
regression when compiling with gcc; see the following profile.

The final problem is inlining memcpy. It is already a problem that gcc
does too much memcpy inline expansion with suboptimal code. While this
may benefit a few of the hottest memcpy callers, it harms performance
for the others. One copy of rte_memcpy is 8kb, and when it is not in the
instruction cache you pay a ~300 cycle performance penalty; see the
following benchmark that simulates that situation.

A full profiler is here

My suggestion is simple: test it. Take your application, profile it to
identify the most frequent memcpy callers, replace memcpy there with
rte_memcpy, run your application to see if that gives a performance
gain, and repeat as necessary. I cannot know which implementation is
best for your workload until I see what workload you use.
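As a hypothetical way to learn your workload before choosing an implementation (the wrapper name and log2 bucketing are mine, not part of any existing tool): route the hot call sites through a counting wrapper and look at the size distribution it records.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical instrumentation: a drop-in memcpy wrapper that keeps a
   log2 histogram of copy sizes per run, so you can see which size
   range your application actually exercises before picking between a
   small-copy-optimized and a bulk-optimized implementation.  */
static unsigned long memcpy_size_hist[64];

static void *counting_memcpy(void *dst, const void *src, size_t n)
{
    size_t bucket = 0, m = n;
    while (m >>= 1)
        bucket++;                            /* bucket = floor(log2(n)) */
    memcpy_size_hist[bucket < 64 ? bucket : 63]++;
    return memcpy(dst, src, n);
}
```

After a representative run, dump `memcpy_size_hist`: a distribution dominated by sub-cache-line copies argues for a very different memcpy than one dominated by multi-kilobyte packets.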

> I import it to hjl/memcpy branch at
> Here is the bench-memcpy comparison against __memcpy_avx_unaligned
> on Haswell:
No, these benchmarks are junk, as I mentioned in several previous
messages.
