Re: [PATCH RFC] Improve 64bit memcpy performance for Haswell CPU with AVX instruction

On Fri, May 09, 2014 at 08:40:46PM +0800, Ling Ma wrote:
> If there are still some issues on the latest memcpy and memset, please
> let us know.
> Thanks
> Ling
> 2014-04-21 12:52 GMT+08:00,
> <>:
> > From: Ling Ma <>
> >
> > In this patch we take advantage of HSW memory bandwidth, reduce
> > branch mispredictions by avoiding branch instructions, and force the
> > destination to be aligned for AVX instructions.
> >
> > The CPU2006 403.gcc benchmark indicates this patch improves performance
> > from 6% to 14%.
> >
> > This version only jumps backward for the memmove overlap case.
> > Thanks to Ondra for his comments, and to Yuriy for the C code hint.

As it stands, gcc compilation time with this patch is around
0.12% slower than with the pending sse2 version and indistinguishable
from the current version.

I used a benchmark that measures the total running time of gcc over five
hours and reports the relative time and variance; you can get it here

The results I got on Haswell are
     100.25% +- 0.04%    100.25% +- 0.04%    100.13% +- 0.07%    100.00% +- 0.04%    100.34% +- 0.13%    100.95% +- 0.07%

where I also tried the fusion and rep strategies, as in memset, which helps.
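
To make "relative time and variance" concrete, here is a minimal sketch
of how such numbers can be derived from repeated runs. It is not the
code of the benchmark mentioned above; the function names and the
assumption that each run is stored as wall-clock seconds in an array
are mine.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n timing samples (seconds). */
double run_mean(const double *t, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += t[i];
    return sum / (double)n;
}

/* Unbiased sample variance of n timing samples. */
double run_variance(const double *t, size_t n)
{
    assert(n > 1);
    double m = run_mean(t, n);
    double ss = 0.0;
    for (size_t i = 0; i < n; i++)
        ss += (t[i] - m) * (t[i] - m);
    return ss / (double)(n - 1);
}

/* Mean time of a candidate as a percentage of the baseline mean;
   100.0 means "same speed", above 100.0 means "slower". */
double relative_percent(const double *cand, size_t nc,
                        const double *base, size_t nb)
{
    return 100.0 * run_mean(cand, nc) / run_mean(base, nb);
}
```

With differences as small as the ones above, the variance matters as
much as the mean, which is why the runs have to be long.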

I also tried to measure it with my benchmark on a different function; it
claims that the pending sse2 version is best on the gcc+gnuplot load.
Looking at the graph, it seems to lose time on too much branching before
it gets to small sizes, see
with profiler here
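
As an illustration of why branching on small sizes matters, here is a
sketch of the usual overlapping load/store trick that handles a whole
size class without a branch per size. This is not code from the patch
or from the sse2 version; the function name and the 8..16 byte range
are chosen only for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes, 8 <= n <= 16, with two 8-byte loads and two 8-byte
   stores. The second pair overlaps the first when n < 16, so no
   branch on the exact size is needed. Both loads are done before
   either store, so overlapping src/dst is handled as well. */
void copy_8_to_16(char *dst, const char *src, size_t n)
{
    uint64_t head, tail;
    assert(n >= 8 && n <= 16);
    memcpy(&head, src, 8);           /* first 8 bytes */
    memcpy(&tail, src + n - 8, 8);   /* last 8 bytes  */
    memcpy(dst, &head, 8);
    memcpy(dst + n - 8, &tail, 8);
}
```

The fixed-size memcpy calls are there only to express unaligned
loads and stores portably; compilers normally turn them into plain
moves.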

Longer inputs are faster with avx2 but they do not occur that often.

One reason for the inconsistent results is that memcpy affects stores
after the call, depending on what pending stores it creates, and that
cannot be measured with memcpy running time alone.
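
One way to capture part of that effect is to time the call together
with the stores that follow it, rather than the call alone. A minimal
sketch, assuming POSIX clock_gettime is available; the function names
and the scratch buffer are hypothetical and not taken from any of the
benchmarks above.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Nanoseconds between two timestamps. */
static long long elapsed_ns(struct timespec a, struct timespec b)
{
    return (long long)(b.tv_sec - a.tv_sec) * 1000000000LL
         + (long long)(b.tv_nsec - a.tv_nsec);
}

/* Time memcpy plus m byte stores issued right after it, so that any
   stall caused by stores still pending from the copy is included in
   the measurement. scratch is volatile so the stores are not removed
   by the compiler. */
long long time_copy_then_stores(char *dst, const char *src, size_t n,
                                volatile char *scratch, size_t m)
{
    struct timespec t0, t1;
    assert(dst != NULL && src != NULL && scratch != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, n);
    for (size_t i = 0; i < m; i++)
        scratch[i] = (char)i;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_ns(t0, t1);
}
```

A single call like this is dominated by timer overhead; in practice
it would be repeated many times and compared across implementations.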

As with memset, I checked big inputs, and the loop is around 20% faster
than rep movsq
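
For readers who have not seen it, a rep movsq based copy looks roughly
like the sketch below. It is only an illustration of the strategy
being compared, not the code that was measured; the function name is
made up, and it assumes GCC-style inline asm on x86-64 (other targets
fall back to memcpy).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Forward copy of n bytes: 8-byte words with rep movsq, then the
   remaining 0..7 bytes with rep movsb. Relies on the direction flag
   being clear, as the x86-64 ABI guarantees at function entry. */
void *copy_rep_movsq(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
#if defined(__x86_64__) && defined(__GNUC__)
    void *d = dst;
    const void *s = src;
    size_t qwords = n / 8;
    size_t tail = n % 8;
    /* rdi = destination, rsi = source, rcx = count; all are updated. */
    __asm__ volatile("rep movsq"
                     : "+D"(d), "+S"(s), "+c"(qwords) : : "memory");
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(s), "+c"(tail) : : "memory");
#else
    memcpy(dst, src, n);
#endif
    return dst;
}
```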

time LD_PRELOAD=./ ./big
time LD_PRELOAD=./ ./big

with the following program:

#include <stdlib.h>
#include <string.h>
int main(){
 int i;
 char *x=malloc(100000080)+rand()%64;
 char *y=malloc(100000080)+rand()%64;

  for (i=0;i<10;i++)

