This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH RFC] Improve 64bit memcpy performance for Haswell CPU with AVX instruction
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Ling Ma <ling dot ma dot program at gmail dot com>
- Cc: libc-alpha at sourceware dot org, liubov dot dmitrieva at gmail dot com, hjl dot tools at gmail dot com, yumkam at gmail dot com, Ling Ma <ling dot ml at alibaba-inc dot com>
- Date: Thu, 15 May 2014 22:22:13 +0200
- Subject: Re: [PATCH RFC] Improve 64bit memcpy performance for Haswell CPU with AVX instruction
- Authentication-results: sourceware.org; auth=none
- References: <1398055946-4493-1-git-send-email-ling dot ma at alipay dot com> <CAOGi=dOQEbbkkzQGz-ZtQ0-WEHj2=hjmbstZXvZyLqycVy18Kg at mail dot gmail dot com>
On Fri, May 09, 2014 at 08:40:46PM +0800, Ling Ma wrote:
> If there are still some issues on the latest memcpy and memset, please
> let us know.
>
> Thanks
> Ling
>
> 2014-04-21 12:52 GMT+08:00, ling.ma.program@gmail.com
> <ling.ma.program@gmail.com>:
> > From: Ling Ma <ling.ml@alibaba-inc.com>
> >
> > In this patch we take advantage of HSW memory bandwidth, and manage to
> > reduce branch mispredictions by avoiding branch instructions and
> > forcing the destination to be aligned with AVX instructions.
> >
> > The CPU2006 403.gcc benchmark indicates this patch improves performance
> > from 6% to 14%.
> >
> > This version only jumps backward for the memmove overlap case.
> > Thanks for Ondra's comments, and to Yuriy who gave me a C code hint on it.
As of now it is slower: gcc compilation time becomes around
0.12% slower than with the pending sse2 version and indistinguishable from
the current version.
I used a benchmark that measures the total running time of gcc over five
hours and reports relative time and variance; you can get it here
http://kam.mff.cuni.cz/~ondra/memcpy_consistency_benchmark.tar.bz2
The results I got on Haswell are
memcpy-avx.so      memcpy-sse2.so     memcpy-sse2_v2.so  memcpy_fuse.so     memcpy_rep8.so     nul.so
100.25% +- 0.04%   100.25% +- 0.04%   100.13% +- 0.07%   100.00% +- 0.04%   100.34% +- 0.13%   100.95% +- 0.07%
where I tried fusion and rep strategies like in memset, which helps.
I also tried to measure it with my benchmark on different functions; it
claims that the pending sse2 version is best on the gcc+gnuplot load. When I
looked at the graph it appears that it loses on heavily branching code until
it gets down to small sizes, see
http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_avx.html
with the profiler here http://kam.mff.cuni.cz/~ondra/benchmark_string/memcpy_profile_avx150514.tar.bz2
Longer inputs are faster with avx2, but they do not occur that often.
One reason for inconsistent results is that memcpy affects stores after the
call, depending on what pending stores it creates, and that cannot be
measured with memcpy running time alone.
Like in memset I checked big inputs, and the loop is around 20% faster than
rep movsq:
time LD_PRELOAD=./memcpy-avx.so ./big
time LD_PRELOAD=./memcpy_rep8.so ./big
with the following program:
#include <stdlib.h>
#include <string.h>

int main (void)
{
  int i;
  /* Misalign both buffers by a random offset within a cache line.  */
  char *x = malloc (100000080) + rand () % 64;
  char *y = malloc (100000080) + rand () % 64;

  for (i = 0; i < 10; i++)
    memcpy (x, y, 100000000);
  return 0;
}