This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH RFC] Improve 64bit memcpy performance for Haswell CPU with AVX instruction
- From: Ling Ma <ling dot ma dot program at gmail dot com>
- To: "H.J. Lu" <hjl dot tools at gmail dot com>
- Cc: Ondřej Bílka <neleai at seznam dot cz>, GNU C Library <libc-alpha at sourceware dot org>, Liubov Dmitrieva <liubov dot dmitrieva at gmail dot com>, yumkam at gmail dot com, Ling Ma <ling dot ml at alibaba-inc dot com>
- Date: Fri, 6 Jun 2014 23:08:16 +0800
- Subject: Re: [PATCH RFC] Improve 64bit memcpy performance for Haswell CPU with AVX instruction
- Authentication-results: sourceware.org; auth=none
- References: <1398055946-4493-1-git-send-email-ling dot ma at alipay dot com> <CAOGi=dOQEbbkkzQGz-ZtQ0-WEHj2=hjmbstZXvZyLqycVy18Kg at mail dot gmail dot com> <20140515202213 dot GA20667 at domone dot podge> <CAOGi=dNbyxj+7gjwcpAVBxYB-MH9E7s=xi2nKwYXkDViasOZrA at mail dot gmail dot com> <CAMe9rOpC5-p7DV=xBfhUknkruz2-Ek+Bpzm+ycZiKdXtSyXxiA at mail dot gmail dot com>
The 0002-memcpy-avx.patch in http://www.yunos.org/tmp/test.memcpy.memset.zip
is our updated version, and I will gzip and send it as an attachment.
2014-06-06 1:33 GMT+08:00, H.J. Lu <firstname.lastname@example.org>:
> On Tue, May 20, 2014 at 8:17 AM, Ling Ma <email@example.com> wrote:
>> 2014-05-16 4:22 GMT+08:00, OndÅej BÃlka <firstname.lastname@example.org>:
>>> On Fri, May 09, 2014 at 08:40:46PM +0800, Ling Ma wrote:
>>>> If there are still some issues on the latest memcpy and memset, please
>>>> let us know.
>>>> 2014-04-21 12:52 GMT+08:00, email@example.com
>>>> > From: Ling Ma <firstname.lastname@example.org>
>>>> > In this patch we take advantage of HSW memory bandwidth, manage to
>>>> > reduce branch mispredictions by avoiding branch instructions, and
>>>> > force the destination to be aligned using AVX instructions.
>>>> > The CPU2006 403.gcc benchmark indicates this patch improves
>>>> > performance by 6% to 14%.
>>>> > This version only jumps backward for the memmove overlap case.
>>>> > Thanks for Ondra's comments, and to Yuriy, who gave me a C code hint on it.
>>> As of now it is slower: gcc compilation time becomes around
>>> 0.12% slower than with the pending sse2 version and indistinguishable
>>> from the current one.
>>> I used a benchmark that measures the total running time of gcc for five
>>> hours and reports relative time and variance; you could get it here.
>>> The results I got on Haswell are:
>>> memcpy-avx.so      memcpy-sse2.so     memcpy-sse2_v2.so
>>> 100.25% +- 0.04%   100.25% +- 0.04%   100.13% +- 0.07%
>>>
>>> memcpy_fuse.so     memcpy_rep8.so     nul.so
>>> 100.00% +- 0.04%   100.34% +- 0.13%   100.95% +- 0.07%
>>> where I tried fusion and a rep strategy like in memset, which helps.
>>> I also tried to measure it with my benchmark on different functions; it
>>> claims that the pending sse2 version is best on a gcc+gnuplot load. When I
>>> looked at the graph, it looks like it loses on heavy branching until it gets
>>> to small sizes; see
>>> with profiler here
>> Ling: we moved less_16bytes to the code entry, so there is no degradation
>> for small sizes; attached code and your
>> meanwhile we also tested the pending memcpy; it is much better than the original,
>> but avx still gives us the best result for large inputs (we can download
>> and run it):
> Any updates on this? Where is the latest AVX2 memcpy patch?
> I didn't see it at