This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH 2/2] Improve 64bit memcpy/memmove for Corei7 with avx2 instruction
- From: Ling Ma <ling dot ma dot program at gmail dot com>
- To: Andreas Jaeger <aj at suse dot com>
- Cc: Ondřej Bílka <neleai at seznam dot cz>, Nix <nix at esperi dot org dot uk>, libc-alpha at sourceware dot org, hongjiu dot lu at intel dot com, ling dot ml at alibaba-inc dot com
- Date: Mon, 10 Jun 2013 21:28:30 +0800
- Subject: Re: [PATCH 2/2] Improve 64bit memcpy/memmove for Corei7 with avx2 instruction
- References: <CAOGi=dMiD=_Qf1EJ=F3hfyQDtQubDEC5pjpXKDCHrUQwhr=vzg at mail dot gmail dot com> <20130605161954 dot GA26401 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dPWPaX5prcL-uAaqS6=_ehzKeBmAFMdwV6aU34jZ0eHtQ at mail dot gmail dot com> <20130606125511 dot GA28565 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dPs9geCtrWhU1L_0DEfOWOknpzFSLmYs4gbYzGX8Zn5Hg at mail dot gmail dot com> <20130607104613 dot GA6343 at domone dot kolej dot mff dot cuni dot cz> <8761xqru5w dot fsf at spindle dot srvr dot nix> <CAOGi=dMV5jaS2597cksd0mW84UDd06SovsBkL5=WPez-jZWg4g at mail dot gmail dot com> <20130607160749 dot GA28961 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dP2s4k2rg8TdKwj6V9-VzbOORGzeBmh-G=Fr1eM_OyDoA at mail dot gmail dot com> <20130607184550 dot GA9683 at domone dot kolej dot mff dot cuni dot cz> <CAOGi=dN2TMG8wO97Wd1qFBUOZ7LrjTO21qP-fCPty6Mp3aOcHw at mail dot gmail dot com> <51B576CC dot 5000001 at suse dot com>
The CPU2006 benchmark is very hard to improve: a gain of more than 5%
on a single core could be the goal for a next-generation CPU, and the
achievable gain is much smaller for the SPECjbb benchmark. We can
hardly expect an improvement of more than 1% on these industry
benchmarks from an optimized memcpy_avx2 alone, even though it is the
fastest.
We presented the results for two reasons:
1) The Haswell CPU is fully capable of handling the indirect jump
instruction in memcpy_avx2 in real-world scenarios.
2) If we keep repeating the benchmark, we will find out which is
better. For example, we can test memcpy_avx2 and memcpy_new more than
3 times each; if one of them gives the better result more often, then
even though the difference is very small, the stable results can give
us the right answer.
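
As a rough illustration of how such repeated runs could be compared,
here is a minimal Python sketch using the six 403.gcc scores quoted
below. The variable names and the "difference must exceed the
run-to-run spread" rule are assumptions made for this example only,
not a criterion anyone in this thread has agreed on.

```python
from statistics import mean, stdev

# 403.gcc scores copied from the quoted message below (bigger is better).
memcpy_new = [67.63718, 66.899156, 66.982456]
memcpy_avx2 = [66.805236, 67.29362, 67.63718]

mean_new = mean(memcpy_new)
mean_avx2 = mean(memcpy_avx2)

# Relative gain of memcpy_avx2 over memcpy_new, in percent.
gain_pct = (mean_avx2 - mean_new) / mean_new * 100.0

# Sample standard deviation as a crude measure of run-to-run noise.
noise = max(stdev(memcpy_new), stdev(memcpy_avx2))

# Illustrative rule of thumb (an assumption): only call the result
# conclusive when the gap between the means exceeds the noise.
conclusive = abs(mean_avx2 - mean_new) > noise

print(f"mean memcpy_new  = {mean_new:.4f}")
print(f"mean memcpy_avx2 = {mean_avx2:.4f}")
print(f"gain             = {gain_pct:.3f}%")
print(f"noise (stdev)    = {noise:.4f}")
print(f"conclusive       = {conclusive}")
```

With only three runs per variant the spread estimate is itself very
uncertain, so more runs would be needed before reading much into it.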
Thanks
Ling
2013/6/10, Andreas Jaeger <aj@suse.com>:
> On 06/10/2013 08:17 AM, Ling Ma wrote:
> >> Last week, we separated 403.gcc from the CPU2006 benchmark and
> >> compiled it with the additional option -mstringop-strategy=libcall
> >> to avoid rep_4byte, rep_8byte and rep_byte, which use rep movs
> >> instructions. 403.gcc has plenty of branch instructions and is very
> >> sensitive to the branch misprediction rate. We are currently
> >> concerned about whether the extra branch mispredictions caused by
> >> memcpy_avx2 outweigh its benefit in real-world scenarios, so
> >> 403.gcc will help us verify that.
>>
> >> We tested 403.gcc linked with memcpy_new and 403.gcc linked with
> >> memcpy_avx2, 3 times each:
> >>
> >> 403.gcc results for memcpy_new are below (bigger is better):
>> 1) 67.63718
>> 2) 66.899156
>> 3) 66.982456
>>
>> 403.gcc for memcpy_avx2 results are below:
>>
>> 1) 66.805236
>> 2) 67.29362
>> 3) 67.63718
>>
> >> The above comparison indicates that memcpy_avx2 seems to be better,
> >> and we would like to do more experiments.
>
>
> If I take the arithmetic mean of these I get:
> 67.17293066666666666666 vs 67.24534866666666666666
>
> That's far less than 1 percent - so not conclusive at all.
>
> Andreas
> --
> Andreas Jaeger aj@{suse.com,opensuse.org} Twitter/Identica: jaegerandi
> SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
> GF: Jeff Hawn,Jennifer Guild,Felix Imendörffer,HRB16746 (AG Nürnberg)
> GPG fingerprint = 93A3 365E CE47 B889 DF7F FED1 389A 563C C272 A126
>