This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH RFC] Improve 64bit memset performance for Haswell CPU with AVX2 instruction
- From: Ling Ma <ling dot ma dot program at gmail dot com>
- To: neleai at seznam dot cz
- Cc: libc-alpha at sourceware dot org, rth at twiddle dot net, aj at suse dot com, liubov dot dmitrieva at gmail dot com, hjl dot tools at gmail dot com, Ling Ma <ling dot ml at alibaba-inc dot com>
- Date: Thu, 15 May 2014 09:05:53 +0800
- Subject: Re: [PATCH RFC] Improve 64bit memset performance for Haswell CPU with AVX2 instruction
- References: <1396850238-29041-1-git-send-email-ling dot ma at alipay dot com> <20140513173616 dot GC5047 at domone dot podge>
2014-05-14 1:36 GMT+08:00, Ondřej Bílka <neleai@seznam.cz>:
>
>
> On Mon, Apr 07, 2014 at 01:57:18AM -0400, ling.ma.program@gmail.com wrote:
>> From: Ling Ma <ling.ml@alibaba-inc.com>
>>
>> In this patch we take advantage of HSW memory bandwidth, and manage to
>> reduce branch mispredictions by avoiding branch instructions and by
>> forcing the destination to be aligned with AVX instructions.
>>
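(The alignment idea can be sketched in C with AVX2 intrinsics; this is
only a rough illustration, not the patch's assembly, and the names are
made up:)

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch only: overlap unaligned head and tail stores so the main
   loop can use aligned AVX stores without branching on the
   destination's alignment.  */
static void
memset_avx2_sketch (char *dst, int c, size_t n)
{
  __m256i v = _mm256_set1_epi8 ((char) c);
  if (n < 32)
    {
      while (n--)		/* small sizes take a scalar path */
	dst[n] = (char) c;
      return;
    }
  _mm256_storeu_si256 ((__m256i *) dst, v);		/* head */
  _mm256_storeu_si256 ((__m256i *) (dst + n - 32), v);	/* tail */
  char *p = (char *) (((uintptr_t) dst + 32) & ~(uintptr_t) 31);
  char *end = dst + n - 32;
  for (; p < end; p += 32)	/* aligned, branch-predictable body */
    _mm256_store_si256 ((__m256i *) p, v);
}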
> Now that we have a haswell machine in our department, I tested this
> implementation. The benchmark I used and the results are here:
>
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile_avx130514.tar.bz2
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile_avx.html
>
> This patch improves large inputs and does not regress small inputs
> much, which gives a total 10% improvement on the gcc test; it could be
> improved further, but it now looks good enough.
Ling: Thanks, Ondra! You have given us many good suggestions and much encouragement.
> I tried two alternatives. The first is using avx2 in the header
> (memset_fuse). It looks like it helps, adding an additional 0.5% of
> performance. However, I tried to cross-check this with the bash shell,
> where the comparison goes in the opposite direction, so I am not
> entirely sure yet; see
>
> http://kam.mff.cuni.cz/~ondra/benchmark_string/haswell/memset_profile_avx/results_bash/result.html
>
Ling: Yes, we did the experiment on our 403.gcc benchmark (I list the
download address below). It slows performance between 0 and 256 bytes,
although another benchmark gave us a good result, which is why I said
it hurt performance in my last email.
>
> The second is checking whether the rep threshold is the best one.
> This depends on the application's cache layout, and I do not have a
> definite answer yet (memset_rep and memset_avx_v2 variants). When the
> data is in the L2 cache we could lower the threshold to 1024 bytes,
> but that slows real inputs for some reason.
>
Ling: Yes, for that reason I once tried to use prefetch instructions
in the code, but they also hurt performance when the data is in L1.
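(The trade-off under discussion looks roughly like the C sketch below;
rep_threshold and nontemporal_fill are illustrative names, not glibc's
actual tuning or interfaces:)

#include <stddef.h>

/* Illustrative dispatch only; the real logic is the assembly in the
   patch below.  */
extern void nontemporal_fill (char *p, int c, size_t n);

static size_t rep_threshold;	/* derived from the shared cache size */

static void
big_memset (char *p, int c, size_t n)
{
  if (n <= rep_threshold)
    /* On Haswell, rep stosb (ERMSB) is fast and leaves the data
       cache-resident, which helps when it is reused soon.  */
    __asm__ volatile ("rep stosb"
		      : "+D" (p), "+c" (n)
		      : "a" (c)
		      : "memory");
  else
    /* Past the threshold the buffer will not fit in the cache anyway,
       so bypass it with non-temporal stores (vmovntdq in the patch).  */
    nontemporal_fill (p, c, n);
}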
>
>> The CPU2006 403.gcc benchmark also indicates this patch improves
>> performance by 22.9% to 59% compared with the original memset
>> implemented with SSE2.
>>
> I inspected that benchmark with my profiler; it is not that good, as
> it is only a simple part of gcc and two thirds of the total time is
> spent on 240-byte inputs.
Ling: Please download www.yunos.org/tmp/test.memcpy.memset.zip, which
includes our whole benchmark, readme.txt, and result.xls. We can run
and check it.
>
> A large part of the speedup could be explained by the fact that the
> avx2 implementation has a special-case branch for the 128-256 byte
> range, while the current one uses a loop. These input distributions
> differ from other programs and from running gcc itself, where short
> inputs are more common.
>
>
>> + ALIGN(4)
>> +L(gobble_data):
>> +#ifdef SHARED_CACHE_SIZE_HALF
>> + mov $SHARED_CACHE_SIZE_HALF, %r9
>> +#else
>> + mov __x86_shared_cache_size_half(%rip), %r9
>> +#endif
>> + shl $4, %r9			# threshold = 16 * (shared cache size / 2)
>> + cmp %r9, %rdx
>> + ja L(gobble_big_data)	# above threshold: take non-temporal path
>> + mov %rax, %r9		# save return value (dst)
>> + mov %esi, %eax		# fill byte in %al for stosb
>> + mov %rdx, %rcx		# byte count for rep stosb
>> + rep stosb			# ERMSB fill keeps the data cache-resident
>> + mov %r9, %rax		# restore return value
>> + vzeroupper
>> + ret
>> +
>> + ALIGN(4)
>> +L(gobble_big_data):
>> + sub $0x80, %rdx
>> +L(gobble_big_data_loop):
>> + vmovntdq %ymm0, (%rdi)	# four non-temporal 32-byte stores bypass the cache
>> + vmovntdq %ymm0, 0x20(%rdi)
>> + vmovntdq %ymm0, 0x40(%rdi)
>> + vmovntdq %ymm0, 0x60(%rdi)
>> + lea 0x80(%rdi), %rdi
>> + sub $0x80, %rdx
>> + jae L(gobble_big_data_loop)	# repeat while >= 128 bytes remain
>> + vmovups %ymm0, -0x80(%r8)	# %r8 = dst + n: unaligned stores cover the tail
>> + vmovups %ymm0, -0x60(%r8)
>> + vmovups %ymm0, -0x40(%r8)
>> + vmovups %ymm0, -0x20(%r8)
>> + vzeroupper
>> + sfence			# order the non-temporal stores before returning
>> + ret
>
> That loop does not seem to help on haswell at all; it is
> indistinguishable from the rep stosb loop above. I used the following
> benchmark to check that with different sizes, but performance stayed
> the same.
>
> #include <stdlib.h>
> #include <string.h>
> int main ()
> {
>   int i;
>   char *x = malloc (100000000);
>   for (i = 0; i < 100; i++)
>     MEMSET (x, 0, 100000000);
>   return 0;
> }
>
>
> for I in `seq 1 10`; do
> echo avx
> gcc -L. -DMEMSET=__memset_avx2 -lc_profile big.c
> time LD_LIBRARY_PATH=. ./a.out
> echo rep
> gcc -L. -DMEMSET=__memset_rep -lc_profile big.c
> time LD_LIBRARY_PATH=. ./a.out
> done
Ling: OK, I will test it thoroughly, then send out a new version.
Thanks!
Ling