This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH RFC] Improve 64-bit memset performance for Haswell CPU with AVX2 instruction


2014-05-14 1:36 GMT+08:00, Ondřej Bílka <neleai@seznam.cz>:
>
>
> On Mon, Apr 07, 2014 at 01:57:18AM -0400, ling.ma.program@gmail.com wrote:
>> From: Ling Ma <ling.ml@alibaba-inc.com>
>>
>> In this patch we take advantage of HSW memory bandwidth, managing to
>> reduce branch mispredictions by avoiding branch instructions and
>> forcing the destination to be aligned with AVX instructions.
>>
> Now that we have a haswell machine in our department I tested this
> implementation. The benchmark used and the results are here.
>
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile_avx130514.tar.bz2
> http://kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile_avx.html
>
> This patch improves large inputs and does not regress
> small inputs much, which gives a total 10% improvement on the gcc test; it
> could be improved further, but it now looks good enough.
Ling: Thanks Ondra! You have given us many good suggestions and much encouragement.

> I tried two alternatives. The first is using avx2 in the header (memset_fuse).
> It looks like it helps, adding a further 0.5% of performance. However, when I
> tried to crosscheck this with a bash shell benchmark the comparison went in
> the opposite direction, so I am not entirely sure yet; see
>
> http://kam.mff.cuni.cz/~ondra/benchmark_string/haswell/memset_profile_avx/results_bash/result.html
>
Ling: Yes, we did the experiment on our 403.gcc (I list the download
address below); it slows performance between 0 and 256 bytes, although
another benchmark gave us good results, which is why I said it hurt
performance in my last email.

>
> Second is checking whether the rep threshold is the best one;
> this depends on the application cache layout, and I do not have a definite
> answer yet (memset_rep and memset_avx_v2 variants). When data is in the L2
> cache we could lower the threshold to 1024 bytes, but that slows real inputs
> for some reason.
>
Ling: Yes, because of this, I once tried to use a prefetch
instruction in the code, and it also hurt performance when the
data is in L1.
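The threshold logic under discussion (the quoted patch below compares the length against __x86_shared_cache_size_half shifted left by 4 before choosing rep stosb or the non-temporal path) can be sketched in C. This is an illustration only: the cache-size constant is an assumed example value and choose_path is an invented name, not glibc code.

```c
#include <stddef.h>

/* Assumed example value; the real figure comes from the CPU via
   __x86_shared_cache_size_half, not a compile-time constant.  */
static const size_t shared_cache_size_half = 4 * 1024 * 1024;

enum memset_path { PATH_REP_STOSB, PATH_NONTEMPORAL };

/* Mirror of the patch's dispatch: shift the half-cache size left by 4
   (i.e. 8x the shared cache size) and use non-temporal stores only for
   fills larger than that, since they would not fit in cache anyway.  */
static enum memset_path choose_path(size_t n)
{
    size_t nt_threshold = shared_cache_size_half << 4;
    return n > nt_threshold ? PATH_NONTEMPORAL : PATH_REP_STOSB;
}
```

Lowering nt_threshold, as discussed above, trades cache pollution for store bandwidth; the right cutover depends on whether the destination is about to be re-read.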
>
>> The CPU2006 403.gcc benchmark also indicates this patch improves
>> performance
>> by 22.9% to 59% compared with the original memset implemented with SSE2.
>>
> I inspected that benchmark with my profiler; it is not that good, as it is
> only a simple part of gcc and two thirds of the total time is spent on
> inputs 240 bytes long.
Ling: Please download www.yunos.org/tmp/test.memcpy.memset.zip, which
includes our whole benchmark, readme.txt and result.xls. We can run and
check it.

>
> A large part of the speedup could be explained by the avx2 implementation
> having a special-case branch for the 128-256 byte range where the current
> one uses a loop. These size distributions differ from other programs and
> from running gcc itself, where short inputs are more common.
>
>
>> +	ALIGN(4)
>> +L(gobble_data):
>> +#ifdef SHARED_CACHE_SIZE_HALF
>> +	mov	$SHARED_CACHE_SIZE_HALF, %r9
>> +#else
>> +	mov	__x86_shared_cache_size_half(%rip), %r9
>> +#endif
>> +	shl	$4, %r9
>> +	cmp	%r9, %rdx
>> +	ja	L(gobble_big_data)
>> +	mov	%rax, %r9
>> +	mov	%esi, %eax
>> +	mov	%rdx, %rcx
>> +	rep	stosb
>> +	mov	%r9, %rax
>> +	vzeroupper
>> +	ret
>> +
>> +	ALIGN(4)
>> +L(gobble_big_data):
>> +	sub	$0x80, %rdx
>> +L(gobble_big_data_loop):
>> +	vmovntdq	%ymm0, (%rdi)
>> +	vmovntdq	%ymm0, 0x20(%rdi)
>> +	vmovntdq	%ymm0, 0x40(%rdi)
>> +	vmovntdq	%ymm0, 0x60(%rdi)
>> +	lea	0x80(%rdi), %rdi
>> +	sub	$0x80, %rdx
>> +	jae	L(gobble_big_data_loop)
>> +	vmovups	%ymm0, -0x80(%r8)
>> +	vmovups	%ymm0, -0x60(%r8)
>> +	vmovups	%ymm0, -0x40(%r8)
>> +	vmovups	%ymm0, -0x20(%r8)
>> +	vzeroupper
>> +	sfence
>> +	ret
>
> That loop does not seem to help on haswell at all; it is indistinguishable
> from the rep stosb loop above. I used the following benchmark to check that
> with different sizes, but performance stayed the same.
>
> #include <stdlib.h>
> #include <string.h>
> int main(){
>  int i;
>  char *x=malloc(100000000);
>   for (i=0;i<100;i++)
>    MEMSET(x,0,100000000);
>
> }
>
>
> for I in `seq 1 10`; do
> echo avx
> gcc -L. -DMEMSET=__memset_avx2 -lc_profile big.c
> time LD_LIBRARY_PATH=. ./a.out
> echo rep
> gcc -L. -DMEMSET=__memset_rep -lc_profile big.c
> time LD_LIBRARY_PATH=. ./a.out
> done

Ling: OK, I will test it carefully, then send out a new version.

Thanks!
Ling

