This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH RFC] Imporve 64bit memset performance for Haswell CPU with AVX2 instruction


A correction to the following:

On Tue, May 13, 2014 at 07:36:16PM +0200, Ondřej Bílka wrote:
> > +	ALIGN(4)
> > +L(gobble_data):
> > +#ifdef SHARED_CACHE_SIZE_HALF
> > +	mov	$SHARED_CACHE_SIZE_HALF, %r9
> > +#else
> > +	mov	__x86_shared_cache_size_half(%rip), %r9
> > +#endif
> > +	shl	$4, %r9
> > +	cmp	%r9, %rdx
> > +	ja	L(gobble_big_data)
> > +	mov	%rax, %r9
> > +	mov	%esi, %eax
> > +	mov	%rdx, %rcx
> > +	rep	stosb
> > +	mov	%r9, %rax
> > +	vzeroupper
> > +	ret
> > +
> > +	ALIGN(4)
> > +L(gobble_big_data):
> > +	sub	$0x80, %rdx
> > +L(gobble_big_data_loop):
> > +	vmovntdq	%ymm0, (%rdi)
> > +	vmovntdq	%ymm0, 0x20(%rdi)
> > +	vmovntdq	%ymm0, 0x40(%rdi)
> > +	vmovntdq	%ymm0, 0x60(%rdi)
> > +	lea	0x80(%rdi), %rdi
> > +	sub	$0x80, %rdx
> > +	jae	L(gobble_big_data_loop)
> > +	vmovups	%ymm0, -0x80(%r8)
> > +	vmovups	%ymm0, -0x60(%r8)
> > +	vmovups	%ymm0, -0x40(%r8)
> > +	vmovups	%ymm0, -0x20(%r8)
> > +	vzeroupper
> > +	sfence
> > +	ret
> 
> That loop does not seem to help on Haswell at all. It is indistinguishable from
> the rep stosb loop above. I used the following benchmark to check that with
> different sizes, but performance stayed the same.
> 
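> #include <stdlib.h>
> #include <string.h>
>
> /* MEMSET is set on the command line, e.g. -DMEMSET=__memset_avx2.  */
> void *MEMSET (void *, int, size_t);
>
> int main (void)
> {
>   int i;
>   char *x = malloc (100000000);
>   if (x == NULL)
>     return 1;
>   for (i = 0; i < 100; i++)
>     MEMSET (x, 0, 100000000);
>   free (x);
>   return 0;
> }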
> 
> 
> for I in `seq 1 10`; do
> echo avx
> gcc -L. -DMEMSET=__memset_avx2 -lc_profile big.c
> time LD_LIBRARY_PATH=. ./a.out
> echo rep
> gcc -L. -DMEMSET=__memset_rep -lc_profile big.c
> time LD_LIBRARY_PATH=. ./a.out
> done
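
For readers following the quoted patch, the size check at L(gobble_data) can be
read as the C sketch below. It only restates the comparison done by the shl/cmp/ja
sequence; the function name is hypothetical, and the parameter stands in for
SHARED_CACHE_SIZE_HALF or __x86_shared_cache_size_half.

```c
#include <stdbool.h>
#include <stddef.h>

/* Mirrors "shl $4, %r9; cmp %r9, %rdx; ja L(gobble_big_data)".
   The function name is hypothetical; shared_cache_size_half stands in
   for the value the assembly loads into %r9.  */
bool
takes_nontemporal_path (size_t n, size_t shared_cache_size_half)
{
  /* The assembly shifts left by 4, i.e. multiplies by 16.  */
  size_t threshold = shared_cache_size_half << 4;

  /* "ja" is an unsigned greater-than, so equality stays on the
     rep stosb path.  */
  return n > threshold;
}
```

As an illustration only, with an assumed half-cache value of 4 MiB the threshold
is 64 MiB (67108864 bytes), so the 100000000-byte buffer in big.c would take the
vmovntdq loop. The real value depends on the machine.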

Sorry, I forgot that __memset_rep also has a branch for large inputs, so
what I wrote was wrong.

I retested with a fixed rep stosq, and your loop is around 20% slower on a
similar test, so it is better to remove that loop.

$ gcc big.c -o big
$ time LD_PRELOAD=./memset-avx2.so ./big

real    0m0.076s
user    0m0.066s
sys     0m0.010s

$ time LD_PRELOAD=./memset_rep.so ./big

real    0m0.063s
user    0m0.042s
sys     0m0.021s
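
A large-buffer timing loop of this kind could be sketched in C as below. This is
not the code from the benchmark tarball. The helper name time_fill and the use of
clock_gettime are assumptions made for illustration. The call goes through a
volatile function pointer so that the compiler cannot inline or drop the memset,
and the symbol is still resolved by the dynamic linker, so LD_PRELOAD can
replace it.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: fill a buffer of SIZE bytes ITERS times and
   return the elapsed wall-clock time in seconds, or -1.0 if the
   allocation fails.  */
double
time_fill (size_t size, int iters)
{
  /* Calling through a volatile pointer forces a real call to memset.  */
  void *(*volatile fill) (void *, int, size_t) = memset;
  struct timespec start, end;
  char *buf = malloc (size);

  if (buf == NULL)
    return -1.0;

  clock_gettime (CLOCK_MONOTONIC, &start);
  for (int i = 0; i < iters; i++)
    fill (buf, 0, size);
  clock_gettime (CLOCK_MONOTONIC, &end);

  free (buf);
  return (double) (end.tv_sec - start.tv_sec)
         + (double) (end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling time_fill (100000000, 100) from main and printing the result gives a
number comparable to the "real" line of time, minus process startup.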

To be sure, I used a different benchmark. It can be downloaded here; run
the commands above in its directory.

http://kam.mff.cuni.cz/~ondra/memset_consistency_benchmark.tar.bz2

To test a different implementation you need to create a .so that provides
the function memset. There is a script, compile, that compiles all .s files,
provided that the first line of each file has the form

# arch_requirement function_name color

