This is the mail archive of the mailing list for the glibc project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH RFC] Improve 64bit memset performance for Haswell CPU with AVX2 instruction

On Mon, Apr 07, 2014 at 01:57:18AM -0400, wrote:
> From: Ling Ma <>
> In this patch we take advantage of HSW memory bandwidth, managing to
> reduce branch mispredictions by avoiding branch instructions and
> forcing the destination to be aligned with AVX instructions.
Now that we have a Haswell machine in our department, I tested this
implementation. The benchmark used and the results are here.

This patch improves large inputs and does not regress small inputs
much, which gives a total 10% improvement on the gcc test. It could be
improved further, but it now looks good enough.

I tried two alternatives. The first is using avx2 in the header
(memset_fuse). It looks like it helps, adding an additional 0.5% of
performance. However, I tried to cross-check this with the bash shell,
where the comparison goes in the opposite direction, so I am not
entirely sure yet, see

The second is checking whether the rep threshold is the best one; this
depends on the application cache layout and I do not have a definite
answer yet (memset_rep and memset_avx_v2 variants). When data is in the
L2 cache we could lower the threshold to 1024 bytes, but it slows down
real inputs for some reason.

> The CPU2006 403.gcc benchmark also indicates this patch improves performance
> by 22.9% to 59% compared with the original memset implemented with sse2.
I inspected that benchmark with my profiler; it is not that good, as it
covers only a simple part of gcc, and two thirds of the total time is
spent on inputs 240 bytes long.

A large part of the speedup could be explained by the avx2
implementation having a special-case branch for the 128-256 byte range,
while the current one uses a loop. These size distributions differ from
other programs, and from running gcc itself, where short inputs are
more common.

> +	ALIGN(4)
> +L(gobble_data):
> +#else
> +	mov	__x86_shared_cache_size_half(%rip), %r9
> +#endif
> +	shl	$4, %r9
> +	cmp	%r9, %rdx
> +	ja	L(gobble_big_data)
> +	mov	%rax, %r9
> +	mov	%esi, %eax
> +	mov	%rdx, %rcx
> +	rep	stosb
> +	mov	%r9, %rax
> +	vzeroupper
> +	ret
> +
> +	ALIGN(4)
> +L(gobble_big_data):
> +	sub	$0x80, %rdx
> +L(gobble_big_data_loop):
> +	vmovntdq	%ymm0, (%rdi)
> +	vmovntdq	%ymm0, 0x20(%rdi)
> +	vmovntdq	%ymm0, 0x40(%rdi)
> +	vmovntdq	%ymm0, 0x60(%rdi)
> +	lea	0x80(%rdi), %rdi
> +	sub	$0x80, %rdx
> +	jae	L(gobble_big_data_loop)
> +	vmovups	%ymm0, -0x80(%r8)
> +	vmovups	%ymm0, -0x60(%r8)
> +	vmovups	%ymm0, -0x40(%r8)
> +	vmovups	%ymm0, -0x20(%r8)
> +	vzeroupper
> +	sfence
> +	ret

That loop does not seem to help on Haswell at all; it is
indistinguishable from the rep stosb path above. I used the following
benchmark to check that with different sizes, but the performance
stayed the same.

#include <stdlib.h>
#include <string.h>
/* MEMSET is supplied on the gcc command line below,
   e.g. -DMEMSET=__memset_avx2.  */
void *MEMSET (void *, int, size_t);
int main ()
{
  int i;
  char *x = malloc (100000000);
  for (i = 0; i < 100; i++)
    MEMSET (x, 0, 100000000);
  free (x);
  return 0;
}


for I in `seq 1 10`; do
echo avx
gcc -L. -DMEMSET=__memset_avx2 -lc_profile big.c
time LD_LIBRARY_PATH=. ./a.out
echo rep
gcc -L. -DMEMSET=__memset_rep -lc_profile big.c
time LD_LIBRARY_PATH=. ./a.out
done
