[PATCH RFC V2] Improve 64-bit memcpy/memmove for Core i7 with unaligned AVX instructions

Ondřej Bílka neleai@seznam.cz
Fri Jul 12 04:36:00 GMT 2013


On Thu, Jul 11, 2013 at 08:51:36AM -0400, ling.ma.program@gmail.com wrote:
> From: Ma Ling <ling.ml@alibaba-inc.com>
> 
> +L(gobble_big_data_bwd):
> +	sub	$0x80, %rdx
> +L(gobble_mem_bwd_loop):
> +	prefetcht0 -0x1c0(%rsi)
> +	prefetcht0 -0x280(%rsi)
> +	vmovups	-0x10(%rsi), %xmm0
> +	vmovups	-0x20(%rsi), %xmm1
> +	vmovups	-0x30(%rsi), %xmm2
> +	vmovups	-0x40(%rsi), %xmm3
> +	vmovntdq	%xmm0, -0x10(%rdi)
> +	vmovntdq	%xmm1, -0x20(%rdi)
> +	vmovntdq	%xmm2, -0x30(%rdi)
> +	vmovntdq	%xmm3, -0x40(%rdi)
> +	vmovups	-0x50(%rsi), %xmm0
> +	vmovups	-0x60(%rsi), %xmm1
> +	vmovups	-0x70(%rsi), %xmm2
> +	vmovups	-0x80(%rsi), %xmm3
> +	lea	-0x80(%rsi), %rsi
> +	vmovntdq	%xmm0, -0x50(%rdi)
> +	vmovntdq	%xmm1, -0x60(%rdi)
> +	vmovntdq	%xmm2, -0x70(%rdi)
> +	vmovntdq	%xmm3, -0x80(%rdi)
> +	lea	-0x80(%rdi), %rdi
> +	sub	$0x80, %rdx
> +	jae	L(gobble_mem_bwd_loop)
> +	sfence

Wait, you are prefetching the source into cache and then using nontemporal stores? These aims are contradictory. If you want the best memcpy performance, do not use nontemporal stores; and when we do not want to trash the cache, we do not prefetch, we use nontemporal loads for the reads instead.
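To make the distinction concrete, here is a rough sketch of the two strategies in C with SSE intrinsics rather than the glibc assembly. The function names and the fixed 16-byte step are only illustrative, and the sketch assumes n is a multiple of 16 and the pointers are suitably aligned where the instructions require it:

#include <immintrin.h>
#include <stddef.h>

/* Strategy A: ordinary loads and stores.  The copied data ends up in
   cache, which is what you want when the destination is read again
   soon; this is the fast path for ordinary memcpy sizes.  */
static void
copy_temporal (char *dst, const char *src, size_t n)
{
  for (size_t i = 0; i < n; i += 16)
    {
      __m128i v = _mm_loadu_si128 ((const __m128i *) (src + i));
      _mm_storeu_si128 ((__m128i *) (dst + i), v);
    }
}

/* Strategy B: for very large copies where we do not want to trash the
   cache, keep the cache out of it on both sides: MOVNTDQA loads
   (SSE4.1, 16-byte-aligned source, the nontemporal hint has full
   effect only on WC memory) and MOVNTDQ streaming stores (16-byte
   aligned destination), with no prefetcht0, since prefetching would
   pull the lines into cache and defeat the purpose.  */
static void
copy_nontemporal (char *dst, const char *src, size_t n)
{
  for (size_t i = 0; i < n; i += 16)
    {
      __m128i v = _mm_stream_load_si128 ((__m128i *) (src + i));
      _mm_stream_si128 ((__m128i *) (dst + i), v);
    }
  _mm_sfence ();	/* Order the streaming stores.  */
}

Mixing the two, prefetcht0 on the source plus vmovntdq on the destination as the patch does, pays the cache-fill cost of the prefetches while giving up the cache benefit on the stores.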

Also, the following code does not use AVX. Is that intentional, or could using it improve performance?
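For reference, a 256-bit variant of the streaming loop could look like the sketch below. This is again only an illustrative C sketch with a hypothetical function name, assuming a 32-byte-aligned destination and n a multiple of 32; whether the wider registers actually help on this microarchitecture is exactly the question:

#include <immintrin.h>
#include <stddef.h>

/* Same streaming-store idea widened to ymm registers: unaligned
   VMOVDQU loads plus VMOVNTDQ ymm stores.  */
static void
copy_nontemporal_avx (char *dst, const char *src, size_t n)
{
  for (size_t i = 0; i < n; i += 32)
    {
      __m256i v = _mm256_loadu_si256 ((const __m256i *) (src + i));
      _mm256_stream_si256 ((__m256i *) (dst + i), v);
    }
  _mm256_zeroupper ();	/* Avoid AVX/SSE transition penalties.  */
  _mm_sfence ();
}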


