
Re: [PATCH RFC 2/2 V3] Improve 64bit memset for Corei7 with avx2 instruction


2013/7/30, Ondřej Bílka <neleai@seznam.cz>:
>
> On Mon, Jul 29, 2013 at 05:42:02AM -0400, ling.ma.program@gmail.com wrote:
>> From: Ma Ling <ling.ml@alibaba-inc.com>
>> +ENTRY (MEMSET)
>> +	vpxor	%xmm0, %xmm0, %xmm0
>> +	vmovd %esi, %xmm1
>> +	lea	(%rdi, %rdx), %r8
>> +	vpshufb	%xmm0, %xmm1, %xmm0
>> +	mov	%rdi, %rax
>> +	cmp	$256, %rdx
>> +	jae	L(256bytesormore)
>> +	xor	%ecx, %ecx
>> +	mov %sil, %cl
>> +	mov %cl, %ch
> What should this be? You do not need that data, and it could slow memset
> down for the 64-128 byte range.
Ling: for 64-128 bytes memset works entirely on store operations, so the
code runs in parallel.
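
For reference, a minimal sketch of that idea (not the elided patch code
itself), assuming %xmm0 already holds the broadcast byte from the vpshufb
above and %r8 is the end pointer from the lea; for 64 <= n <= 128 the head
and tail stores together cover the whole range and are all independent of
each other:

	/* head: first 64 bytes */
	vmovups	%xmm0, (%rdi)
	vmovups	%xmm0, 0x10(%rdi)
	vmovups	%xmm0, 0x20(%rdi)
	vmovups	%xmm0, 0x30(%rdi)
	/* tail: last 64 bytes, overlapping the head when n < 128 */
	vmovups	%xmm0, -0x40(%r8)
	vmovups	%xmm0, -0x30(%r8)
	vmovups	%xmm0, -0x20(%r8)
	vmovups	%xmm0, -0x10(%r8)
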
>> +	cmp	$128, %rdx
>> +	jb	L(less_128bytes)
> ...
>> +L(less_128bytes):
>> +	xor	%esi, %esi
>> +	mov	%ecx, %esi
> And this? A C equivalent of this is
> x = 0;
> x = y;
Ling: we used mov %sil, %cl in the code above; now %esi becomes the
destination register (mov %ecx, %esi), so there is a false dependence
hazard. We use xor r1, r1 to ask the decode stage to break the dependence;
inside the pipeline the xor r1, r1 is removed before it enters the
execution stage.
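
The pattern being described is the xor zeroing idiom, roughly (whether the
extra xor is actually needed here is the point under discussion):

	xor	%esi, %esi	/* zeroing idiom: recognized at decode/rename and
				   eliminated before execution; it also breaks any
				   dependence on the previous value of %rsi */
	mov	%ecx, %esi	/* the copy the reply refers to */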

> Having elementary errors like this does not inspire a lot of confidence.
>> +	shl	$16, %ecx
>> +	cmp	$64, %edx
>> +	jb	L(less_64bytes)
>> +L(less_64bytes):
>> +	orl	%esi, %ecx
>> +	mov	%ecx, %esi
>> +	cmp	$32, %edx
>> +	jb	L(less_32bytes)
> ...
>> +L(less_32bytes):
>> +	shl	$32, %rcx
>> +	cmp	$16, %edx
>> +	jb	L(less_16bytes)
>> +L(less_16bytes):
>> +	or	%rsi, %rcx
>> +	cmp	$8, %edx
>> +	jb	L(less_8bytes)
>> +	mov %rcx, (%rdi)
>> +	mov %rcx, -0x08(%r8)
>> +	ret
>> +	ALIGN(4)
> ...
>> +L(gobble_data):
>> +#ifdef SHARED_CACHE_SIZE_HALF
>> +	mov	$SHARED_CACHE_SIZE_HALF, %r9
>> +#else
>> +	mov	__x86_shared_cache_size_half(%rip), %r9
>> +#endif
>> +	shl	$4, %r9
>> +	cmp	%r9, %rdx
>> +	ja	L(gobble_big_data)
>> +	mov	%rax, %r9
>> +	mov	%esi, %eax
>> +	mov	%rdx, %rcx
>> +	rep	stosb
>> +	mov	%r9, %rax
>> +	vzeroupper
>> +	ret
>> +
> Redundant vzeroupper.
Ling: we touch a ymm0 operation before we reach this place:
+	vinserti128 $1, %xmm0, %ymm0, %ymm0
+	vmovups	%ymm0, (%rdi)
so we have to clean up the upper parts of ymm0; otherwise the following
xmm0 operations would be impacted by the AVX-SSE transition (state save)
penalty.
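
In other words, the usual pattern: once the upper halves of the ymm
registers have been dirtied by a 256-bit operation, issue vzeroupper
before any path that falls back to legacy SSE (xmm) code, along the
lines of:

	vinserti128 $1, %xmm0, %ymm0, %ymm0	/* upper half of %ymm0 is now dirty */
	vmovups	%ymm0, (%rdi)
	/* ... more stores ... */
	vzeroupper	/* reset the upper halves so later legacy SSE code
			   does not pay the AVX-SSE transition penalty */
	ret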

Thanks
Ling

