Re: [PATCH RFC 2/2 V3] Improve 64bit memset for Corei7 with avx2 instruction
- From: Ling Ma <ling dot ma dot program at gmail dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: libc-alpha at sourceware dot org, aj at suse dot com, liubov dot dmitrieva at gmail dot com, Ma Ling <ling dot ml at alibaba-inc dot com>
- Date: Tue, 30 Jul 2013 10:08:48 +0800
- Subject: Re: [PATCH RFC 2/2 V3] Improve 64bit memset for Corei7 with avx2 instruction
- References: <1375090922-8418-1-git-send-email-ling dot ma dot program at gmail dot com> <20130729171927 dot GA12218 at domone dot kolej dot mff dot cuni dot cz>
2013/7/30, Ondřej Bílka <neleai@seznam.cz>:
>
> On Mon, Jul 29, 2013 at 05:42:02AM -0400, ling.ma.program@gmail.com wrote:
>> From: Ma Ling <ling.ml@alibaba-inc.com>
>> +ENTRY (MEMSET)
>> + vpxor %xmm0, %xmm0, %xmm0
>> + vmovd %esi, %xmm1
>> + lea (%rdi, %rdx), %r8
>> + vpshufb %xmm0, %xmm1, %xmm0
>> + mov %rdi, %rax
>> + cmp $256, %rdx
>> + jae L(256bytesormore)
>> + xor %ecx, %ecx
>> + mov %sil, %cl
>> + mov %cl, %ch
> What is this for? You do not need that data and it could slow memset
> down for the 64-128 byte range.
Ling: for 64-128 bytes memset works entirely on store operations, so the
code runs with parallelism (the stores are independent of one another).
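
As a minimal sketch (my own illustration, not the patch code) of that
parallel-store idea, using 32-byte AVX stores for brevity: the first two
stores cover the head, the last two cover the tail, and for lengths below
128 they simply overlap, so none of the stores depends on another.

    # dst in %rdi, length in %rdx (assumed 64 <= len <= 128),
    # %ymm0 holds the broadcast byte pattern (e.g. from vpshufb).
    vmovups %ymm0, (%rdi)             # bytes [0, 32)
    vmovups %ymm0, 0x20(%rdi)         # bytes [32, 64)
    vmovups %ymm0, -0x40(%rdi, %rdx)  # bytes [len-64, len-32)
    vmovups %ymm0, -0x20(%rdi, %rdx)  # bytes [len-32, len)
    vzeroupper
    ret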
>> + cmp $128, %rdx
>> + jb L(less_128bytes)
> ...
>> +L(less_128bytes):
>> + xor %esi, %esi
>> + mov %ecx, %esi
> And this? A C equivalent of this is
> x = 0;
> x = y;
Ling: we used mov %sil, %cl in the code above, and now %esi becomes the
destination register (mov %ecx, %esi), so there is a false-dependence
hazard. We use xor r1, r1 to tell the decode stage to break the
dependence; inside the pipeline the xor r1, r1 is removed before it
enters the execution stage.
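
To make that concrete, here is a hedged two-instruction sketch (mine, not
from the patch) of the idiom Ling describes; that xor reg, reg is a
recognized zeroing idiom eliminated at rename is documented for recent
Intel cores, while whether it is needed at this particular spot is the
point being discussed.

    xor %esi, %esi    # zeroing idiom: eliminated at register rename,
                      # breaks any dependence on the old value of %rsi
    mov %ecx, %esi    # intended effect: this move need not wait on an
                      # earlier (partial) write to %rsi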
> Having elementary errors like this does not inspire a lot of confidence.
>> + shl $16, %ecx
>> + cmp $64, %edx
>> + jb L(less_64bytes)
>> +L(less_64bytes):
>> + orl %esi, %ecx
>> + mov %ecx, %esi
>> + cmp $32, %edx
>> + jb L(less_32bytes)
> ...
>> +L(less_32bytes):
>> + shl $32, %rcx
>> + cmp $16, %edx
>> + jb L(less_16bytes)
>> +L(less_16bytes):
>> + or %rsi, %rcx
>> + cmp $8, %edx
>> + jb L(less_8bytes)
>> + mov %rcx, (%rdi)
>> + mov %rcx, -0x08(%r8)
>> + ret
>> + ALIGN(4)
> ...
>> +L(gobble_data):
>> +#ifdef SHARED_CACHE_SIZE_HALF
>> + mov $SHARED_CACHE_SIZE_HALF, %r9
>> +#else
>> + mov __x86_shared_cache_size_half(%rip), %r9
>> +#endif
>> + shl $4, %r9
>> + cmp %r9, %rdx
>> + ja L(gobble_big_data)
>> + mov %rax, %r9
>> + mov %esi, %eax
>> + mov %rdx, %rcx
>> + rep stosb
>> + mov %r9, %rax
>> + vzeroupper
>> + ret
>> +
> Redundant vzeroupper.
Ling: we touched ymm0 before we reach that place:
+ vinserti128 $1, %xmm0, %ymm0, %ymm0
+ vmovups %ymm0, (%rdi)
so we have to clean up the upper half of ymm0; otherwise the following
xmm0 operations would be impacted by the SAVE (AVX-SSE transition)
penalty.
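
For context, a short sketch (my own, describing pre-AVX-512 Intel
behaviour as I understand it) of the transition being avoided: once the
upper half of a ymm register is dirty, later legacy (non-VEX) xmm
instructions incur a state-save/restore penalty unless vzeroupper marks
the upper halves clean first.

    vinserti128 $1, %xmm0, %ymm0, %ymm0  # upper 128 bits of ymm0 now dirty
    vmovups     %ymm0, (%rdi)            # 32-byte store using full ymm0
    vzeroupper                           # declare upper halves clean, so
                                         # later legacy SSE (xmm) code pays
                                         # no AVX-SSE transition penalty
    ret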
Thanks
Ling