Re: [PATCH RFC 2/2 V3] Improve 64bit memset for Corei7 with avx2 instruction
- From: Ondřej Bílka <neleai@seznam.cz>
- To: Ling Ma <ling.ma.program@gmail.com>
- Cc: libc-alpha@sourceware.org, aj@suse.com, liubov.dmitrieva@gmail.com, Ma Ling <ling.ml@alibaba-inc.com>
- Date: Tue, 30 Jul 2013 06:49:25 +0200
- Subject: Re: [PATCH RFC 2/2 V3] Improve 64bit memset for Corei7 with avx2 instruction
- References: <1375090922-8418-1-git-send-email-ling.ma.program@gmail.com> <20130729171927.GA12218@domone.kolej.mff.cuni.cz> <CAOGi=dNY9KP_OdGNW79iLiCHu4L=8fCNFg=ZZpMiRFN0CHJZ1g@mail.gmail.com>
On Tue, Jul 30, 2013 at 10:08:48AM +0800, Ling Ma wrote:
> 2013/7/30, Ondřej Bílka <neleai@seznam.cz>:
> >
> > On Mon, Jul 29, 2013 at 05:42:02AM -0400, ling.ma.program@gmail.com wrote:
> >> From: Ma Ling <ling.ml@alibaba-inc.com>
> >> +ENTRY (MEMSET)
> >> + vpxor %xmm0, %xmm0, %xmm0
> >> + vmovd %esi, %xmm1
> >> + lea (%rdi, %rdx), %r8
> >> + vpshufb %xmm0, %xmm1, %xmm0
> >> + mov %rdi, %rax
> >> + cmp $256, %rdx
> >> + jae L(256bytesormore)
> >> + xor %ecx, %ecx
> >> + mov %sil, %cl
> >> + mov %cl, %ch
> > What is this supposed to be? You do not need that data, and it could
> > slow memset down for the 64-128 byte range.
> Ling: memset for 64-128 bytes works entirely on store operations, so
> the code will run with parallelism.
Really?
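For readers following the thread: the questioned sequence builds a 2-byte
fill pattern in %cx. A minimal C sketch of the equivalent computation
(the function name is illustrative, not from the patch):

/* Equivalent of "xor %ecx,%ecx; mov %sil,%cl; mov %cl,%ch":
   replicate the fill byte into both halves of a 16-bit value.  */
static unsigned short
replicate2 (unsigned char c)
{
  return (unsigned short) (c | (c << 8));
}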
> >> + cmp $128, %rdx
> >> + jb L(less_128bytes)
> > ...
> >> +L(less_128bytes):
> >> + xor %esi, %esi
> >> + mov %ecx, %esi
> > And this? A C equivalent of this is
> > x = 0;
> > x = y;
> Ling: we used mov %sil, %cl in the code above; now %esi becomes the
> destination register (mov %ecx, %esi), so there is a false-dependence
> hazard. We use xor r1, r1 to ask the decode stage to break the
> dependence; inside the pipeline the xor r1, r1 will be removed before
> entering the execution stage.
>
That is pointless, as mov breaks false dependencies anyway. In any case
the code you use is redundant. You already have that value computed, so
a simple vmovq %xmm0, %rcx will do the job.
> >> + ja L(gobble_big_data)
> >> + mov %rax, %r9
> >> + mov %esi, %eax
> >> + mov %rdx, %rcx
> >> + rep stosb
> >> + mov %r9, %rax
> >> + vzeroupper
> >> + ret
> >> +
> > Redundant vzeroupper.
> Ling: we touched ymm0 before we reach this place:
> + vinserti128 $1, %xmm0, %ymm0, %ymm0
> + vmovups %ymm0, (%rdi)
> so we have to clean up the upper part of ymm0; otherwise the following
> xmm0 operations would be hit by the AVX-SSE transition (state save)
> penalty.
>
You do not need that. The relevant code is:
+L(256bytesormore):
+ vinserti128 $1, %xmm0, %ymm0, %ymm0
+ vmovups %ymm0, (%rdi)
+ mov %rdi, %r9
+ and $-0x20, %rdi
+ add $32, %rdi
+ sub %rdi, %r9
+ add %r9, %rdx
+ cmp $4096, %rdx
+ ja L(gobble_data)
A simple reshuffling avoids that:
+ cmp $4096, %rdx
+ ja L(gobble_data)
+ vinserti128 $1, %xmm0, %ymm0, %ymm0
+ vmovups %ymm0, (%rdi)
+ mov %rdi, %r9
+ and $-0x20, %rdi
+ add $32, %rdi
+ sub %rdi, %r9
+ add %r9, %rdx
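For readers: the %r9/%rdi arithmetic above rounds the destination up to
the next 32-byte boundary and shrinks the remaining length by the bytes
the unaligned head store already covered. A C sketch of that
bookkeeping (function and variable names are illustrative):

#include <stddef.h>
#include <stdint.h>

/* Round *DST up to the next 32-byte boundary; the first 32 bytes were
   already written by the unaligned vmovups, so drop the overlap from
   the remaining length.  */
static void
align_head (unsigned char **dst, size_t *len)
{
  unsigned char *p = *dst;
  unsigned char *aligned
    = (unsigned char *) (((uintptr_t) p & ~(uintptr_t) 31) + 32);

  *len -= (size_t) (aligned - p);
  *dst = aligned;
}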
> Thanks
> Ling