Re: [PATCH RFC 2/2 V3] Improve 64bit memset for Corei7 with avx2 instruction
- From: Ling Ma <ling dot ma dot program at gmail dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: libc-alpha at sourceware dot org, aj at suse dot com, liubov dot dmitrieva at gmail dot com, Ma Ling <ling dot ml at alibaba-inc dot com>
- Date: Tue, 30 Jul 2013 10:08:48 +0800
- Subject: Re: [PATCH RFC 2/2 V3] Improve 64bit memset for Corei7 with avx2 instruction
- References: <1375090922-8418-1-git-send-email-ling dot ma dot program at gmail dot com> <20130729171927 dot GA12218 at domone dot kolej dot mff dot cuni dot cz>
2013/7/30, Ondřej Bílka <neleai@seznam.cz>:
>
> On Mon, Jul 29, 2013 at 05:42:02AM -0400, ling.ma.program@gmail.com wrote:
>> From: Ma Ling <ling.ml@alibaba-inc.com>
>> +ENTRY (MEMSET)
>> + vpxor %xmm0, %xmm0, %xmm0
>> + vmovd %esi, %xmm1
>> + lea (%rdi, %rdx), %r8
>> + vpshufb %xmm0, %xmm1, %xmm0
>> + mov %rdi, %rax
>> + cmp $256, %rdx
>> + jae L(256bytesormore)
>> + xor %ecx, %ecx
>> + mov %sil, %cl
>> + mov %cl, %ch
> What is this for? You do not need that data and it could slow memset
> down for the 64-128 byte range.
Ling: for 64-128 bytes memset works entirely on store operations, so the
code runs with parallelism (the stores are independent of one another).
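
As a minimal sketch (my own illustration, not the patch code) of that
parallel-store idea, using 32-byte AVX stores for brevity: the first two
stores cover the head, the last two cover the tail, and for lengths below
128 they simply overlap, so none of the stores depends on another.

    # dst in %rdi, length in %rdx (assumed 64 <= len <= 128),
    # %ymm0 holds the broadcast byte pattern (e.g. from vpshufb).
    vmovups %ymm0, (%rdi)             # bytes [0, 32)
    vmovups %ymm0, 0x20(%rdi)         # bytes [32, 64)
    vmovups %ymm0, -0x40(%rdi, %rdx)  # bytes [len-64, len-32)
    vmovups %ymm0, -0x20(%rdi, %rdx)  # bytes [len-32, len)
    vzeroupper
    ret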
>> + cmp $128, %rdx
>> + jb L(less_128bytes)
> ...
>> +L(less_128bytes):
>> + xor %esi, %esi
>> + mov %ecx, %esi
> And this? A C equivalent of this is
> x = 0;
> x = y;
Ling: we used mov %sil, %cl in the code above, and now %esi becomes the
destination register (mov %ecx, %esi), so there is a false-dependence
hazard. We use xor r1, r1 to tell the decode stage to break the
dependence; inside the pipeline the xor r1, r1 is removed before it
enters the execution stage.
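
To make that concrete, here is a hedged two-instruction sketch (mine, not
from the patch) of the idiom Ling describes; that xor reg, reg is a
recognized zeroing idiom eliminated at rename is documented for recent
Intel cores, while whether it is needed at this particular spot is the
point being discussed.

    xor %esi, %esi    # zeroing idiom: eliminated at register rename,
                      # breaks any dependence on the old value of %rsi
    mov %ecx, %esi    # intended effect: this move need not wait on an
                      # earlier (partial) write to %rsi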
> Having elementary errors like this does not inspire a lot of confidence.
>> + shl $16, %ecx
>> + cmp $64, %edx
>> + jb L(less_64bytes)
>> +L(less_64bytes):
>> + orl %esi, %ecx
>> + mov %ecx, %esi
>> + cmp $32, %edx
>> + jb L(less_32bytes)
> ...
>> +L(less_32bytes):
>> + shl $32, %rcx
>> + cmp $16, %edx
>> + jb L(less_16bytes)
>> +L(less_16bytes):
>> + or %rsi, %rcx
>> + cmp $8, %edx
>> + jb L(less_8bytes)
>> + mov %rcx, (%rdi)
>> + mov %rcx, -0x08(%r8)
>> + ret
>> + ALIGN(4)
> ...
>> +L(gobble_data):
>> +#ifdef SHARED_CACHE_SIZE_HALF
>> + mov $SHARED_CACHE_SIZE_HALF, %r9
>> +#else
>> + mov __x86_shared_cache_size_half(%rip), %r9
>> +#endif
>> + shl $4, %r9
>> + cmp %r9, %rdx
>> + ja L(gobble_big_data)
>> + mov %rax, %r9
>> + mov %esi, %eax
>> + mov %rdx, %rcx
>> + rep stosb
>> + mov %r9, %rax
>> + vzeroupper
>> + ret
>> +
> Redundant vzeroupper.
Ling: we touched ymm0 before we reach that place:
+ vinserti128 $1, %xmm0, %ymm0, %ymm0
+ vmovups %ymm0, (%rdi)
so we have to clean up the upper half of ymm0; otherwise the following
xmm0 operations would be impacted by the SAVE (AVX-SSE transition)
penalty.
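
For context, a short sketch (my own, describing pre-AVX-512 Intel
behaviour as I understand it) of the transition being avoided: once the
upper half of a ymm register is dirty, later legacy (non-VEX) xmm
instructions incur a state-save/restore penalty unless vzeroupper marks
the upper halves clean first.

    vinserti128 $1, %xmm0, %ymm0, %ymm0  # upper 128 bits of ymm0 now dirty
    vmovups     %ymm0, (%rdi)            # 32-byte store using full ymm0
    vzeroupper                           # declare upper halves clean, so
                                         # later legacy SSE (xmm) code pays
                                         # no AVX-SSE transition penalty
    ret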
Thanks
Ling