This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH RFC 2/2 V3] Improve 64bit memset for Corei7 with avx2 instruction
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: ling dot ma dot program at gmail dot com
- Cc: libc-alpha at sourceware dot org, aj at suse dot com, liubov dot dmitrieva at gmail dot com, Ma Ling <ling dot ml at alibaba-inc dot com>
- Date: Mon, 29 Jul 2013 19:19:27 +0200
- Subject: Re: [PATCH RFC 2/2 V3] Improve 64bit memset for Corei7 with avx2 instruction
- References: <1375090922-8418-1-git-send-email-ling dot ma dot program at gmail dot com>
On Mon, Jul 29, 2013 at 05:42:02AM -0400, ling.ma.program@gmail.com wrote:
> From: Ma Ling <ling.ml@alibaba-inc.com>
> +ENTRY (MEMSET)
> + vpxor %xmm0, %xmm0, %xmm0
> + vmovd %esi, %xmm1
> + lea (%rdi, %rdx), %r8
> + vpshufb %xmm0, %xmm1, %xmm0
> + mov %rdi, %rax
> + cmp $256, %rdx
> + jae L(256bytesormore)
> + xor %ecx, %ecx
> + mov %sil, %cl
> + mov %cl, %ch
What is this for? You do not need this data, and it could slow memset
down in the 64-128 byte range.
...
> + cmp $128, %rdx
> + jb L(less_128bytes)
...
> +L(less_128bytes):
> + xor %esi, %esi
> + mov %ecx, %esi
And this? The C equivalent is
    x = 0;
    x = y;
which is clearly redundant: the mov overwrites %esi, so the xor is dead.
Elementary errors like this do not inspire a lot of confidence.
> + shl $16, %ecx
> + cmp $64, %edx
> + jb L(less_64bytes)
> +L(less_64bytes):
> + orl %esi, %ecx
> + mov %ecx, %esi
> + cmp $32, %edx
> + jb L(less_32bytes)
...
> +L(less_32bytes):
> + shl $32, %rcx
> + cmp $16, %edx
> + jb L(less_16bytes)
> +L(less_16bytes):
> + or %rsi, %rcx
> + cmp $8, %edx
> + jb L(less_8bytes)
> + mov %rcx, (%rdi)
> + mov %rcx, -0x08(%r8)
> + ret
> + ALIGN(4)
...
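For readers following the quoted small-size path: the mov/shl/or chain
replicates the fill byte into all eight bytes of %rcx, and the 8..15 byte
case is then covered by two 8-byte stores that may overlap, one at the
start and one ending at the last byte. Below is a minimal C sketch of that
idea. The names broadcast_byte and set_8_to_16 are made up for
illustration, and the multiply is a swapped-in way to get the same
pattern, not what the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Replicate the fill byte into all 8 bytes of a 64-bit word.
   Produces the same value as the mov/shl/or chain in the quoted patch,
   but uses a multiply instead (an assumption of this sketch). */
uint64_t broadcast_byte(unsigned char c)
{
    return (uint64_t)c * UINT64_C(0x0101010101010101);
}

/* Fill n bytes, 8 <= n <= 16, using two 8-byte stores that may overlap.
   Hypothetical helper; the patch reaches this case with 8 <= n < 16. */
void *set_8_to_16(void *dst, int c, size_t n)
{
    assert(n >= 8 && n <= 16);
    uint64_t pattern = broadcast_byte((unsigned char)c);
    unsigned char *p = dst;

    memcpy(p, &pattern, sizeof pattern);         /* mov %rcx, (%rdi)     */
    memcpy(p + n - 8, &pattern, sizeof pattern); /* mov %rcx, -0x08(%r8) */
    return dst;
}
```

The overlap is what removes the need for a byte-by-byte tail loop: every
length in the range is handled by the same two stores, with no branch on
the exact size.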
> +L(gobble_data):
> +#ifdef SHARED_CACHE_SIZE_HALF
> + mov $SHARED_CACHE_SIZE_HALF, %r9
> +#else
> + mov __x86_shared_cache_size_half(%rip), %r9
> +#endif
> + shl $4, %r9
> + cmp %r9, %rdx
> + ja L(gobble_big_data)
> + mov %rax, %r9
> + mov %esi, %eax
> + mov %rdx, %rcx
> + rep stosb
> + mov %r9, %rax
> + vzeroupper
> + ret
> +
Redundant vzeroupper.
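
As a reading aid for the size check at the top of the quoted
L(gobble_data) block: the threshold is the shared-cache-half value shifted
left by 4, and only sizes strictly above it branch to L(gobble_big_data).
A minimal C sketch of that comparison follows. The enum and function names
are hypothetical, and the parameter stands in for
__x86_shared_cache_size_half; what L(gobble_big_data) does is not shown in
the quoted part and is not modelled here.

```c
#include <stddef.h>

/* Hypothetical labels for the two branches visible in the quoted code. */
enum fill_path { FILL_REP_STOSB, FILL_BIG_DATA };

/* Mirror of the cmp/ja at L(gobble_data).
   shared_cache_size_half is a stand-in for __x86_shared_cache_size_half. */
enum fill_path choose_fill_path(size_t n, size_t shared_cache_size_half)
{
    size_t threshold = shared_cache_size_half << 4; /* shl $4, %r9 */

    /* ja is an unsigned "above", so equality stays on the rep stosb path. */
    return n > threshold ? FILL_BIG_DATA : FILL_REP_STOSB;
}
```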