
Re: [PATCH] x86-64: Add memcmp/wmemcmp optimized with AVX2


2017-06-16 4:15 GMT+02:00 H.J. Lu <hjl.tools@gmail.com>:
> On Thu, Jun 15, 2017 at 5:34 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
>> On Thu, Jun 01, 2017 at 08:45:19AM -0700, H.J. Lu wrote:
>>> Optimize x86-64 memcmp/wmemcmp with AVX2.  It uses vector compare as
>>> much as possible.  It is as fast as SSE4 memcmp for size <= 16 bytes
>>> and up to 2X faster for size > 16 bytes on Haswell and Skylake.  Select
>>> AVX2 memcmp/wmemcmp on AVX2 machines where vzeroupper is preferred and
>>> AVX unaligned load is fast.
>>>
>>> Key features:
>>>
>>> 1. Use overlapping compare to avoid branch.
>>> 2. Use vector compare when size >= 4 bytes for memcmp or size >= 8
>>>    bytes for wmemcmp.
>>> 3. If size is 8 * VEC_SIZE or less, unroll the loop.
>>> 4. Compare 4 * VEC_SIZE at a time with the aligned first memory area.
>>> 5. Use 2 vector compares when size is 2 * VEC_SIZE or less.
>>> 6. Use 4 vector compares when size is 4 * VEC_SIZE or less.
>>> 7. Use 8 vector compares when size is 8 * VEC_SIZE or less.
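
For anyone skimming the archive, here is a minimal C sketch of the overlapping-compare idea behind points 1 and 5 (the function name and the use of intrinsics are illustrative; this is not code from the patch): for 32 <= n <= 64 the first and the last 32 bytes are compared, and because the two loads overlap, every byte is covered with two vector compares and no branch on the exact size.

#include <immintrin.h>
#include <stddef.h>

/* Illustrative only: overlapping 32-byte compares for 32 <= n <= 64.  */
static int
memcmp_le_2x_vec (const unsigned char *s1, const unsigned char *s2, size_t n)
{
  __m256i a = _mm256_loadu_si256 ((const __m256i *) s1);
  __m256i b = _mm256_loadu_si256 ((const __m256i *) s2);
  unsigned int diff =
    ~(unsigned int) _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (a, b));
  if (diff != 0)
    {
      unsigned int i = __builtin_ctz (diff);  /* first differing byte */
      return s1[i] - s2[i];
    }
  /* Last 32 bytes; this load overlaps the first one whenever n < 64.  */
  a = _mm256_loadu_si256 ((const __m256i *) (s1 + n - 32));
  b = _mm256_loadu_si256 ((const __m256i *) (s2 + n - 32));
  diff = ~(unsigned int) _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (a, b));
  if (diff != 0)
    {
      unsigned int i = __builtin_ctz (diff);
      return s1[n - 32 + i] - s2[n - 32 + i];
    }
  return 0;
}
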
>>>
>>> Any comments?
>>>
>> I have some comments; it's similar to one of my previous patches.
>>
>>> +     cmpq    $(VEC_SIZE * 2), %rdx
>>> +     ja      L(more_2x_vec)
>>> +
>> This is an unnecessary branch; it's likely that there is a difference in the
>> first 16 bytes regardless of size.  Move the size test to...
>>> +L(last_2x_vec):
>>> +     /* From VEC to 2 * VEC.  No branch when size == VEC_SIZE.  */
>>> +     vmovdqu (%rsi), %ymm2
>>> +     VPCMPEQ (%rdi), %ymm2, %ymm2
>>> +     vpmovmskb %ymm2, %eax
>>> +     subl    $VEC_MASK, %eax
>>> +     jnz     L(first_vec)
>> here.
>>
>
> If we do that, the size check becomes redundant here:
>
>         /* Less than 4 * VEC.  */
>         cmpq    $VEC_SIZE, %rdx
>         jbe     L(last_vec)
>         cmpq    $(VEC_SIZE * 2), %rdx
>         jbe     L(last_2x_vec)
>
> L(last_4x_vec):
>
> Of course, we can duplicate these blocks to avoid the size check.
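
To make that trade-off concrete, here is a rough C sketch of the reordering Ondřej suggests (helper and function names are invented for illustration, not taken from the patch): the first 32-byte compare runs before any size test, so a difference in the first vector returns without ever looking at the length, and the size tests only run on the path where the first vector compared equal.

#include <immintrin.h>
#include <stddef.h>
#include <string.h>

/* Illustrative helper: bitmask of differing bytes in one 32-byte block.  */
static unsigned int
diff_mask_32 (const unsigned char *p, const unsigned char *q)
{
  __m256i a = _mm256_loadu_si256 ((const __m256i *) p);
  __m256i b = _mm256_loadu_si256 ((const __m256i *) q);
  return ~(unsigned int) _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (a, b));
}

/* Sketch of the reordered flow, assuming n >= 32.  */
static int
memcmp_first_vec_first (const unsigned char *s1, const unsigned char *s2,
                        size_t n)
{
  unsigned int diff = diff_mask_32 (s1, s2);
  if (diff != 0)
    {
      unsigned int i = __builtin_ctz (diff);
      return s1[i] - s2[i];
    }
  if (n <= 32)
    return 0;
  if (n <= 64)
    {
      diff = diff_mask_32 (s1 + n - 32, s2 + n - 32);
      if (diff == 0)
        return 0;
      unsigned int i = __builtin_ctz (diff);
      return s1[n - 32 + i] - s2[n - 32 + i];
    }
  /* Larger sizes would continue with the 4x/8x-VEC paths; fall back to
     the library memcmp here only to keep the sketch self-contained.  */
  return memcmp (s1, s2, n);
}
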
>
>>
>>> +L(first_vec):
>>> +     /* A byte or int32 is different within 16 or 32 bytes.  */
>>> +     bsfl    %eax, %ecx
>>> +# ifdef USE_AS_WMEMCMP
>>> +     xorl    %eax, %eax
>>> +     movl    (%rdi, %rcx), %edx
>>> +     cmpl    (%rsi, %rcx), %edx
>>> +L(wmemcmp_return):
>>> +     setl    %al
>>> +     negl    %eax
>>> +     orl     $1, %eax
>>> +# else
>>> +     movzbl  (%rdi, %rcx), %eax
>>> +     movzbl  (%rsi, %rcx), %edx
>>> +     sub     %edx, %eax
>>> +# endif
>>> +     VZEROUPPER
>>> +     ret
>>> +
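
In case the return-value computation above is not obvious, it maps onto the following plain C (just a restatement of what the quoted assembly does, not code from the patch): for wmemcmp, setl/negl/orl $1 turn the signed comparison into -1 or 1 (equality was already excluded on this path), and for memcmp the result is simply the difference of the two zero-extended bytes.

/* wmemcmp path: eax was zeroed, so setl %al leaves 0 or 1, negl turns
   that into 0 or -1, and orl $1 turns 0 into 1 while leaving -1 alone.  */
static int
wmemcmp_return_value (int a, int b)   /* a != b on this path */
{
  int r = -(a < b);   /* xorl + setl + negl */
  return r | 1;       /* orl $1 */
}

/* memcmp path: movzbl/movzbl/sub, i.e. the byte difference.  */
static int
memcmp_return_value (unsigned char a, unsigned char b)
{
  return a - b;
}
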
>>
>> Loading bytes depending on the result of bsf is slow; an alternative is to
>> derive the result from the vector tests.  I could avoid it with tests like
>> this, but I haven't measured the performance or tested it yet.
>>
>> vmovdqu (%rdi), %ymm3
>>
>> VPCMPGTQ %ymm2, %ymm3, %ymm4
>> VPCMPGTQ %ymm3, %ymm2, %ymm5
>> vpmovmskb %ymm4, %eax
>> vpmovmskb %ymm5, %edx
>> neg %eax
>> neg %edx
>> lzcnt %eax, %eax
>> lzcnt %edx, %edx
>> sub %edx, %eax
>> ret
>
> Andrew, can you give it a try?
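
If I read the snippet right, the idea is to derive the sign of the result from the two greater-than masks alone, so that no load depends on the bit-scan result. Below is a rough C sketch of that general idea, not a transcription of the snippet: it emulates unsigned byte compares by flipping the sign bit and then compares the positions of the first set bits, whereas the snippet uses signed 64-bit VPCMPGTQ with neg/lzcnt, so this sketch says nothing about whether the assembly as posted is correct.

#include <immintrin.h>
#include <stdint.h>

/* Illustrative only: branchless sign of memcmp over one 32-byte block,
   computed from two greater-than masks (sign only, which is all that
   memcmp requires).  */
static int
memcmp_32_sign_from_masks (const unsigned char *s1, const unsigned char *s2)
{
  /* Flip the sign bit so the signed vpcmpgtb acts as an unsigned compare.  */
  const __m256i bias = _mm256_set1_epi8 ((char) 0x80);
  __m256i a = _mm256_xor_si256 (_mm256_loadu_si256 ((const __m256i *) s1),
                                bias);
  __m256i b = _mm256_xor_si256 (_mm256_loadu_si256 ((const __m256i *) s2),
                                bias);

  uint64_t gt = (unsigned int) _mm256_movemask_epi8 (_mm256_cmpgt_epi8 (a, b));
  uint64_t lt = (unsigned int) _mm256_movemask_epi8 (_mm256_cmpgt_epi8 (b, a));

  /* Index of the first byte where s1 > s2 resp. s1 < s2; 32 if none.
     The sentinel bit keeps ctz well defined when a mask is empty.  */
  int first_gt = __builtin_ctzll (gt | (1ULL << 32));
  int first_lt = __builtin_ctzll (lt | (1ULL << 32));

  /* Whichever difference comes first decides the sign; 0 if neither.  */
  return (first_lt > first_gt) - (first_gt > first_lt);
}
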

Hi Ondrej, could you send a patch with your proposal?
I have tried the following change and got many wrong results from test-memcmp:

<       leaq    -VEC_SIZE(%rdi, %rdx), %rdi
<       leaq    -VEC_SIZE(%rsi, %rdx), %rsi
<       vmovdqu (%rsi), %ymm2
<       VPCMPEQ (%rdi), %ymm2, %ymm2
---
>       leaq    -VEC_SIZE(%rdi, %rdx), %r8
>       leaq    -VEC_SIZE(%rsi, %rdx), %r9
>       vmovdqu (%r9), %ymm2
>       VPCMPEQ (%r8), %ymm2, %ymm2
91,104c91,103
<       tzcntl  %eax, %ecx
< # ifdef USE_AS_WMEMCMP
<       xorl    %eax, %eax
<       movl    (%rdi, %rcx), %edx
<       cmpl    (%rsi, %rcx), %edx
< L(wmemcmp_return):
<       setl    %al
<       negl    %eax
<       orl     $1, %eax
< # else
<       movzbl  (%rdi, %rcx), %eax
<       movzbl  (%rsi, %rcx), %edx
<       sub     %edx, %eax
< # endif
---
>       vmovdqu (%rsi), %ymm2
>       vmovdqu (%rdi), %ymm3
>
>       VPCMPGTQ %ymm2, %ymm3, %ymm4
>       VPCMPGTQ %ymm3, %ymm2, %ymm5
>       vpmovmskb %ymm4, %eax
>       vpmovmskb %ymm5, %edx
>       neg %eax
>       neg %edx
>       lzcnt %eax, %eax
>       lzcnt %edx, %edx
>       sub %edx, %eax
>

Thanks.


--
WBR,
Andrew

