This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] x86-64: Add memcmp/wmemcmp optimized with AVX2


On Sat, Jun 17, 2017 at 3:44 AM, Andrew Senkevich
<andrew.n.senkevich@gmail.com> wrote:
> 2017-06-16 4:15 GMT+02:00 H.J. Lu <hjl.tools@gmail.com>:
>> On Thu, Jun 15, 2017 at 5:34 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
>>> On Thu, Jun 01, 2017 at 08:45:19AM -0700, H.J. Lu wrote:
>>>> Optimize x86-64 memcmp/wmemcmp with AVX2.  It uses vector compare as
>>>> much as possible.  It is as fast as SSE4 memcmp for size <= 16 bytes
>>>> and up to 2X faster for size > 16 bytes on Haswell and Skylake.  Select
>>>> AVX2 memcmp/wmemcmp on AVX2 machines where vzeroupper is preferred and
>>>> AVX unaligned load is fast.
>>>>
>>>> Key features:
>>>>
>>>> 1. Use overlapping compare to avoid branch.
>>>> 2. Use vector compare when size >= 4 bytes for memcmp or size >= 8
>>>>    bytes for wmemcmp.
>>>> 3. If size is 8 * VEC_SIZE or less, unroll the loop.
>>>> 4. Compare 4 * VEC_SIZE at a time with the aligned first memory area.
>>>> 5. Use 2 vector compares when size is 2 * VEC_SIZE or less.
>>>> 6. Use 4 vector compares when size is 4 * VEC_SIZE or less.
>>>> 7. Use 8 vector compares when size is 8 * VEC_SIZE or less.
>>>>
>>>> Any comments?
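
A minimal C sketch of the overlapping-compare idea in item 1 above, using
AVX2 intrinsics rather than the patch's assembly (the function name
differ_2x_vec is hypothetical; compile with -mavx2): for sizes between
VEC_SIZE and 2 * VEC_SIZE, compare the first and the last VEC_SIZE bytes
with loads that may overlap, so no scalar tail loop or extra branch is
needed.

#include <immintrin.h>
#include <stddef.h>

/* Nonzero iff s1 and s2 differ; assumes 32 < n <= 64.  The second
   pair of loads starts at n - 32 and may overlap the first pair, so
   together they cover all n bytes without a scalar tail.  */
static int
differ_2x_vec (const unsigned char *s1, const unsigned char *s2, size_t n)
{
  __m256i a0 = _mm256_loadu_si256 ((const __m256i *) s1);
  __m256i b0 = _mm256_loadu_si256 ((const __m256i *) s2);
  __m256i a1 = _mm256_loadu_si256 ((const __m256i *) (s1 + n - 32));
  __m256i b1 = _mm256_loadu_si256 ((const __m256i *) (s2 + n - 32));
  unsigned int m0 = _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (a0, b0));
  unsigned int m1 = _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (a1, b1));
  /* Each mask is 0xffffffff when its 32 bytes are all equal.  */
  return (m0 & m1) != 0xffffffffU;
}
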
>>>>
>>> I have some comments; it's similar to one of my previous patches.
>>>
>>>> +     cmpq    $(VEC_SIZE * 2), %rdx
>>>> +     ja      L(more_2x_vec)
>>>> +
>>> This branch is unnecessary; it's likely that there is a difference in
>>> the first 16 bytes regardless of size. Move the size test to...
>>>> +L(last_2x_vec):
>>>> +     /* From VEC to 2 * VEC.  No branch when size == VEC_SIZE.  */
>>>> +     vmovdqu (%rsi), %ymm2
>>>> +     VPCMPEQ (%rdi), %ymm2, %ymm2
>>>> +     vpmovmskb %ymm2, %eax
>>>> +     subl    $VEC_MASK, %eax
>>>> +     jnz     L(first_vec)
>>> here.
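
Ondřej's reordering, sketched in C under the assumption 32 <= n <= 64
(the function name is illustrative, not from the patch): run the first
vector compare unconditionally, since a mismatch in the first VEC_SIZE
bytes is likely regardless of size, and test the size only on the
all-equal path.

#include <immintrin.h>
#include <stddef.h>

static int
memcmp_first_vec_first (const unsigned char *s1, const unsigned char *s2,
                        size_t n)
{
  __m256i a = _mm256_loadu_si256 ((const __m256i *) s1);
  __m256i b = _mm256_loadu_si256 ((const __m256i *) s2);
  unsigned int eq = _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (a, b));
  if (eq != 0xffffffffU)
    {
      /* Common case: mismatch in the first 32 bytes; no size test ran.  */
      unsigned int idx = __builtin_ctz (~eq);
      return (int) s1[idx] - (int) s2[idx];
    }
  if (n <= 32)              /* Size is tested only on the equal path.  */
    return 0;
  /* Overlapping compare of the last 32 bytes (as in key feature 1).  */
  a = _mm256_loadu_si256 ((const __m256i *) (s1 + n - 32));
  b = _mm256_loadu_si256 ((const __m256i *) (s2 + n - 32));
  eq = _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (a, b));
  if (eq != 0xffffffffU)
    {
      unsigned int idx = __builtin_ctz (~eq);
      return (int) s1[n - 32 + idx] - (int) s2[n - 32 + idx];
    }
  return 0;
}
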
>>>
>>
>> If we do that, the size checks in the following become redundant:
>>
>>         /* Less than 4 * VEC.  */
>>         cmpq    $VEC_SIZE, %rdx
>>         jbe     L(last_vec)
>>         cmpq    $(VEC_SIZE * 2), %rdx
>>         jbe     L(last_2x_vec)
>>
>> L(last_4x_vec):
>>
>> Of course, we can duplicate these blocks to avoid the size checks.
>>
>>>
>>>> +L(first_vec):
>>>> +     /* A byte or int32 is different within 16 or 32 bytes.  */
>>>> +     bsfl    %eax, %ecx
>>>> +# ifdef USE_AS_WMEMCMP
>>>> +     xorl    %eax, %eax
>>>> +     movl    (%rdi, %rcx), %edx
>>>> +     cmpl    (%rsi, %rcx), %edx
>>>> +L(wmemcmp_return):
>>>> +     setl    %al
>>>> +     negl    %eax
>>>> +     orl     $1, %eax
>>>> +# else
>>>> +     movzbl  (%rdi, %rcx), %eax
>>>> +     movzbl  (%rsi, %rcx), %edx
>>>> +     sub     %edx, %eax
>>>> +# endif
>>>> +     VZEROUPPER
>>>> +     ret
>>>> +
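
What the quoted L(first_vec) block computes, sketched in C (function
names are illustrative): __builtin_ctz plays the role of bsf, and for
wmemcmp the setl/negl/orl sequence collapses the signed int32
comparison to -1 or 1.

#include <stdint.h>
#include <string.h>

static int
first_vec_memcmp (const unsigned char *s1, const unsigned char *s2,
                  uint32_t mask)      /* nonzero mismatch mask */
{
  unsigned int idx = __builtin_ctz (mask);   /* bsfl %eax, %ecx */
  return (int) s1[idx] - (int) s2[idx];      /* movzbl; movzbl; sub */
}

static int
first_vec_wmemcmp (const unsigned char *s1, const unsigned char *s2,
                   uint32_t mask)
{
  /* With a per-int32 compare, a differing element sets all four of
     its mask bytes, so the bsf result is 4-byte aligned.  */
  unsigned int idx = __builtin_ctz (mask);
  int a, b;
  memcpy (&a, s1 + idx, sizeof a);   /* movl (%rdi, %rcx), %edx */
  memcpy (&b, s2 + idx, sizeof b);   /* cmpl (%rsi, %rcx), %edx */
  /* Only reached on mismatch, so the result is -1 or 1.  */
  return a < b ? -1 : 1;             /* setl %al; negl %eax; orl $1, %eax */
}
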
>>>
>>> Loading bytes based on the result of bsf is slow; an alternative is to
>>> derive the result from the vector tests. I could avoid it with tests
>>> like the following, but I haven't measured or tested them yet.
>>>
>>> vmovdqu (%rdi), %ymm3
>>>
>>> VPCMPGTQ %ymm2, %ymm3, %ymm4
>>> VPCMPGTQ %ymm3, %ymm2, %ymm5
>>> vpmovmskb %ymm4, %eax
>>> vpmovmskb %ymm5, %edx
>>> neg %eax
>>> neg %edx
>>> lzcnt %eax, %eax
>>> lzcnt %edx, %edx
>>> sub %edx, %eax
>>> ret
>>
>> Andrew, can you give it a try?
>
> Hi Ondrej, could you send a patch with your proposal?
> I tried it with the following change and got many wrong results from
> test-memcmp:

We can't use VPCMPGT for memcmp since it performs a signed
comparison, but memcmp requires an unsigned comparison.
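
A one-byte demonstration of the problem: memcmp orders bytes as
unsigned values, so any per-element signed compare (such as VPCMPGT)
inverts the result whenever the high bit differs.

#include <stdio.h>
#include <string.h>

int
main (void)
{
  unsigned char a = 0x80;   /* 128 as unsigned, -128 as signed */
  unsigned char b = 0x01;
  printf ("memcmp: %d\n", memcmp (&a, &b, 1));     /* positive */
  printf ("signed: %d\n", (int) (signed char) a
                          - (int) (signed char) b); /* negative */
  return 0;
}
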

H.J.

