This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] x86-64: Add memcmp/wmemcmp optimized with AVX2
- From: Florian Weimer <fweimer at redhat dot com>
- To: "H.J. Lu" <hjl dot tools at gmail dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Thu, 1 Jun 2017 20:39:05 +0200
- Subject: Re: [PATCH] x86-64: Add memcmp/wmemcmp optimized with AVX2
- References: <20170601154519.GB14526@lucon.org> <33f989bd-5357-086a-27a7-7437718f5ac3@redhat.com> <CAMe9rOpYpksQnqBSZjF1dDM7YMr4Qj6hNdi1MCESBP825ysRrg@mail.gmail.com>
On 06/01/2017 07:19 PM, H.J. Lu wrote:
> On Thu, Jun 1, 2017 at 9:41 AM, Florian Weimer <fweimer@redhat.com> wrote:
>> On 06/01/2017 05:45 PM, H.J. Lu wrote:
>>> +L(between_4_7):
>>> + vmovd (%rdi), %xmm1
>>> + vmovd (%rsi), %xmm2
>>> + VPCMPEQ %xmm1, %xmm2, %xmm2
>>> + vpmovmskb %xmm2, %eax
>>> + subl $0xffff, %eax
>>> + jnz L(first_vec)
>>
>> Is this really faster than two 32-bit bswaps followed by a sub?
>
> Can you elaborate how to use bswap here?
Something like this:
/* Load 4 to 7 bytes into an 8-byte word.
   ABCDEFG turns into GFEDDCBA.
   ABCDEF turns into FEDCDCBA.
   ABCDE turns into EDCBDCBA.
   ABCD turns into DCBADCBA.
   bswapq below reverses the order of bytes.
   The duplicated bytes do not affect the comparison result. */
        movl    -4(%rdi, %rdx), R1
        shlq    $32, R1
        movl    -4(%rsi, %rdx), R2
        shlq    $32, R2
        movl    (%rdi), R3
        orq     R3, R1
        movl    (%rsi), R3
        orq     R3, R2
/* Variant below starts after this point. */
        cmpq    R1, R2
        jne     L(diffin8bytes)
        xorl    %eax, %eax
        ret
L(diffin8bytes):
        bswapq  R1
        bswapq  R2
        cmpq    R1, R2
        sbbl    %eax, %eax      /* Set to -1 if R2 < R1 (unsigned), otherwise 0. */
        orl     $1, %eax        /* Turn 0 into 1, but preserve -1. */
        ret
(Not sure about the right ordering for R1 and R2 here.)
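In C-like terms, the same idea looks roughly like this (only a sketch, assuming a
little-endian target and GCC's __builtin_bswap64; the helper name is made up for
illustration):

  #include <stdint.h>
  #include <string.h>

  /* Compare 4 to 7 bytes by merging the first and the last four bytes of
     each buffer into one 64-bit word.  The overlapping bytes appear
     identically in both words, so they cancel out in the comparison.  */
  static int
  memcmp_4_to_7 (const unsigned char *s1, const unsigned char *s2, size_t n)
  {
    uint32_t first1, last1, first2, last2;
    memcpy (&first1, s1, 4);
    memcpy (&last1, s1 + n - 4, 4);   /* Overlaps the first load for n < 8.  */
    memcpy (&first2, s2, 4);
    memcpy (&last2, s2 + n - 4, 4);
    uint64_t w1 = ((uint64_t) last1 << 32) | first1;
    uint64_t w2 = ((uint64_t) last2 << 32) | first2;
    if (w1 == w2)
      return 0;
    /* bswap makes an unsigned comparison of the words equivalent to a
       lexicographic (memcmp-style) comparison of the bytes.  */
    return __builtin_bswap64 (w1) < __builtin_bswap64 (w2) ? -1 : 1;
  }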
There's a way to avoid the conditional jump completely, but whether
that's worthwhile depends on the cost of the bswapq and the cmove:
        bswapq  R1
        bswapq  R2
        xorl    R3, R3
        cmpq    R1, R2
        sbbl    %eax, %eax
        orl     $1, %eax
        cmpq    R1, R2
        cmove   R3, %eax
        ret
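Spelled out in C, the branchless tail amounts to something like this (again only a
model of the sequence above, with the same caveat about which cmpq operand order
gives memcmp's sign):

  #include <stdint.h>

  /* Returns -1, 0 or 1; the assembly above does this without a conditional
     branch via sbb/or/cmov.  cmpq R1, R2 sets CF when R2 is below R1.  */
  static int
  sign_branchless (uint64_t w1, uint64_t w2)
  {
    int r = (w2 < w1) ? -1 : 0;  /* sbbl %eax, %eax after cmpq R1, R2.  */
    r |= 1;                      /* 0 becomes 1, -1 stays -1.  */
    if (w1 == w2)                /* cmove: force 0 for equal inputs.  */
      r = 0;
    return r;
  }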
See this patch and the related discussion:
<https://sourceware.org/ml/libc-alpha/2014-02/msg00139.html>
>> What is ensuring alignment, so that the vmovd instructions cannot fault?
>
> What do you mean? This sequence compares the last 4 bytes with
> vmovd, which loads 4 bytes and zeroes out the high 12 bytes, and
> VPCMPEQ. If they aren't the same, go to L(first_vec).
Ah, I see now. The loads overlap. Maybe add a comment to that effect?
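For reference, the overlapping loads cannot fault for 4 <= n <= 7 because the
second 4-byte window starts at n - 4 and ends exactly at the end of the buffer,
so it stays in bounds and overlaps the first window.  A trivial self-contained
check (illustrative only):

  #include <assert.h>
  #include <stddef.h>

  int
  main (void)
  {
    for (size_t n = 4; n <= 7; n++)
      {
        size_t last = n - 4;     /* Start of the second 4-byte load.  */
        assert (last + 4 == n);  /* It never reads past the buffer.  */
        assert (last < 4);       /* It overlaps the first load.  */
      }
    return 0;
  }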
Thanks,
Florian