This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH] x86-64: Add memcmp/wmemcmp optimized with AVX2


On 06/01/2017 07:19 PM, H.J. Lu wrote:
> On Thu, Jun 1, 2017 at 9:41 AM, Florian Weimer <fweimer@redhat.com> wrote:
>> On 06/01/2017 05:45 PM, H.J. Lu wrote:
>>> +L(between_4_7):
>>> +     vmovd   (%rdi), %xmm1
>>> +     vmovd   (%rsi), %xmm2
>>> +     VPCMPEQ %xmm1, %xmm2, %xmm2
>>> +     vpmovmskb %xmm2, %eax
>>> +     subl    $0xffff, %eax
>>> +     jnz     L(first_vec)
>>
>> Is this really faster than two 32-bit bswaps followed by a sub?
> 
> Can you elaborate how to use bswap here?

Something like this:

  /* Load 4 to 7 bytes into an 8-byte word.
     ABCDEFG turns into GFEDDCBA.
     ABCDEF  turns into FEDCDCBA.
     ABCDE   turns into EDCBDCBA.
     ABCD    turns into DCBADCBA.
     bswapq below reverses the order of bytes.
     The duplicated bytes do not affect the comparison result.  */
  movl -4(%rdi, %rdx), R1
  shlq $32, R1
  movl -4(%rsi, %rdx), R2
  shlq $32, R2
  movl (%rdi), R3
  orq R3, R1
  movl (%rsi), R3
  orq R3, R2
  /* Variant below starts after this point. */
  cmpq R1, R2
  jne L(diffin8bytes)
  xor %eax, %eax
  ret

L(diffin8bytes):
  bswapq R1
  bswapq R2
  cmpq R2, R1
  sbbl %eax, %eax	/* Set to -1 if R1 < R2, otherwise 0.  */
  orl $1, %eax		/* Turn 0 into 1, but preserve -1.  */
  ret

(Not sure about the right ordering for R1 and R2 here.)
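
As a rough C model of that sequence (only a sketch, not the proposed
glibc code; the combine_4_7/memcmp_4_7 names are made up, and it
assumes a little-endian host such as x86-64):

  #include <stdint.h>
  #include <string.h>

  /* Build the 8-byte word described above: last 4 bytes of the buffer
     in the high half, first 4 bytes in the low half.  For n in [4, 7]
     the two 4-byte loads overlap, duplicating the middle bytes.  */
  static uint64_t
  combine_4_7 (const unsigned char *p, size_t n)
  {
    uint32_t head, tail;
    memcpy (&head, p, 4);
    memcpy (&tail, p + n - 4, 4);
    return ((uint64_t) tail << 32) | head;
  }

  /* memcmp for lengths 4 to 7: byte-swap both combined words so that
     a single unsigned 64-bit comparison yields the byte-wise order.  */
  static int
  memcmp_4_7 (const unsigned char *s1, const unsigned char *s2, size_t n)
  {
    uint64_t a = __builtin_bswap64 (combine_4_7 (s1, n));
    uint64_t b = __builtin_bswap64 (combine_4_7 (s2, n));
    if (a == b)
      return 0;
    return a < b ? -1 : 1;
  }

The duplicated middle bytes drop out because the head bytes are
compared first, and the tail bytes only decide the result once the
head bytes are equal.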

There's a way to avoid the conditional jump completely, but whether
that's worthwhile depends on the cost of the bswapq and the cmove:

  bswapq R1
  bswapq R2
  xorl R3, R3
  cmpq R2, R1
  sbbl %eax, %eax
  orl $1, %eax
  cmpq R2, R1
  cmove R3, %eax
  ret
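
As a rough C model of that branchless tail (again only a sketch with a
made-up name): after the byte swaps the result is just the three-way
sign of an unsigned 64-bit comparison, which is what the sbb/or/cmove
sequence produces.

  #include <stdint.h>

  /* -1 if a < b, 1 if a > b, 0 if equal, for the byte-swapped
     words; mirrors the sbbl/orl/cmove sequence above.  */
  static int
  sign_4_7 (uint64_t a, uint64_t b)
  {
    return (a > b) - (a < b);
  }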

See this patch and the related discussion:

  <https://sourceware.org/ml/libc-alpha/2014-02/msg00139.html>

>> What is ensuring alignment, so that the vmovd instructions cannot fault?
> 
> What do you mean?  This sequence compares the last 4 bytes using
> vmovd, which loads 4 bytes and zeroes out the high 12 bytes,
> followed by VPCMPEQ.  If they aren't the same, it jumps to
> L(first_vec).

Ah, I see now.  The loads overlap.  Maybe add a comment to that effect?
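
A rough C sketch of that overlap idea (illustrative only, not the
patch itself; equal_4_7 is a made-up name): for n between 4 and 7, a
load of the first 4 bytes and a load of the last 4 bytes together
cover the whole buffer, and neither load touches memory outside it.

  #include <stdint.h>
  #include <string.h>

  /* Equality check for lengths 4 to 7 using two possibly overlapping
     4-byte loads per buffer, in the spirit of the vmovd/VPCMPEQ
     sequence quoted above.  */
  static int
  equal_4_7 (const unsigned char *s1, const unsigned char *s2, size_t n)
  {
    uint32_t a, b, c, d;
    memcpy (&a, s1, 4);
    memcpy (&b, s2, 4);
    memcpy (&c, s1 + n - 4, 4);
    memcpy (&d, s2 + n - 4, 4);
    return a == b && c == d;
  }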

Thanks,
Florian
