This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH] x86-64: Add memcmp/wmemcmp optimized with AVX2


On 06/01/2017 07:19 PM, H.J. Lu wrote:
> On Thu, Jun 1, 2017 at 9:41 AM, Florian Weimer <fweimer@redhat.com> wrote:
>> On 06/01/2017 05:45 PM, H.J. Lu wrote:
>>> +L(between_4_7):
>>> +     vmovd   (%rdi), %xmm1
>>> +     vmovd   (%rsi), %xmm2
>>> +     VPCMPEQ %xmm1, %xmm2, %xmm2
>>> +     vpmovmskb %xmm2, %eax
>>> +     subl    $0xffff, %eax
>>> +     jnz     L(first_vec)
>>
>> Is this really faster than two 32-bit bswaps followed by a sub?
> 
> Can you elaborate how to use bswap here?

Something like this:

  /* Load 4 to 7 bytes into an 8-byte word.
     ABCDEFG turns into GFEDDCBA.
     ABCDEF  turns into FEDCDCBA.
     ABCDE   turns into EDCBDCBA.
     ABCD    turns into DCBADCBA.
     bswapq below reverses the order of bytes.
     The duplicated bytes do not affect the comparison result.  */
  movl -4(%rdi, %rdx), R1
  shlq $32, R1
  movl -4(%rsi, %rdx), R2
  shlq $32, R2
  movl (%rdi), R3
  orq R3, R1
  movl (%rsi), R3
  orq R3, R2
  /* Variant below starts after this point. */
  cmpq R1, R2
  jne L(diffin8bytes)
  xor %eax, %eax
  ret

L(diffin8bytes):
  bswapq R1
  bswapq R2
  cmpq R2, R1
  sbbl %eax, %eax	/* Set to -1 if R1 < R2, otherwise 0.  */
  orl $1, %eax		/* Turn 0 into 1, but preserve -1.  */
  ret

(Not sure about the right ordering for R1 and R2 here.)
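
As a rough C model of that sequence (only a sketch, not the proposed
glibc code; the combine_4_7/memcmp_4_7 names are made up, and it
assumes a little-endian host such as x86-64):

  #include <stdint.h>
  #include <string.h>

  /* Build the 8-byte word described above: last 4 bytes of the buffer
     in the high half, first 4 bytes in the low half.  For n in [4, 7]
     the two 4-byte loads overlap, duplicating the middle bytes.  */
  static uint64_t
  combine_4_7 (const unsigned char *p, size_t n)
  {
    uint32_t head, tail;
    memcpy (&head, p, 4);
    memcpy (&tail, p + n - 4, 4);
    return ((uint64_t) tail << 32) | head;
  }

  /* memcmp for lengths 4 to 7: byte-swap both combined words so that
     a single unsigned 64-bit comparison yields the byte-wise order.  */
  static int
  memcmp_4_7 (const unsigned char *s1, const unsigned char *s2, size_t n)
  {
    uint64_t a = __builtin_bswap64 (combine_4_7 (s1, n));
    uint64_t b = __builtin_bswap64 (combine_4_7 (s2, n));
    if (a == b)
      return 0;
    return a < b ? -1 : 1;
  }

The duplicated middle bytes drop out because the head bytes are
compared first, and the tail bytes only decide the result once the
head bytes are equal.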

There's a way to avoid the conditional jump completely, but whether
that's worthwhile depends on the cost of the bswapq and the cmove:

  bswapq R1
  bswapq R2
  xorl R3, R3
  cmpq R2, R1
  sbbl %eax, %eax
  orl $1, %eax
  cmpq R2, R1
  cmove R3, %eax
  ret
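
As a rough C model of that branchless tail (again only a sketch with a
made-up name): after the byte swaps the result is just the three-way
sign of an unsigned 64-bit comparison, which is what the sbb/or/cmove
sequence produces.

  #include <stdint.h>

  /* -1 if a < b, 1 if a > b, 0 if equal, for the byte-swapped
     words; mirrors the sbbl/orl/cmove sequence above.  */
  static int
  sign_4_7 (uint64_t a, uint64_t b)
  {
    return (a > b) - (a < b);
  }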

See this patch and the related discussion:

  <https://sourceware.org/ml/libc-alpha/2014-02/msg00139.html>

>> What is ensuring alignment, so that the vmovd instructions cannot fault?
> 
> What do you mean?  This sequence compares the last 4 bytes using
> vmovd, which loads 4 bytes and zeroes out the high 12 bytes,
> followed by VPCMPEQ.  If they aren't the same, it jumps to
> L(first_vec).

Ah, I see now.  The loads overlap.  Maybe add a comment to that effect?
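
A rough C sketch of that overlap idea (illustrative only, not the
patch itself; equal_4_7 is a made-up name): for n between 4 and 7, a
load of the first 4 bytes and a load of the last 4 bytes together
cover the whole buffer, and neither load touches memory outside it.

  #include <stdint.h>
  #include <string.h>

  /* Equality check for lengths 4 to 7 using two possibly overlapping
     4-byte loads per buffer, in the spirit of the vmovd/VPCMPEQ
     sequence quoted above.  */
  static int
  equal_4_7 (const unsigned char *s1, const unsigned char *s2, size_t n)
  {
    uint32_t a, b, c, d;
    memcpy (&a, s1, 4);
    memcpy (&b, s2, 4);
    memcpy (&c, s1 + n - 4, 4);
    memcpy (&d, s2 + n - 4, 4);
    return a == b && c == d;
  }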

Thanks,
Florian
