This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: libc-alpha at sourceware dot org
- Date: Mon, 3 Feb 2014 21:24:56 +0100
- Subject: Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp
- References: <52EBBCC2 dot 7090807 at redhat dot com> <20140131171911 dot GA25609 at domone dot podge> <52EF931A dot 3000508 at redhat dot com> <20140203144305 dot GA14697 at domone dot podge> <52EFC0CD dot 6030201 at redhat dot com> <20140203194155 dot GA14189 at domone dot podge>
On Mon, Feb 03, 2014 at 05:16:13PM +0100, Florian Weimer wrote:
> On 02/03/2014 03:43 PM, Ondřej Bílka wrote:
>
> >And there is a third factor: memcmp with small constant arguments
> >could be inlined. This is not the case now, but a patch would be welcome.
>
> Inlining memcmp in GCC has historically been a bad decision.
> Perhaps we could make an exception for memcmp calls with known
> alignment and really small sizes. In terms of GCC optimizations,
> dispatching to a few versions specialized for certain lengths, and a
> version that only delivers an unordered, boolean result promises
> significant wins as well.
>
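For the boolean-only case the win is easy to see. A sketch of what an inlined
memcmp (a, b, 8) == 0 could reduce to (hypothetical helper, for illustration
only, not from GCC or from this patch):

#include <stdint.h>
#include <string.h>

/* Hypothetical helper: with a known 8-byte length and a boolean result,
   memcmp reduces to one load-and-compare per operand, with no byte loop
   and no bswap.  */
static inline int
memeq8 (const void *a, const void *b)
{
  uint64_t x, y;
  memcpy (&x, a, 8);   /* compiles to plain 8-byte loads on x86_64 */
  memcpy (&y, b, 8);
  return x == y;
}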
> >>I didn't try to exercise the page-crossing code path, although its
> >>byte-wise comparisons would be problematic from a timing oracle
> >>point of view.
> >>
> >I did not optimize that, as it is a cold path and I favoured simplicity.
> >It could be done with more complicated code that would be harder to
> >review and probably slower due to additional branch misprediction.
>
> I looked at the page-crossing logic and noticed that you trigger it
> based on the addresses alone. Is it not worth checking whether only
> the over-reading part would actually cross the page boundary?
>
If alignments were uniformly distributed, these checks would fire with
probability 1/32 (in practice less, as data tend to be aligned).
You could add such checks after that decision, as adding a cycle to the
likely path would hurt performance more than the cross-page path would gain.
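For concreteness, the finer test being discussed would look something like
this (sketch only, hypothetical helper, not the code in the patch):

#include <stdint.h>

/* Hypothetical helper: test whether a load_size-byte load starting at p
   would actually cross a 4K page, instead of deciding from the source
   alignments alone.  */
static int
load_crosses_page (const void *p, unsigned load_size)
{
  return ((uintptr_t) p & 4095) > 4096 - load_size;
}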
> >As for SSE you need to emulate variable shifts
>
> Is this the mask you compute in %rcx following the handle_end label?
>
No, that was a comment on why SSE in the cross-page case would be more
complicated. You would need an indirect jump to the appropriate shift, or a
table lookup for a shuffle, and these need to be performed repeatedly to pay off.
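For reference, the table-lookup-for-shuffle option means something like the
following (SSSE3 sketch, names made up, not from the patch):

#include <emmintrin.h>
#include <tmmintrin.h>

/* Hypothetical helper: emulate a variable byte-wise right shift of an
   XMM register by indexing a sliding window of shuffle-control bytes
   and feeding it to pshufb.  Requires SSSE3.  */
static const unsigned char shift_masks[32] =
  {
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
    0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80
  };

static __m128i
shift_right_bytes (__m128i x, unsigned n)   /* 0 <= n <= 16 */
{
  __m128i mask = _mm_loadu_si128 ((const __m128i *) (shift_masks + n));
  return _mm_shuffle_epi8 (x, mask);
}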
> >int
> >cmp (uint64_t a, uint64_t b)
> >{
> > uint64_t diff = a ^ b;
> > uint64_t bit = a ^ (a - 1);
> > int64_t ret = (a & bit) - (b & bit);
> > return ret >> 32;
> >}
>
> This looks broken to me. diff is dead, and after a little-endian
> load, the bit order in the word is mixed, so additional fiddling is
> needed to derive an accurate ordering from a bit difference.
>
Yes, I wrote that in a hurry. There are two problems that I realized in
the meantime. The first is that I need to construct a byte mask, not a bit
mask as it looks here.
The second is that I need to do a comparison anyway, as the subtraction
could overflow. A conditional move is preferable here, as the return value
is unpredictable.
It is possible to construct the mask with more bit fiddling, but I am not
sure it would be faster than bsf.
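For illustration, a bsf-based variant that avoids both problems could look
like this (sketch only, assuming a and b hold 8-byte little-endian loads of
the two buffers):

#include <stdint.h>

/* Sketch, not the patch: the lowest set bit of a ^ b marks the first
   differing byte in memory order after little-endian loads; comparing
   just those two bytes cannot overflow.  */
static int
cmp64 (uint64_t a, uint64_t b)
{
  uint64_t diff = a ^ b;
  if (diff == 0)
    return 0;
  unsigned shift = __builtin_ctzll (diff) & ~7u;  /* bsf, rounded down to a byte */
  return (int) ((a >> shift) & 0xff) - (int) ((b >> shift) & 0xff);
}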