This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: libc-alpha at sourceware dot org
- Date: Mon, 3 Feb 2014 21:24:56 +0100
- Subject: Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp
- References: <52EBBCC2 dot 7090807 at redhat dot com> <20140131171911 dot GA25609 at domone dot podge> <52EF931A dot 3000508 at redhat dot com> <20140203144305 dot GA14697 at domone dot podge> <52EFC0CD dot 6030201 at redhat dot com> <20140203194155 dot GA14189 at domone dot podge>
On Mon, Feb 03, 2014 at 05:16:13PM +0100, Florian Weimer wrote:
> On 02/03/2014 03:43 PM, Ondřej Bílka wrote:
>
> >And there is a third factor: memcmp with small constant arguments
> >could be inlined. This is not the case now, but a patch would be welcome.
>
> Inlining memcmp in GCC has historically been a bad decision.
> Perhaps we could make an exception for memcmp calls with known
> alignment and really small sizes. In terms of GCC optimizations,
> dispatching to a few versions specialized for certain lengths, and a
> version that only delivers an unordered, boolean result promises
> significant wins as well.
>
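For the boolean-only case the win is easy to see. A sketch of what an inlined
memcmp (a, b, 8) == 0 could reduce to (hypothetical helper, for illustration
only, not from GCC or from this patch):

#include <stdint.h>
#include <string.h>

/* Hypothetical helper: with a known 8-byte length and a boolean result,
   memcmp reduces to one load-and-compare per operand, with no byte loop
   and no bswap.  */
static inline int
memeq8 (const void *a, const void *b)
{
  uint64_t x, y;
  memcpy (&x, a, 8);   /* compiles to plain 8-byte loads on x86_64 */
  memcpy (&y, b, 8);
  return x == y;
}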
> >>I didn't try to exercise the page-crossing code path, although its
> >>byte-wise comparisons would be problematic from a timing oracle
> >>point of view.
> >>
> >I did not optimize that, as it is a cold path and I favoured simplicity.
> >It could be done with more complicated code that would be harder to
> >review and probably slower due to additional branch misprediction.
>
> I looked at the page-crossing logic and noticed that you trigger it
> based on the addresses alone. Is it not worth checking whether only
> the over-reading part would actually cross the page boundary?
>
If alignments were uniformly distributed, these checks would fire with
probability 1/32 (in practice less, as data tend to be aligned).
You could add such checks after that decision, as adding a cycle to the
likely path would hurt performance more than the cross-page path would gain.
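For concreteness, the finer test being discussed would look something like
this (sketch only, hypothetical helper, not the code in the patch):

#include <stdint.h>

/* Hypothetical helper: test whether a load_size-byte load starting at p
   would actually cross a 4K page, instead of deciding from the source
   alignments alone.  */
static int
load_crosses_page (const void *p, unsigned load_size)
{
  return ((uintptr_t) p & 4095) > 4096 - load_size;
}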
> >As for SSE you need to emulate variable shifts
>
> Is this the mask you compute in %rcx following the handle_end label?
>
No, that was a comment on why SSE in the cross-page case would be more
complicated. You would need an indirect jump to the appropriate shift, or a
table lookup for a shuffle, and these need to be performed repeatedly to pay off.
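For reference, the table-lookup-for-shuffle option means something like the
following (SSSE3 sketch, names made up, not from the patch):

#include <emmintrin.h>
#include <tmmintrin.h>

/* Hypothetical helper: emulate a variable byte-wise right shift of an
   XMM register by indexing a sliding window of shuffle-control bytes
   and feeding it to pshufb.  Requires SSSE3.  */
static const unsigned char shift_masks[32] =
  {
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
    0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80
  };

static __m128i
shift_right_bytes (__m128i x, unsigned n)   /* 0 <= n <= 16 */
{
  __m128i mask = _mm_loadu_si128 ((const __m128i *) (shift_masks + n));
  return _mm_shuffle_epi8 (x, mask);
}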
> >int
> >cmp (uint64_t a, uint64_t b)
> >{
> > uint64_t diff = a ^ b;
> > uint64_t bit = a ^ (a - 1);
> > int64_t ret = (a & bit) - (b & bit);
> > return ret >> 32;
> >}
>
> This looks broken to me. diff is dead, and after a little-endian
> load, the bit order in the word is mixed, so additional fiddling is
> needed to derive an accurate ordering from a bit difference.
>
Yes, I wrote that in a hurry. There are two problems that I realized in
the meantime. The first is that I need to construct a byte mask, not a bit
mask as it looks here.
The second is that I need to do a comparison anyway, as the subtraction
could overflow. A conditional move is preferable here, as the return value
is unpredictable.
It is possible to construct the mask with more bit fiddling, but I am not
sure it would be faster than bsf.
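For illustration, a bsf-based variant that avoids both problems could look
like this (sketch only, assuming a and b hold 8-byte little-endian loads of
the two buffers):

#include <stdint.h>

/* Sketch, not the patch: the lowest set bit of a ^ b marks the first
   differing byte in memory order after little-endian loads; comparing
   just those two bytes cannot overflow.  */
static int
cmp64 (uint64_t a, uint64_t b)
{
  uint64_t diff = a ^ b;
  if (diff == 0)
    return 0;
  unsigned shift = __builtin_ctzll (diff) & ~7u;  /* bsf, rounded down to a byte */
  return (int) ((a >> shift) & 0xff) - (int) ((b >> shift) & 0xff);
}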