This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp


On Mon, Feb 03, 2014 at 02:01:14PM +0100, Florian Weimer wrote:
> On 01/31/2014 06:19 PM, Ondřej Bílka wrote:
> 
> >I sent a new implementation three months ago but did not get to ping it
> >because I have sent a replacement here:
> >
> >https://sourceware.org/ml/libc-alpha/2013-09/msg00556.html
> >
> >Could you also test which of these is faster?
> 
> The main benefit comes from the lack of length-specific special
> cases. This means it's considerably faster on my
> /usr/share/dict/words sorting benchmark and sorting random binary
> blobs of random length.  If the length is constant across calls and
> the indirect jump is predicted correctly, it is less clear which
> version is faster.  A pairwise t-test against the BSWAP version on
> the benchtest output does not yield conclusive results, although
> looking at the data suggests that yours is faster.
>
A combination of these factors makes it a minority case; optimizing the
general case has priority.

A constant argument might be passed when you sort structures.
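For example, sorting records by a fixed-size key passes a compile-time
constant length to memcmp on every call; a minimal sketch (the struct and
comparator are made up just to illustrate):

#include <stdlib.h>
#include <string.h>

struct rec
{
  unsigned char key[16];
  int payload;
};

/* qsort comparator; the length argument of memcmp is the constant 16.  */
static int
cmp_rec (const void *a, const void *b)
{
  const struct rec *ra = a, *rb = b;
  return memcmp (ra->key, rb->key, sizeof ra->key);
}

/* Used as: qsort (array, count, sizeof (struct rec), cmp_rec);  */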

For the indirect jump you need the calls to be made close to each other
so the prediction state stays cached.

And there is a third factor: memcmp with small constant arguments could
be inlined.  That is not the case now, but a patch would be welcome.
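An untested sketch of what such inlining could look like for a constant
length of 4 (memcmp4_inline is just an illustrative name; BSWAP makes the
loaded words compare in the same order as the bytes in memory):

#include <stdint.h>
#include <string.h>

static inline int
memcmp4_inline (const void *a, const void *b)
{
  uint32_t wa, wb;
  memcpy (&wa, a, 4);           /* unaligned-safe 4-byte loads */
  memcpy (&wb, b, 4);
  wa = __builtin_bswap32 (wa);  /* byte order -> numeric order */
  wb = __builtin_bswap32 (wb);
  return (wa > wb) - (wa < wb);
}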
 
> I didn't try to exercise the page-crossing code path, although its
> byte-wise comparisons would be problematic from a timing oracle
> point of view.
> 
I did not optimize that, as it is a cold path and I favoured simplicity.
It could be done with more complicated code that would be harder to
review and probably slower due to additional branch mispredictions.

As for SSE, you would need to emulate variable shifts; a better
alternative is to do 8-byte loads and shift them to zero the unneeded
bits.  I wrote an untested sketch of that below, which would still be a
bit slow.

> >>In case BSWAP turns out to be too slow for some CPUs, it could be
> >>replaced with BSF and a shift (both constant time in current
> >>processors), like we already do in some of the existing memcmp
> >>implementations.
> >>
> >In my implementation I use a test of the first 16 bytes.
> >
> >It really helps when you can read a few bytes after the end and then
> >test whether the match is within bounds.  I originally did that, but it
> >did not improve performance due to the mispredictions that these checks
> >caused.  What I have now comes from studying the input.  In practice a
> >memcmp (x, y, n) which ends early typically has n < 16, where in the
> >bitmask of mismatches you can just set the (n-1)-th bit to one.
> 
> Reading past the end is impossible because of mmap.  If we're
> willing to run with a libc-supplied SIGSEGV handler, we could do it,
> but that would be kind of surprising to application code (even if
> the handler passed on unrecognized SIGSEGV to handlers).  We already
> tweak sigprocmask etc. when threads are enabled, so there is some
> precedent, but all this seems rather brittle to me.
> 
That was explaining the idea; you check for crossing a page boundary first.
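To make that concrete, a rough untested sketch for 1 <= n < 16, assuming
the page-boundary check has already established that 16 bytes can be read
from both buffers:

#include <emmintrin.h>

static int
memcmp_small (const unsigned char *x, const unsigned char *y, int n)
{
  __m128i a = _mm_loadu_si128 ((const __m128i *) x);
  __m128i b = _mm_loadu_si128 ((const __m128i *) y);
  /* Bitmask with a one for each mismatching byte.  */
  unsigned int mask = ~_mm_movemask_epi8 (_mm_cmpeq_epi8 (a, b)) & 0xffff;
  /* Force a stop at byte n - 1 so mismatches past the end are ignored;
     if the bytes there are equal the result is 0, as it should be.  */
  mask |= 1u << (n - 1);
  int i = __builtin_ctz (mask);
  return x[i] - y[i];
}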

> (If it weren't for mmap, we could make sure that we never hand out
> memory close to the end of an allocated arena on the heap.)
> 
> >When you use memcmp in a search and 90% of the time the mismatch is in
> >the first byte, then adding that test will improve the total time.
> 
> I tried to add such an early bailout to the BSWAP version, and it
> wasn't a win.  Obviously, such an early check is totally
> incompatible with my goal of addressing timing oracles.
> 
> >>If such a change is acceptable in principle, I will work on similar
> >>adjustments to the other x86 implementations.
> >>
> >I would rather improve the new implementation.
> 
> Your version definitely looks promising.  I think I could live with
> it if we got rid of the bytewise comparisons in the cross-page case.
> 
> Do you have an idea how to benchmark the cross-page case?
> 
Not yet.  As the address is under our control, we could by randomization
ensure that this branch does not happen often.  Benchmarking it is more
difficult, as instruction-cache and branch-predictor misses become more
likely.

One way to do this is to create a hundred or so copies of the function
and call them to emulate these conditions.
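A rough sketch of such a harness (made up for illustration, not the glibc
benchtests; the byte loop only stands in for the routine under test):

#include <stddef.h>

#define DEFINE_COPY(i) \
  static int __attribute__ ((noinline)) \
  cmp_copy_##i (const unsigned char *a, const unsigned char *b, size_t n) \
  { \
    for (size_t j = 0; j < n; j++) \
      if (a[j] != b[j]) \
        return a[j] - b[j]; \
    return 0; \
  }

DEFINE_COPY (0) DEFINE_COPY (1) DEFINE_COPY (2) DEFINE_COPY (3)
/* ... and so on up to the hundred or so copies.  */

typedef int (*cmp_fn) (const unsigned char *, const unsigned char *, size_t);

static const cmp_fn copies[] = { cmp_copy_0, cmp_copy_1,
                                 cmp_copy_2, cmp_copy_3 };

/* Rotate through the copies so each call sees cold instruction-cache
   lines and branch-predictor entries.  */
static int
call_cold (const unsigned char *a, const unsigned char *b, size_t n,
           size_t iter)
{
  return copies[iter % (sizeof copies / sizeof copies[0])] (a, b, n);
}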

What I have now is the following:

#include <stdint.h>

/* Load the first n (1 <= n <= 8) bytes at x into the top of a 64-bit
   word, with the unused low bits zeroed.  When an 8-byte load from x
   would cross a page boundary, load the 8 bytes ending at x + n instead;
   that read stays within the page containing x.  */
uint64_t
load (const char *x, int n)
{
  uint64_t ret;
  if ((((uintptr_t) x) & 4095) > 4096 - 8)
    {
      /* Wanted bytes are already at the top; clear the low bits.  */
      ret = *((const uint64_t *) (x + n - 8));
      ret = ret >> (64 - 8 * n);
      ret = ret << (64 - 8 * n);
    }
  else
    {
      /* Shift the wanted bytes to the top, discarding bytes past x + n.  */
      ret = *((const uint64_t *) x);
      ret = ret << (64 - 8 * n);
    }
  return ret;
}

/* The loads are little-endian, so the lowest differing byte is the first
   differing byte in memory.  Find it with BSF and compare it as memcmp
   does.  */
int
cmp (uint64_t a, uint64_t b)
{
  uint64_t diff = a ^ b;
  if (diff == 0)
    return 0;
  int shift = __builtin_ctzll (diff) & ~7;
  return (int) ((a >> shift) & 0xff) - (int) ((b >> shift) & 0xff);
}

/* Cross-page comparison for n <= 16: compare the first and the last
   (possibly overlapping) up-to-8-byte chunks.  */
int
memcmp_cross (const char *x, const char *y, int n)
{
  if (n == 0)
    return 0;
  if (n < 8)
    return cmp (load (x, n), load (y, n));

  int ret = cmp (load (x, 8), load (y, 8));
  if (ret)
    return ret;
  return cmp (load (x + n - 8, 8), load (y + n - 8, 8));
}

