This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp
- From: Florian Weimer <fweimer at redhat dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Mon, 03 Feb 2014 14:01:14 +0100
- Subject: Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp
- Authentication-results: sourceware.org; auth=none
- References: <52EBBCC2 dot 7090807 at redhat dot com> <20140131171911 dot GA25609 at domone dot podge>
On 01/31/2014 06:19 PM, Ondřej Bílka wrote:
> I sent a new implementation three months ago but did not have time to
> ping it, because I have sent a replacement here:
> https://sourceware.org/ml/libc-alpha/2013-09/msg00556.html
> Could you also test which of these is faster?
The main benefit comes from the lack of length-specific special cases.
This means it's considerably faster on my /usr/share/dict/words sorting
benchmark and sorting random binary blobs of random length. If the
length is constant across calls and the indirect jump is predicted
correctly, it is less clear which version is faster. A pairwise t-test
against the BSWAP version on the benchtest output does not yield
conclusive results, although looking at the data suggests that yours is
faster.
I didn't try to exercise the page-crossing code path, although its
byte-wise comparisons would be problematic from a timing oracle point of
view.
In case BSWAP turns out to be too slow for some CPUs, it could be
replaced with BSF and a shift (both constant time in current
processors), like we already do in some of the existing memcmp
implementations.
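As a rough illustration of the BSF-and-shift idea (a hypothetical sketch, not the actual glibc code): compare 16 bytes with SSE2, invert the PMOVMSKB result to get a mismatch bitmask, and locate the first differing byte with BSF (here via __builtin_ctz), all without data-dependent branching on the byte values:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical sketch: compare the first 16 bytes of x and y and,
 * if they differ, return the signed difference of the first
 * mismatching byte pair, located with __builtin_ctz (BSF).
 * Both buffers must have at least 16 readable bytes. */
static int
cmp16 (const unsigned char *x, const unsigned char *y)
{
  __m128i a = _mm_loadu_si128 ((const __m128i *) x);
  __m128i b = _mm_loadu_si128 ((const __m128i *) y);
  /* cmpeq gives 0xFF in each equal byte lane; movemask packs that
     into 16 bits, so after the XOR a set bit marks a mismatch.  */
  unsigned mask = _mm_movemask_epi8 (_mm_cmpeq_epi8 (a, b)) ^ 0xFFFF;
  if (mask == 0)
    return 0;                           /* all 16 bytes equal */
  unsigned idx = __builtin_ctz (mask);  /* index of first mismatch */
  return x[idx] - y[idx];
}
```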
> In my implementation I use a test of the first 16 bytes.
> It really helps when you can read a few bytes past the end and then
> test whether the mismatch is within bounds. I originally did that, but
> it did not improve performance because of the mispredictions these
> checks caused. What I have now comes from studying the input. In
> practice a memcmp (x, y, n) which ends early typically has n < 16,
> where you can simply set the (n-1)-th bit in the bitmask of mismatches.
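The bitmask trick quoted above might look roughly like this (a hypothetical sketch for 1 <= n <= 16, with a sentinel bit at position n; note that it unconditionally loads 16 bytes from each buffer, which is exactly the over-read the reply below objects to, so assume here both buffers are known to have 16 readable bytes):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical sketch: memcmp for 1 <= n <= 16.  A sentinel bit at
 * position n caps the CTZ scan at the requested length, so no
 * per-byte length check is needed.  Requires 16 readable bytes in
 * both buffers regardless of n (a deliberate over-read). */
static int
memcmp_small (const unsigned char *x, const unsigned char *y, size_t n)
{
  __m128i a = _mm_loadu_si128 ((const __m128i *) x);
  __m128i b = _mm_loadu_si128 ((const __m128i *) y);
  /* Bit i is set where byte i differs.  */
  unsigned mask = _mm_movemask_epi8 (_mm_cmpeq_epi8 (a, b)) ^ 0xFFFF;
  mask |= 1u << n;                 /* pretend byte n mismatches */
  unsigned idx = __builtin_ctz (mask);
  return idx >= n ? 0 : x[idx] - y[idx];
}
```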
Reading past the end is impossible because of mmap. If we're willing to
run with a libc-supplied SIGSEGV handler, we could do it, but that would
be kind of surprising to application code (even if the handler passed on
unrecognized SIGSEGV to handlers). We already tweak sigprocmask etc.
when threads are enabled, so there is some precedent, but all this seems
rather brittle to me.
(If it weren't for mmap, we could make sure that we never hand out
memory close to the end of an allocated arena on the heap.)
> When you use memcmp in a search, and 90% of the time the mismatch is
> in the first byte, then adding that test will improve the total time.
I tried to add such an early bailout to the BSWAP version, and it wasn't
a win. Obviously, such an early check is totally incompatible with my
goal of addressing timing oracles.
>> If such a change is acceptable in principle, I will work on
>> similar adjustments to the other x86 implementations.
> I would rather improve the new implementation.
Your version definitely looks promising. I think I could live with it
if we got rid of the bytewise comparisons in the cross-page case.
Do you have an idea how to benchmark the cross-page case?
--
Florian Weimer / Red Hat Product Security Team