This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp
- From: Florian Weimer <fweimer at redhat dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Mon, 03 Feb 2014 14:01:14 +0100
- Subject: Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp
- Authentication-results: sourceware.org; auth=none
- References: <52EBBCC2 dot 7090807 at redhat dot com> <20140131171911 dot GA25609 at domone dot podge>
On 01/31/2014 06:19 PM, Ondřej Bílka wrote:
> I sent a new implementation three months ago but did not have time to
> ping it, because I have sent a replacement here:
> https://sourceware.org/ml/libc-alpha/2013-09/msg00556.html
> Could you also test which of these is faster?
The main benefit comes from the lack of length-specific special cases.
This means it's considerably faster on my /usr/share/dict/words sorting
benchmark and sorting random binary blobs of random length. If the
length is constant across calls and the indirect jump is predicted
correctly, it is less clear which version is faster. A pairwise t-test
against the BSWAP version on the benchtest output does not yield
conclusive results, although looking at the data suggests that yours is
faster.
I didn't try to exercise the page-crossing code path, although its
byte-wise comparisons would be problematic from a timing oracle point of
view.
In case BSWAP turns out to be too slow for some CPUs, it could be
replaced with BSF and a shift (both constant time in current
processors), like we already do in some of the existing memcmp
implementations.
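As a rough illustration of the BSF-and-shift idea (a hypothetical sketch, not the actual glibc code): compare 16 bytes with SSE2, invert the PMOVMSKB result to get a mismatch bitmask, and locate the first differing byte with BSF (here via __builtin_ctz), all without data-dependent branching on the byte values:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical sketch: compare the first 16 bytes of x and y and,
 * if they differ, return the signed difference of the first
 * mismatching byte pair, located with __builtin_ctz (BSF).
 * Both buffers must have at least 16 readable bytes. */
static int
cmp16 (const unsigned char *x, const unsigned char *y)
{
  __m128i a = _mm_loadu_si128 ((const __m128i *) x);
  __m128i b = _mm_loadu_si128 ((const __m128i *) y);
  /* cmpeq gives 0xFF in each equal byte lane; movemask packs that
     into 16 bits, so after the XOR a set bit marks a mismatch.  */
  unsigned mask = _mm_movemask_epi8 (_mm_cmpeq_epi8 (a, b)) ^ 0xFFFF;
  if (mask == 0)
    return 0;                           /* all 16 bytes equal */
  unsigned idx = __builtin_ctz (mask);  /* index of first mismatch */
  return x[idx] - y[idx];
}
```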
> In my implementation I use a test of the first 16 bytes.
> It really helps when you can read a few bytes past the end and then
> test whether the mismatch is within bounds. I originally did that, but
> it did not improve performance because of the mispredictions these
> checks caused. What I have now comes from studying the input. In
> practice a memcmp (x, y, n) which ends early typically has n < 16,
> where you can simply set the (n-1)-th bit in the bitmask of mismatches.
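The bitmask trick quoted above might look roughly like this (a hypothetical sketch for 1 <= n <= 16, with a sentinel bit at position n; note that it unconditionally loads 16 bytes from each buffer, which is exactly the over-read the reply below objects to, so assume here both buffers are known to have 16 readable bytes):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical sketch: memcmp for 1 <= n <= 16.  A sentinel bit at
 * position n caps the CTZ scan at the requested length, so no
 * per-byte length check is needed.  Requires 16 readable bytes in
 * both buffers regardless of n (a deliberate over-read). */
static int
memcmp_small (const unsigned char *x, const unsigned char *y, size_t n)
{
  __m128i a = _mm_loadu_si128 ((const __m128i *) x);
  __m128i b = _mm_loadu_si128 ((const __m128i *) y);
  /* Bit i is set where byte i differs.  */
  unsigned mask = _mm_movemask_epi8 (_mm_cmpeq_epi8 (a, b)) ^ 0xFFFF;
  mask |= 1u << n;                 /* pretend byte n mismatches */
  unsigned idx = __builtin_ctz (mask);
  return idx >= n ? 0 : x[idx] - y[idx];
}
```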
Reading past the end is impossible because of mmap. If we're willing to
run with a libc-supplied SIGSEGV handler, we could do it, but that would
be kind of surprising to application code (even if the handler passed on
unrecognized SIGSEGV to handlers). We already tweak sigprocmask etc.
when threads are enabled, so there is some precedent, but all this seems
rather brittle to me.
(If it weren't for mmap, we could make sure that we never hand out
memory close to the end of an allocated arena on the heap.)
> When you use memcmp in a search, and 90% of the time the mismatch is
> in the first byte, then adding that test will improve the total time.
I tried to add such an early bailout to the BSWAP version, and it wasn't
a win. Obviously, such an early check is totally incompatible with my
goal of addressing timing oracles.
>> If such a change is acceptable in principle, I will work on
>> similar adjustments to the other x86 implementations.
> I would rather improve the new implementation.
Your version definitely looks promising. I think I could live with it
if we got rid of the bytewise comparisons in the cross-page case.
Do you have an idea how to benchmark the cross-page case?
--
Florian Weimer / Red Hat Product Security Team