This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Florian Weimer <fweimer at redhat dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Fri, 31 Jan 2014 18:19:11 +0100
- Subject: Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp
- Authentication-results: sourceware.org; auth=none
- References: <52EBBCC2 dot 7090807 at redhat dot com>
On Fri, Jan 31, 2014 at 04:09:54PM +0100, Florian Weimer wrote:
> This patch vectorizes the difference extraction in memcmp. It does
> so by loading all relevant input bytes, converting them to
> big-endian using BSWAP, and comparing the result.
>
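The difference extraction described above can be sketched roughly as follows. This is a hedged illustration, not the actual patch code: the helper name `diff8` is invented, and it shows only the scalar 8-byte case of the idea (load both words, BSWAP so byte order matches lexicographic order, compare as integers).

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch (not the patch itself): on little-endian x86_64,
   a plain 64-bit load puts the first byte in memory into the least
   significant position.  BSWAP reverses that, so an unsigned integer
   comparison then agrees with byte-by-byte memcmp order.  */
static int diff8(const unsigned char *x, const unsigned char *y)
{
    uint64_t a, b;
    memcpy(&a, x, 8);
    memcpy(&b, y, 8);
    a = __builtin_bswap64(a);   /* first memory byte -> most significant */
    b = __builtin_bswap64(b);
    return (a > b) - (a < b);   /* sign of the first differing byte */
}
```

Note that this returns only the sign of the difference, not the actual byte difference, which is why the quoted text points out that the return value changes while staying within the interface contract.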
> This changes the return value of memcmp, but stays within the
> interface contract. Some corner cases in the old implementation do
> not return actual byte differences, either, so I hope this will not
> break applications.
>
> On my i7-2620M, some lengths (mostly 5, 6, 7) are about one cycle
> slower than the old implementation, but only if all branches in the
> old implementation are correctly predicted. If there is
I sent a new implementation three months ago but did not find time to ping
it, because I had sent a replacement here:
https://sourceware.org/ml/libc-alpha/2013-09/msg00556.html
Could you also test which of these is faster?
> misprediction (as there is bound to be when sorting strings), the
> vectorized version is consistently faster. The glibc benchtests
> favor the old implementation[*], but the speed difference between
> tests slightly favors the new implementation (but the difference is
> not clearly significant). I could probably game that by using a
> data-dependent branch for those lengths, but I think the fully
> vectorized version is better for real-world input data.
>
The benchtests are still inadequate here: they run with the same inputs in
a tight loop, which does not capture branch misprediction and therefore
gives biased results.
> In case BSWAP turns out to be too slow for some CPUs, it could be
> replaced with BSF and a shift (both constant time in current
> processors), like we already do in some of the existing memcmp
> implementations.
>
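The BSF-and-shift alternative mentioned in the quote could look roughly like this. A hedged sketch: the helper name `diff8_bsf` is invented, and it assumes a little-endian x86_64 target where the lowest set bit of the XOR corresponds to the first differing byte in memory order.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch of the BSF + shift approach: XOR the two words,
   find the bit offset of the first difference with CTZ (BSF family),
   round it down to a byte boundary, and shift that byte into place.
   Both CTZ and the shift run in constant time on current processors. */
static int diff8_bsf(const unsigned char *x, const unsigned char *y)
{
    uint64_t a, b;
    memcpy(&a, x, 8);
    memcpy(&b, y, 8);
    uint64_t d = a ^ b;
    if (d == 0)
        return 0;
    unsigned shift = __builtin_ctzll(d) & ~7u; /* bit offset of first differing byte */
    return (int)((a >> shift) & 0xff) - (int)((b >> shift) & 0xff);
}
```

Unlike the BSWAP sketch, this variant returns the actual byte difference, matching what some of the existing memcmp implementations do.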
In my implementation I use a test of the first 16 bytes.
It really helps when you can read a few bytes past the end and then check
whether the mismatch falls within range. I originally did that, but it did
not improve performance because of the misprediction these checks caused.
What I have now comes from studying the inputs. In practice, a
memcmp (x, y, n) which ends early typically has n < 16, so in the bitmask
of mismatches you can simply set the (n-1)-th bit to one.
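The bitmask trick above can be sketched like this (a hypothetical illustration, not the posted patch; `memcmp16` is an invented name). Forcing bit (n-1) in the mismatch mask guarantees the first set bit is at index n-1 or earlier; if all n bytes are equal, the selected index is n-1, where the byte difference is zero, so the result is still correct without any branch on n.

```c
#include <emmintrin.h>

/* Sketch of the short-memcmp idea for n in 1..16.  The caller must
   guarantee that 16 bytes are readable at both x and y (the "read a
   few bytes past the end" trick from the discussion above). */
static int memcmp16(const unsigned char *x, const unsigned char *y,
                    unsigned n)
{
    __m128i a = _mm_loadu_si128((const __m128i *) x);
    __m128i b = _mm_loadu_si128((const __m128i *) y);
    /* Bit i of eq is set where bytes are equal; invert for mismatches. */
    unsigned eq   = (unsigned) _mm_movemask_epi8(_mm_cmpeq_epi8(a, b));
    unsigned diff = (~eq & 0xffffu) | (1u << (n - 1)); /* force sentinel bit */
    unsigned i    = __builtin_ctz(diff);               /* first mismatch <= n-1 */
    return x[i] - y[i];  /* zero when the forced bit was the first one set */
}
```

Any mismatch at or beyond position n is masked out by the sentinel, so bytes past the logical end never affect the result.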
> This work was prompted by a desire to make the time taken by memcmp
> independent of the length of the shared prefix of the arguments as
> far as possible. The current implementation can still be used to
> confirm a correct guess of a 4-byte or 8-byte prefix, but this is
> extremely unlikely and not suitable for use in a timing oracle (at
> least the 8-byte guess). The new implementation should not act as a
> cache oracle either because it does not perform data-dependent
> single-byte loads.
>
These factors depend on the workload. With a distribution where a mismatch
at a given byte is ten times less likely than at the previous one, a
simple loop is optimal.
When you use memcmp in a search and 90% of the time the mismatch is in the
first byte, adding that test will improve total time. When the mismatch is
mostly around the eighth byte, it only slows us down. I do not know the
answer for that yet (short of adding a userspace profile-feedback counter
to gcc to estimate that probability for a given call site).
I measured the performance impact for gcc and firefox:
http://kam.mff.cuni.cz/~ondra/benchmark_string/memcmp_firefox/result.html
When I printed what gets compared, it is sorting a hunspell dictionary.
http://kam.mff.cuni.cz/~ondra/benchmark_string/memcmp_gcc/result.html
For gcc I then noticed that older gcc versions do not use memcmp at all,
so the input distribution depends on the version.
The profiler is here if you want to replicate:
http://kam.mff.cuni.cz/~ondra/benchmark_string/memcmp_profile200913.tar.bz2
> If such a change is acceptable in principle, I will work on
> similar adjustments to the other x86 implementations.
>
I would rather improve the new implementation.
> [*] Tight loop over identical inputs, after bumping the inner loop
> count from 64 to 4096. If I don't do this, the tests are very
> fragile, but the new implementation seems to be roughly twice as
> fast across the board. Using qsort to sort random strings of fixed
> length is about 10% faster overall using the new implementation. I
> still need to try to sort a random permutation of /usr/dict/word and
> compare the results.
>
> --