


Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp


On Fri, Jan 31, 2014 at 04:09:54PM +0100, Florian Weimer wrote:
> This patch vectorizes the difference extraction in memcmp.  It does
> so by loading all relevant input bytes, converting them to
> big-endian using BSWAP, and comparing the result.
> 
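
For an 8-byte chunk I read the above as roughly the following (my own C
sketch, assuming unaligned 8-byte loads are acceptable on the target;
names are mine, not the patch's):

/* Sketch of the BSWAP idea for one 8-byte chunk.  Only the sign of
   the result matters, which is why no byte difference is computed.  */
#include <stdint.h>
#include <string.h>

static int
cmp8_bswap (const void *a, const void *b)
{
  uint64_t x, y;
  memcpy (&x, a, 8);
  memcpy (&y, b, 8);
  /* After BSWAP the byte at the lowest address is the most significant
     one, so an unsigned comparison of the words gives memcmp order.  */
  x = __builtin_bswap64 (x);
  y = __builtin_bswap64 (y);
  if (x == y)
    return 0;
  return x > y ? 1 : -1;
}
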
> This changes the return value of memcmp, but stays within the
> interface contract.  Some corner cases in the old implementation do
> not return actual byte differences, either, so I hope this will not
> break applications.
> 
> On my i7-2620M, some lengths (mostly 5, 6, 7) are about one cycle
> slower than the old implementation, but only if all branches in the
> old implementation are correctly predicted.  If there is

I sent a new implementation three months ago but did not have time to
ping it; I posted a replacement here:

https://sourceware.org/ml/libc-alpha/2013-09/msg00556.html

Could you also test which of the two is faster?

> misprediction (as there is bound to be when sorting strings), the
> vectorized version is consistently faster.  The glibc benchtests
> favor the old implementation[*], but the speed difference between
> tests slightly favors the new implementation (but the difference is
> not clearly significant).  I could probably game that by using a
> data-dependent branch for those lengths, but I think the fully
> vectorized version is better for real-world input data.
>
The benchtests are still inadequate here: they run the same inputs in a
tight loop, which does not capture branch misprediction and therefore
gives biased results.
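
To see the misprediction cost you have to vary the inputs between
calls, roughly along these lines (a rough harness sketch of mine, not
the benchtest code; the sizes and mismatch distribution are made up):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define PAIRS 4096
#define LEN 16

int
main (void)
{
  static unsigned char a[PAIRS][LEN], b[PAIRS][LEN];

  srand (1);
  for (int i = 0; i < PAIRS; i++)
    {
      int mismatch = rand () % LEN;   /* first byte allowed to differ */
      for (int j = 0; j < LEN; j++)
        {
          a[i][j] = rand () & 0xff;
          b[i][j] = j < mismatch ? a[i][j] : (unsigned char) (a[i][j] ^ 1);
        }
    }

  struct timespec t0, t1;
  volatile int sink = 0;
  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (int r = 0; r < 1000; r++)
    for (int i = 0; i < PAIRS; i++)
      sink += memcmp (a[i], b[i], LEN);  /* mismatch position varies per pair */
  clock_gettime (CLOCK_MONOTONIC, &t1);

  double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
  printf ("%.2f ns/call\n", ns / (1000.0 * PAIRS));
  return 0;
}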
 
> In case BSWAP turns out to be too slow for some CPUs, it could be
> replaced with BSF and a shift (both constant time in current
> processors), like we already do in some of the existing memcmp
> implementations.
> 
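
For reference, I read the BSF-and-shift variant as something like the
following (my own sketch, little-endian only, not the existing glibc
code):

/* Find the first differing byte of two 8-byte chunks without BSWAP.
   Assumes unaligned loads are acceptable.  */
#include <stdint.h>
#include <string.h>

static int
cmp8_bsf (const void *a, const void *b)
{
  uint64_t x, y;
  memcpy (&x, a, 8);
  memcpy (&y, b, 8);
  if (x == y)
    return 0;
  /* The lowest set bit of x ^ y lies in the first differing byte
     (the byte at the lowest address on little-endian).  Shift both
     words so that byte can be compared directly.  */
  unsigned int shift = __builtin_ctzll (x ^ y) & ~7u;
  return (int) ((x >> shift) & 0xff) - (int) ((y >> shift) & 0xff);
}
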
In my implementation I use a test of the first 16 bytes.

It really helps when you can read a few bytes past the end and then
test whether the first mismatch lies within the requested length. I
originally did that, but it did not improve performance because of the
mispredictions these checks caused. What I have now comes from studying
the inputs. In practice a memcmp (x, y, n) that ends early typically
has n < 16, where in the bitmask of mismatches you can just set the
(n-1)-th bit to one.
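
A sketch of that n < 16 case (my own code, and it assumes that reading
16 bytes from both inputs is safe, i.e. no page boundary is crossed --
that is the read-past-the-end part above):

#include <stdint.h>
#include <emmintrin.h>

static int
memcmp_lt16 (const unsigned char *x, const unsigned char *y, unsigned int n)
{
  __m128i a = _mm_loadu_si128 ((const __m128i *) x);
  __m128i b = _mm_loadu_si128 ((const __m128i *) y);
  /* Bit i of mask is 1 where byte i of x and y differ.  */
  unsigned int mask = ~_mm_movemask_epi8 (_mm_cmpeq_epi8 (a, b)) & 0xffff;
  /* Setting bit n-1 bounds the scan to the requested length: if the
     first n bytes match, the first set bit is n-1 and the bytes there
     are equal, so 0 is returned.  */
  mask |= 1u << (n - 1);
  unsigned int i = __builtin_ctz (mask);
  return x[i] - y[i];
}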

> This work was prompted by a desire to make the time taken by memcmp
> independent of the length of the shared prefix of the arguments as
> far as possible.  The current implementation can still be used to
> confirm a correct guess of a 4-byte or 8-byte prefix, but this is
> extremely unlikely and not suitable for use in a timing oracle (at
> least the 8-byte guess).  The new implementation should not act as a
> cache oracle either because it does not perform data-dependent
> single-byte loads.
> 
These factors depend on the workload. With a distribution where a
mismatch at a given byte is ten times less likely than at the previous
one, a simple byte loop is optimal.
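
As a back-of-the-envelope check (my own toy calculation, not a
measurement), under that distribution a byte loop inspects only about
1.1 bytes on average, so a fixed 16-byte check is mostly wasted work:

#include <stdio.h>

int
main (void)
{
  double expected = 0.0, prob = 0.9;  /* P(first mismatch at byte 1) */
  for (int k = 1; k <= 16; k++)
    {
      expected += k * prob;
      prob /= 10.0;                   /* each further byte 10x less likely */
    }
  printf ("expected bytes inspected: %.3f\n", expected);  /* ~1.11 */
  return 0;
}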

When you use memcmp in a search and 90% of the time the mismatch is in
the first byte, then adding that test will improve the total time. When
the mismatch is mostly around the eighth byte, it only slows us down. I
do not know the answer to that yet (short of adding a userspace
profile-feedback counter to gcc to estimate that probability for a
given call site).

I did measure the performance impact for gcc and firefox:

http://kam.mff.cuni.cz/~ondra/benchmark_string/memcmp_firefox/result.html
When I printed what gets compared, it turned out to be sorting a
hunspell dictionary.

http://kam.mff.cuni.cz/~ondra/benchmark_string/memcmp_gcc/result.html
For gcc I then noticed that older gcc versions do not use memcmp at
all, so the input distribution depends on the version.

The profiler is here if you want to replicate the measurements:
http://kam.mff.cuni.cz/~ondra/benchmark_string/memcmp_profile200913.tar.bz2

> If such a change is acceptable in principle, I will work on similar
> adjustments to the other x86 implementations.
>
I would rather improve the new implementation.
 
> [*] Tight loop over identical inputs, after bumping the inner loop
> count from 64 to 4096.  If I don't do this, the tests are very
> fragile, but the new implementation seems to be roughly twice as
> fast across the board.  Using qsort to sort random strings of fixed
> length is about 10% faster overall using the new implementation.  I
> still need to try to sort a random permutation of /usr/dict/word and
> compare the results.
> 
> -- 

