This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp
- From: "Carlos O'Donell" <carlos at redhat dot com>
- To: Florian Weimer <fweimer at redhat dot com>, GNU C Library <libc-alpha at sourceware dot org>
- Cc: Ondřej Bílka <neleai at seznam dot cz>
- Date: Mon, 03 Feb 2014 16:30:37 -0500
- Subject: Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp
- Authentication-results: sourceware.org; auth=none
- References: <52EBBCC2 dot 7090807 at redhat dot com> <52EBC480 dot 4000509 at redhat dot com> <52EF6E9E dot 8050708 at redhat dot com>
On 02/03/2014 05:25 AM, Florian Weimer wrote:
> On 01/31/2014 04:42 PM, Carlos O'Donell wrote:
>
>> * More comments in the assembly explaining the structure of the new
>> code and how it operates.
>
> Okay, the comment-worthy aspects might be:
>
> Some meta-comment explaining the byte numbering which illustrates the
> register contents in the corner cases.
>
> The rationale for the big-endian conversion (to match address
> ordering), separate 32-bit and 64-bit cases (64-bit BSWAP is much
> slower), and perhaps the conversion from flags to the -1/1 result
> using SBB/OR. (All this applies to diffin4bytes/diffin8bytes.)
>
> Unaligned loads overlapping with data previously compared as equal.
> This data cannot affect the outcome of the final comparison. (This
> trick was already used before, but it's not entirely obvious.)
>
> The SETNE/CMOVAE sequence for the branchless sign function.
>
> Cases where the sign function isn't needed because the result of the
> sign subtraction fits into the 32-bit int result.
>
> Anything else?
No, that looks good from a commenting perspective.
>> * Verification via the microbenchmarks that it actually makes
>> things faster. Additionally, we'd like you to add to the
>> microbenchmarks if you find we need another test to measure your
>> changes. See benchtests/README. We want to be as objective as
>> possible here and allow others to run the same or similar tests to
>> compare your patch on their hardware.
>
> There already is bench-memcmp. I analyzed its output (after bumping
> the inner loop count to a reasonable value) and it does not show that
> the new implementation is slower across the board. I also verified
> this using R, with a paired t-test: the new implementation appears to
> be slightly faster, but it's not statistically significant.
Thanks, that's what I wanted to know.
> In addition, I performed specialized comparisons for varying lengths
> and shared prefixes, and benchmarked qsort on random data and
> /usr/share/dict/words. My test loop uses CPUID, yet it turns out
> that getting good measurements of individual memcmp calls is really
> quite tricky. My benchmark code is available here:
> <https://github.com/fweimer/memcmp-timing>
Do you think it would be useful to integrate your benchmark as an
additional test, or even replace the memcmp test we have with yours?
> What would be a reasonable benchmark? We could read the manual
> sources, permute their lines randomly, sort them using qsort, and
> measure the total time used for sorting. Or we could generate random
> input data based on some distribution and measure sorting time for
> that. The challenge is to get reasonable input data without bundling
> large files. If we reuse something that is part of the source code
> distribution anyway, comparisons between different releases are not
> valid, but neither is anything that uses qsort. Still, I would
> recommend the sorting approach because it seems realistic to me, and
> qsort does not add all that much overhead over the actual comparison.
We do not know what would be a reasonable benchmark or we would
have it :-)
We are trying to gather consensus on reasonable benchmarks and
put them into the glibc microbenchmarks.
We know that at some point we need to add whole-system benchmarking,
but we aren't there yet. Ondrej did propose a framework for this
but it hasn't been reviewed.
> One thing I would like to get consensus on eventually is whether
> future implementations of memcmp must not behave as a timing oracle.
> (There might be architectures where achieving this has a run-time
> cost.) If we are unwilling to make this promise, the oracle-free
> version would have to be conditional on _FORTIFY_SOURCE.
The present goal of the library is to be a high performance C library.
Requiring that memcmp not behave as a timing oracle is out of the
question for a *default* configuration.
It is entirely possible, in my opinion, to conditionalize that on
_FORTIFY_SOURCE, or to provide a future system tunable that exposes
this same feature at runtime via an IFUNC.
Cheers,
Carlos.