This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp
- From: "Carlos O'Donell" <carlos at redhat dot com>
- To: Florian Weimer <fweimer at redhat dot com>, GNU C Library <libc-alpha at sourceware dot org>
- Cc: Ondřej Bílka <neleai at seznam dot cz>
- Date: Mon, 03 Feb 2014 16:30:37 -0500
- Subject: Re: [PATCH] Optimize SSE 4.1 x86_64 memcmp
- Authentication-results: sourceware.org; auth=none
- References: <52EBBCC2 dot 7090807 at redhat dot com> <52EBC480 dot 4000509 at redhat dot com> <52EF6E9E dot 8050708 at redhat dot com>
On 02/03/2014 05:25 AM, Florian Weimer wrote:
> On 01/31/2014 04:42 PM, Carlos O'Donell wrote:
>
>> * More comments in the assembly explaining the structure of the new
>> code and how it operates.
>
> Okay, the comment-worthy aspects might be:
>
> Some meta-comment explaining the byte numbering which illustrates the
> register contents in the corner cases.
>
> The rationale for the big-endian conversion (to match address
> ordering), separate 32-bit and 64-bit cases (64-bit BSWAP is much
> slower), and perhaps the conversion from flags to the -1/1 result
> using SBB/OR. (All this applies to diffin4bytes/diffin8bytes.)
>
> Unaligned loads overlapping with data previously compared as equal.
> This data cannot affect the outcome of the final comparison. (This
> trick was already used before, but it's not entirely obvious.)
>
> The SETNE/CMOVAE sequence for the branchless sign function.
>
> Cases where the sign function isn't needed because the result of the
> sign subtraction fits into the 32-bit int result.
>
> Anything else?
No, that looks good from a commenting perspective.
>> * Verification via the microbenchmarks that it actually makes
>> things faster. Additionally, we'd like you to add to the
>> microbenchmarks if you find we need another test to measure your
>> changes. See benchtests/README. We want to be as objective as
>> possible here and allow others to run the same or similar tests to
>> compare your patch on their hardware.
>
> There already is bench-memcmp. I analyzed its output (after bumping
> the inner loop count to a reasonable value) and it does not show that
> the new implementation is slower across the board. I also verified
> this using R, with a paired t-test: the new implementation appears to
> be slightly faster, but it's not statistically significant.
Thanks, that's what I wanted to know.
> In addition, I performed specialized comparisons for varying lengths
> and shared prefixes, and benchmarked qsort on random data and
> /usr/share/dict/words. My test loop uses CPUID, yet it turns out
> that getting good measurements of individual memcmp calls is really
> quite tricky. My benchmark code is available here:
> <https://github.com/fweimer/memcmp-timing>
Do you think it would be useful to integrate your benchmark as an
additional test, or even replace the memcmp test we have with yours?
> What would be a reasonable benchmark? We could read the manual
> sources, permute their lines randomly, sort them using qsort, and
> measure the total time used for sorting. Or we could generate random
> input data based on some distribution and measure sorting time for
> that. The challenge is to get reasonable input data without bundling
> large files. If we reuse something that is part of the source code
> distribution anyway, comparisons between different releases are not
> valid, but neither is anything that uses qsort. Still, I would
> recommend the sorting approach because it seems realistic to me, and
> qsort does not add all that much overhead over the actual comparison.
We do not know what would be a reasonable benchmark or we would
have it :-)
We are trying to gather consensus on reasonable benchmarks and
put them into the glibc microbenchmarks.
We know that at some point we need to add whole-system benchmarking,
but we aren't there yet. Ondrej did propose a framework for this
but it hasn't been reviewed.
> One thing I would like to get consensus on eventually is whether
> future implementations of memcmp must not behave as a timing oracle.
> (There might be architectures where achieving this has a run-time
> cost.) If we are unwilling to make this promise, the oracle-free
> version would have to be conditional on _FORTIFY_SOURCE.
The present goal of the library is to be a high performance C library.
Requiring that memcmp not behave as a timing oracle is out of the
question for a *default* configuration.
It is entirely possible, in my opinion, to conditionalize that on
_FORTIFY_SOURCE, or to provide a future system tunable that exposes
this same feature at runtime via an IFUNC.
Cheers,
Carlos.