This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH] Optimize strstr, strcasestr and memmem
- From: Maxim Kuvyrkov <maxim at codesourcery dot com>
- To: Eric Blake <eblake at redhat dot com>, Carlos O'Donell <carlos at systemhalted dot org>
- Cc: GLIBC Devel <libc-alpha at sourceware dot org>,OndÅej BÃlka <neleai at seznam dot cz>, "Joseph S. Myers"<joseph at codesourcery dot com>
- Date: Sat, 19 May 2012 11:15:31 +1200
- Subject: Re: [PATCH] Optimize strstr, strcasestr and memmem
- References: <2C516CF2-D083-4C1D-AD27-6A31D381D548@codesourcery.com> <Pine dot LNX dot 4 dot 64 dot 1205172218020 dot 28988 at digraph dot polyomino dot org dot uk> <7416069C-E1FD-4F64-81DD-F09C726E63A0 at codesourcery dot com> <20120518061821 dot GA2911 at domone dot kolej dot mff dot cuni dot cz> <CADZpyiyMWPGm+2RzqKE9_f-N=ZqgFs2goZJGV8w2wv2=F_-hnQ@mail.gmail.com> <4FB6BE5E.email@example.com>
On 19/05/2012, at 9:25 AM, Eric Blake wrote:
> On 12/31/1969 05:00 PM, wrote:
>>> I use only SSE2 instructions and my algorithm is faster than using SSE4 instructions.
>>> With trivial modifications I could also use plain arithmetic/AVX2
> I knew that when I first posted the twoway strstr implementation that it
> was not optimal for short searches, and I totally welcome these attempts
> to approve things. My biggest concern is the avoidance of quadratic
> effects (strstr should always be linear speed, as proven by the twoway
> algorithm, and memmem should be sublinear with large enough needles, by
> skipping over portions of the haystack).
WRT to the patch I posted, it doesn't add quadratic behavior whatsoever, it reduces the linear factor for small strings (which currently is proportional to the length of the needle due to use of memchr).
>> The biggest problem I have with reviewing *any* of this code is that
>> performance is relative to benchmarks.
>> As a community we have no baseline benchmark numbers.
> Precisely - the very reason that I have not tried to further optimize
> anything in glibc beyond my twoway string, and the reason the SSE4
> quadratic pessimization even got in to glibc in the first place, is
> because we don't have a good benchmark.
I've been working on benchmarking patches for a couple of GLIBC subsystems lately (strstr and malloc), and it turned out to be a much more difficult task than it should be. I'm thinking on design of a benchmarking harness for GLIBC testsuite to effectively address performance regressions and patch submissions.
The testsuite already has good coverage of the GLIBC functions, so I'm thinking to wrapping that into a benchmarking harness to get performance measurements out of what we already have, not creating subsystem-specific benchmarks. The current idea I'm looking at is extending test-skeleton framework to gather performance numbers.
The benchmark design goals that I have so far:
1. Benchmarks will run a fixed amount of time, say 30 seconds. This will be controlled by a timer in test-skeleton.c similar to what is used to handle tests that time-out. We cannot afford to significantly extend the time it takes GLIBC testsuite to run.
2. Test-skeleton.c will be executing do_test() function in a loop, and the number of full executions divided by the time the executions took will be benchmark's score.
3. There is no goal to have benchmark score comparable between different systems. Benchmark scores will be meaningful only within a single system/setup. The goal is to provide supporting arguments for patch submissions and regression tracking. For a second time in two years I'm looking at what in GLIBC has caused a 50% slowdown on a customer benchmark -- that's overall benchmark slowdown, which means something in GLIBC got 5+ times slower. Surely we can do better.
4. For tests that can handle dummy implementations of GLIBC functions (i.e., function always returning the same value or otherwise being very fast) there will be a "training" mode to calculate the cost of the testing code / loops / etc that is _not_ the functions being benchmarked. After running the benchmark in train mode for, say, 5 seconds the functions will be switched from dummy to the real ones and benchmarked. Such approach will allow to precisely estimate performance of just the GLIBC functions of interest, removing the overhead of the testing/benchmarking harness from the overall benchmark score.
>> I would like to see someone from the community come forward to help
>> sort this out.
>> I think at a minimum we need:
>> * A list of baseline benchmarks we are going to use to evaluate
>> performance related submissions.
As I said above, I think creating tests that are only benchmarks will be too much effort for anyone to actually do it.
>> * Collect data on the runs.
... as part of default testsuite run. My current preference is for a CSV-formated summary file with a single number per benchmark.
>> * Record everything in the wiki including how to run the benchmarks
>> and the collection of growing results.
Yes. I would prefer people with automated continuos build/test setups to push the CSV files into Google spreadsheets with a pretty graph showing geomeans over benchmarks over time. We then will embed those "live" graphs into the wiki.
CodeSourcery / Mentor Graphics