This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] Optimize strstr, strcasestr and memmem
- From: Ondrej Bilka <neleai at seznam dot cz>
- To: Maxim Kuvyrkov <maxim at codesourcery dot com>
- Cc: Carlos O'Donell <carlos at systemhalted dot org>,GLIBC Devel <libc-alpha at sourceware dot org>,Ryan Arnold <rsa at us dot ibm dot com>, Eric Blake <eblake at redhat dot com>
- Date: Fri, 1 Jun 2012 19:16:32 +0200
- Subject: Re: [PATCH] Optimize strstr, strcasestr and memmem
- References: <2C516CF2-D083-4C1D-AD27-6A31D381D548@codesourcery.com><CADZpyiwSMzwbMwwmLG0Sz8RAK8ww_LH_RSt2Nd3-F8+BUDYT-w@mail.gmail.com><0091E444-5D4A-4348-96A6-C693F9F66F1D@codesourcery.com>
On Wed, May 30, 2012 at 08:28:40PM +1200, Maxim Kuvyrkov wrote:
> On 22/05/2012, at 7:49 AM, Carlos O'Donell wrote:
>
> > On Thu, May 17, 2012 at 5:22 PM, Maxim Kuvyrkov <maxim@codesourcery.com> wrote:
> >> This patch optimizes strstr, strcasestr and memmem functions. This patch speeds up strstr, strcase and memmem functions with short needle inputs by more than 2 times on i686, x86_64, MIPS and other architectures.
>
> Carlos,
>
> I appreciate you reviewing the code. I've reworked the patches and will post them in separate sub-threads. Below are comments and notes on the change as a whole.
>
> >
> > How did you measure this e.g. what benchmark code did you use?
>
> Two benchmarks: one is based on libosip (library for parsing SIP messages) and the other one is the testcase from http://sourceware.org/bugzilla/show_bug.cgi?id=11607 . Both benchmark exercise short-needle path of strstr.
>
I also tried my improvements on bug11607 and my code is again twice as fast than sse4.2, 6 times than patch on core2,i7,xeon.
Timing is below.
strstr_glibc.c
20.16user 0.00system 0:20.18elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+555minor)pagefaults 0swaps
strstr_maxim.c
6.16user 0.00system 0:06.16elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+555minor)pagefaults 0swaps
strstr_sse2_tuned.c
0.91user 0.00system 0:00.92elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+554minor)pagefaults 0swaps
strstr_ssse3_two_way.c
1.36user 0.00system 0:01.36elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+554minor)pagefaults 0swaps
strstr_strlen.c
0.63user 0.00system 0:00.64elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+554minor)pagefaults 0swaps
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Duo CPU E7200 @ 2.53GHz
stepping : 6
cpu MHz : 2527.000
cache size : 3072 KB
strstr_glibc.c
1.45user 0.00system 0:01.46elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+554minor)pagefaults 0swaps
strstr_maxim.c
5.41user 0.00system 0:05.42elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+555minor)pagefaults 0swaps
strstr_sse2_tuned.c
0.73user 0.00system 0:00.74elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+554minor)pagefaults 0swaps
strstr_ssse3_two_way.c
1.17user 0.00system 0:01.17elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+554minor)pagefaults 0swaps
strstr_strlen.c
0.68user 0.00system 0:00.68elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+555minor)pagefaults 0swaps
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
stepping : 4
microcode : 0x6
cpu MHz : 1596.000
cache size : 8192 KB
strstr_glibc.c
26.91user 0.00system 0:26.93elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+553minor)pagefaults 0swaps
strstr_maxim.c
7.82user 0.00system 0:07.82elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+553minor)pagefaults 0swaps
strstr_sse2_tuned.c
1.17user 0.00system 0:01.17elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+552minor)pagefaults 0swaps
strstr_ssse3_two_way.c
1.72user 0.00system 0:01.73elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+552minor)pagefaults 0swaps
strstr_strlen.c
0.78user 0.00system 0:00.78elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+552minor)pagefaults 0swaps
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU 5130 @ 2.00GHz
stepping : 6
microcode : 0xcd
cpu MHz : 1995.378
cache size : 4096 KB
> >
> > On which machines?
>
> On Core2 (both -m32 and -m64) and MIPS (n32, n64 and o32).
>
> >
> > vs. SSE42 variants for x86/x86_64 multiarch?
>
> SSE42 implementation is still faster than the generic with the improvements on the two benchmark I tested.
>
> >
> > What impact does this have on cases with long needles? If it's a win
> > everywhere I'll be excited, but I figure some of this code has to be
> > on the hot-path for long needles?
>
> Judging from the -25% - +25% and -3% - +7% improvement/regression from the use-pointers patch (see below) there should be an improvement/regression to long-needle case as well, as the patch is symmetrical for both short- and long-needle cases. No other patch touches long needles.
>
> >
> > I'm currently operating under the naive assumption that all input
> > scenarios are equally important (I don't have data to backup anything
> > else). Is this a valid assumption? For example short needles aren't
> > more important than long-needles, so I will find it hard to accept a
> > patch that slows down long-needle searches... at least not without
> > showing that the mean speed across all uses cases went up.
>
> The patches don't slow-down long needles.
>
> >
> >> GLIBC 2.9 introduced new, algorithmically-superior implementation of strstr, strcasestr and memmem functions. Unfortunately, this new implementation uses memchr to detect end-of-string condition which comes at significant overhead compared to piggy-backing matching procedure that GLIBC 2.8 and earlier versions used. The new implementation heavily regressed the case for short needles, making strstr more than 2 times slower. This patch cures the regression and even makes the GLIBC 2.9+ implementation faster than original GLIBC 2.8- version.
> >
> > Do you have data to back up the statement about the performance
> > regression for short needles from 2.8 to 2.9? I know that it's likely
> > true, but if I have to defend your position to our users I need data.
> >
> >> This patch adds several optimizations:
> >>
> >> - Piggy-back the matching procedure to detect end-of-string for strstr and strcasestr when needle is short. This removes the need to use memchr in two_way_short_needle. [The long-needle case hops through parts of the string, and the same optimization cannot be used there as not every haystack character is read in by the matching loop.]
> >>
> >> - Optimize search for the first character in strstr, strcasestr and memmem functions. This is the hot-spot of the functions.
> >
> > These two changes look good, but I'd like someone else to look them
> > over, my preference is to have Ryan Arnold from the IBM team look it
> > over to ensure that Power also has some benefit here. Please ask him
> > to review this.
> >
> >> - Use pointer references instead of array[index] references. This optimization makes it easier for compiler to generate good code, especially on architectures that don't have base+index addressing.
> >
> > Please split this into another patch.
> >
> > We should evaluate this patch on its own merits.
> >
> >> - Use straight tolower() for strcasestr (complexity of isupper() is about the same as tolower()).
> >
> > Isn't this another micro-optimization that could be broken out into
> > another patch?
>
> I've split the patch into 4 logical pieces, I will post them in separate sub-threads. [I know, I should've done it in the first place.]
>
> Benchmark results for Core2 (i.e., no SSE42):
>
> BZ11607 testcase; seconds, smaller is better (Core2 -m32):
>
> baseline 64.82
> use-optimized-first-character- 51.39
> detect-eol-on-the-fly-in-strst 24.14
> use-pointers-for-traversing-ar 19.40
> micro-optimize-critical-path 19.45
>
> BZ11607 testcase; seconds, smaller is better (Core2 -m64):
> use-__syscall_slong_t-in-bits- 62.59
> use-optimized-first-character- 51.37
> detect-eol-on-the-fly-in-strst 11.25
> use-pointers-for-traversing-ar 14.06
> micro-optimize-critical-path 13.70
>
> Judging from the bug report, GLIBC 2.7's strstr should be at about 8 seconds, but I don't have GLIBC 2.7 handy to test that. When I implemented the above optimizations for MIPS I saw "new" strstr with patches performing faster than GLIBC 2.7 implementation. I will close the bug if don't hear to the contrary.
>
> Libosip benchmark; messages/sec, bigger is better (Core2 -m32)
> baseline 51524
> use-optimized-first-character- 55062 +6.9%
> detect-eol-on-the-fly-in-strst 65743 +19.4%
> use-pointers-for-traversing-ar 70281 +6.9%
> micro-optimize-critical-path 70309 +0.0%
>
> Libosip benchmark; messages/sec, bigger is better (Core2 -m64)
> baseline 58081
> use-optimized-first-character- 62456 +7.5%
> detect-eol-on-the-fly-in-strst 85983 +37.7%
> use-pointers-for-traversing-ar 83326 -3.1%
> micro-optimize-critical-path 81377 -2.3%
>
> The benchmarks show that use-pointer pointer patch improves 32-bit x86 by 7% - 25% and regresses 64-bit x86 by the same amount. The improvement for 32-bit x86 comes most likely from decreased register pressure inside the loops -- that's what I saw for MIPS. It seems that current GCC mainline gets unlucky with compiling pointer references for x86_64. I still think the use-pointers patch should be applied as it adds an optimization that is more likely to improve performance for most targets rather than to regress it.
>
> Another evidence of GCC mainline getting unlucky compiling for x86_64 is the micro-optimize-critical-path patch. The patch is obviously beneficial as it removes an operation from the critical path, yet it causes a 2% slowdown for the libosip benchmark.
>
> >>
> >> diff --git a/string/bug-strcasestr1.c b/string/bug-strcasestr1.c
> >> new file mode 100644
> >> index 0000000..16c77c3
> >> --- /dev/null
> >> +++ b/string/bug-strcasestr1.c
> >> @@ -0,0 +1,21 @@
> >
> > Please add a license disclaimer to this code.
>
> Would you please give a pointer? I don't see tests having license notices.
>
> Thank you,
>
> --
> Maxim Kuvyrkov
> CodeSourcery / Mentor Graphics
>
>
>