This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Optimize strstr, strcasestr and memmem


On Wed, May 30, 2012 at 08:28:40PM +1200, Maxim Kuvyrkov wrote:
> On 22/05/2012, at 7:49 AM, Carlos O'Donell wrote:
> 
> > On Thu, May 17, 2012 at 5:22 PM, Maxim Kuvyrkov <maxim@codesourcery.com> wrote:
> >> This patch optimizes strstr, strcasestr and memmem functions.  This patch speeds up strstr, strcase and memmem functions with short needle inputs by more than 2 times on i686, x86_64, MIPS and other architectures.
> 
> Carlos,
> 
> I appreciate you reviewing the code.  I've reworked the patches and will post them in separate sub-threads.  Below are comments and notes on the change as a whole.
> 
> > 
> > How did you measure this e.g. what benchmark code did you use?
> 
> Two benchmarks: one is based on libosip (library for parsing SIP messages) and the other one is the testcase from http://sourceware.org/bugzilla/show_bug.cgi?id=11607 .  Both benchmark exercise short-needle path of strstr.
> 
I also tried my improvements on bug11607 and my code is again twice as fast than sse4.2, 6 times than patch on core2,i7,xeon.
Timing is below.

 
strstr_glibc.c
20.16user 0.00system 0:20.18elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+555minor)pagefaults 0swaps
strstr_maxim.c
6.16user 0.00system 0:06.16elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+555minor)pagefaults 0swaps
strstr_sse2_tuned.c
0.91user 0.00system 0:00.92elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+554minor)pagefaults 0swaps
strstr_ssse3_two_way.c
1.36user 0.00system 0:01.36elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+554minor)pagefaults 0swaps
strstr_strlen.c
0.63user 0.00system 0:00.64elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+554minor)pagefaults 0swaps
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Core(TM)2 Duo CPU     E7200  @ 2.53GHz
stepping	: 6
cpu MHz		: 2527.000
cache size	: 3072 KB


strstr_glibc.c
1.45user 0.00system 0:01.46elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+554minor)pagefaults 0swaps
strstr_maxim.c
5.41user 0.00system 0:05.42elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+555minor)pagefaults 0swaps
strstr_sse2_tuned.c
0.73user 0.00system 0:00.74elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+554minor)pagefaults 0swaps
strstr_ssse3_two_way.c
1.17user 0.00system 0:01.17elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+554minor)pagefaults 0swaps
strstr_strlen.c
0.68user 0.00system 0:00.68elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+555minor)pagefaults 0swaps
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz
stepping	: 4
microcode	: 0x6
cpu MHz		: 1596.000
cache size	: 8192 KB



strstr_glibc.c
26.91user 0.00system 0:26.93elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+553minor)pagefaults 0swaps
strstr_maxim.c
7.82user 0.00system 0:07.82elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+553minor)pagefaults 0swaps
strstr_sse2_tuned.c
1.17user 0.00system 0:01.17elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+552minor)pagefaults 0swaps
strstr_ssse3_two_way.c
1.72user 0.00system 0:01.73elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+552minor)pagefaults 0swaps
strstr_strlen.c
0.78user 0.00system 0:00.78elapsed 99%CPU (0avgtext+0avgdata 7536maxresident)k
0inputs+0outputs (0major+552minor)pagefaults 0swaps
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 15
model name	: Intel(R) Xeon(R) CPU            5130  @ 2.00GHz
stepping	: 6
microcode	: 0xcd
cpu MHz		: 1995.378
cache size	: 4096 KB

> > 
> > On which machines?
> 
> On Core2 (both -m32 and -m64) and MIPS (n32, n64 and o32).
> 
> > 
> > vs. SSE42 variants for x86/x86_64 multiarch?
> 
> SSE42 implementation is still faster than the generic with the improvements on the two benchmark I tested.
> 
> > 
> > What impact does this have on cases with long needles? If it's a win
> > everywhere I'll be excited, but I figure some of this code has to be
> > on the hot-path for long needles?
> 
> Judging from the -25% - +25% and -3% - +7% improvement/regression from the use-pointers patch (see below) there should be an improvement/regression to long-needle case as well, as the patch is symmetrical for both short- and long-needle cases.  No other patch touches long needles.
> 
> > 
> > I'm currently operating under the naive assumption that all input
> > scenarios are equally important (I don't have data to backup anything
> > else). Is this a valid assumption? For example short needles aren't
> > more important than long-needles, so I will find it hard to accept a
> > patch that slows down long-needle searches... at least not without
> > showing that the mean speed across all uses cases went up.
> 
> The patches don't slow-down long needles.
> 
> > 
> >> GLIBC 2.9 introduced new, algorithmically-superior implementation of strstr, strcasestr and memmem functions. Unfortunately, this new implementation uses memchr to detect end-of-string condition which comes at significant overhead compared to piggy-backing matching procedure that GLIBC 2.8 and earlier versions used.  The new implementation heavily regressed the case for short needles, making strstr more than 2 times slower.  This patch cures the regression and even makes the GLIBC 2.9+ implementation faster than original GLIBC 2.8- version.
> > 
> > Do you have data to back up the statement about the performance
> > regression for short needles from 2.8 to 2.9? I know that it's likely
> > true, but if I have to defend your position to our users I need data.
> > 
> >> This patch adds several optimizations:
> >> 
> >> - Piggy-back the matching procedure to detect end-of-string for strstr and strcasestr when needle is short.  This removes the need to use memchr in two_way_short_needle.  [The long-needle case hops through parts of the string, and the same optimization cannot be used there as not every haystack character is read in by the matching loop.]
> >> 
> >> - Optimize search for the first character in strstr, strcasestr and memmem functions.  This is the hot-spot of the functions.
> > 
> > These two changes look good, but I'd like someone else to look them
> > over, my preference is to have Ryan Arnold from the IBM team look it
> > over to ensure that Power also has some benefit here. Please ask him
> > to review this.
> > 
> >> - Use pointer references instead of array[index] references.  This optimization makes it easier for compiler to generate good code, especially on architectures that don't have base+index addressing.
> > 
> > Please split this into another patch.
> > 
> > We should evaluate this patch on its own merits.
> > 
> >> - Use straight tolower() for strcasestr (complexity of isupper() is about the same as tolower()).
> > 
> > Isn't this another micro-optimization that could be broken out into
> > another patch?
> 
> I've split the patch into 4 logical pieces, I will post them in separate sub-threads.  [I know, I should've done it in the first place.]
> 
> Benchmark results for Core2 (i.e., no SSE42):
> 
> BZ11607 testcase; seconds, smaller is better (Core2 -m32):
> 
> baseline			64.82
> use-optimized-first-character-	51.39
> detect-eol-on-the-fly-in-strst	24.14
> use-pointers-for-traversing-ar	19.40
> micro-optimize-critical-path	19.45
> 
> BZ11607 testcase; seconds, smaller is better (Core2 -m64):
> use-__syscall_slong_t-in-bits-	62.59
> use-optimized-first-character-	51.37
> detect-eol-on-the-fly-in-strst	11.25
> use-pointers-for-traversing-ar	14.06
> micro-optimize-critical-path	13.70
> 
> Judging from the bug report, GLIBC 2.7's strstr should be at about 8 seconds, but I don't have GLIBC 2.7 handy to test that.  When I implemented the above optimizations for MIPS I saw "new" strstr with patches performing faster than GLIBC 2.7 implementation.  I will close the bug if don't hear to the contrary.
> 
> Libosip benchmark; messages/sec, bigger is better (Core2 -m32)
> baseline			51524
> use-optimized-first-character-	55062	+6.9%
> detect-eol-on-the-fly-in-strst	65743	+19.4%
> use-pointers-for-traversing-ar	70281	+6.9%
> micro-optimize-critical-path	70309	+0.0%
> 
> Libosip benchmark; messages/sec, bigger is better (Core2 -m64)
> baseline			58081
> use-optimized-first-character-	62456	+7.5%
> detect-eol-on-the-fly-in-strst	85983	+37.7%
> use-pointers-for-traversing-ar	83326	-3.1%
> micro-optimize-critical-path	81377	-2.3%
> 
> The benchmarks show that use-pointer pointer patch improves 32-bit x86 by 7% - 25% and regresses 64-bit x86 by the same amount.  The improvement for 32-bit x86 comes most likely from decreased register pressure inside the loops -- that's what I saw for MIPS.  It seems that current GCC mainline gets unlucky with compiling pointer references for x86_64.  I still think the use-pointers patch should be applied as it adds an optimization that is more likely to improve performance for most targets rather than to regress it.
> 
> Another evidence of GCC mainline getting unlucky compiling for x86_64 is the micro-optimize-critical-path patch.  The patch is obviously beneficial as it removes an operation from the critical path, yet it causes a 2% slowdown for the libosip benchmark.
> 
> >> 
> >> diff --git a/string/bug-strcasestr1.c b/string/bug-strcasestr1.c
> >> new file mode 100644
> >> index 0000000..16c77c3
> >> --- /dev/null
> >> +++ b/string/bug-strcasestr1.c
> >> @@ -0,0 +1,21 @@
> > 
> > Please add a license disclaimer to this code.
> 
> Would you please give a pointer?  I don't see tests having license notices.
> 
> Thank you,
> 
> --
> Maxim Kuvyrkov
> CodeSourcery / Mentor Graphics
> 
> 
> 


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]