This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH] Optimize strstr, strcasestr and memmem

From: Maxim Kuvyrkov <maxim at codesourcery dot com>
To: Carlos O'Donell <carlos at systemhalted dot org>
Cc: GLIBC Devel <libc-alpha at sourceware dot org>, Ryan Arnold <rsa at us dot ibm dot com>,Eric Blake <eblake at redhat dot com>
Date: Wed, 30 May 2012 20:28:40 +1200
Subject: Re: [PATCH] Optimize strstr, strcasestr and memmem
References: <2C516CF2-D083-4C1D-AD27-6A31D381D548@codesourcery.com> <CADZpyiwSMzwbMwwmLG0Sz8RAK8ww_LH_RSt2Nd3-F8+BUDYT-w@mail.gmail.com>

On 22/05/2012, at 7:49 AM, Carlos O'Donell wrote:

> On Thu, May 17, 2012 at 5:22 PM, Maxim Kuvyrkov <maxim@codesourcery.com> wrote:
>> This patch optimizes strstr, strcasestr and memmem functions.  This patch speeds up strstr, strcase and memmem functions with short needle inputs by more than 2 times on i686, x86_64, MIPS and other architectures.

Carlos,

I appreciate you reviewing the code.  I've reworked the patches and will post them in separate sub-threads.  Below are comments and notes on the change as a whole.

> 
> How did you measure this e.g. what benchmark code did you use?

Two benchmarks: one is based on libosip (library for parsing SIP messages) and the other one is the testcase from http://sourceware.org/bugzilla/show_bug.cgi?id=11607 .  Both benchmark exercise short-needle path of strstr.

> 
> On which machines?

On Core2 (both -m32 and -m64) and MIPS (n32, n64 and o32).

> 
> vs. SSE42 variants for x86/x86_64 multiarch?

SSE42 implementation is still faster than the generic with the improvements on the two benchmark I tested.

> 
> What impact does this have on cases with long needles? If it's a win
> everywhere I'll be excited, but I figure some of this code has to be
> on the hot-path for long needles?

Judging from the -25% - +25% and -3% - +7% improvement/regression from the use-pointers patch (see below) there should be an improvement/regression to long-needle case as well, as the patch is symmetrical for both short- and long-needle cases.  No other patch touches long needles.

> 
> I'm currently operating under the naive assumption that all input
> scenarios are equally important (I don't have data to backup anything
> else). Is this a valid assumption? For example short needles aren't
> more important than long-needles, so I will find it hard to accept a
> patch that slows down long-needle searches... at least not without
> showing that the mean speed across all uses cases went up.

The patches don't slow-down long needles.

> 
>> GLIBC 2.9 introduced new, algorithmically-superior implementation of strstr, strcasestr and memmem functions. Unfortunately, this new implementation uses memchr to detect end-of-string condition which comes at significant overhead compared to piggy-backing matching procedure that GLIBC 2.8 and earlier versions used.  The new implementation heavily regressed the case for short needles, making strstr more than 2 times slower.  This patch cures the regression and even makes the GLIBC 2.9+ implementation faster than original GLIBC 2.8- version.
> 
> Do you have data to back up the statement about the performance
> regression for short needles from 2.8 to 2.9? I know that it's likely
> true, but if I have to defend your position to our users I need data.
> 
>> This patch adds several optimizations:
>> 
>> - Piggy-back the matching procedure to detect end-of-string for strstr and strcasestr when needle is short.  This removes the need to use memchr in two_way_short_needle.  [The long-needle case hops through parts of the string, and the same optimization cannot be used there as not every haystack character is read in by the matching loop.]
>> 
>> - Optimize search for the first character in strstr, strcasestr and memmem functions.  This is the hot-spot of the functions.
> 
> These two changes look good, but I'd like someone else to look them
> over, my preference is to have Ryan Arnold from the IBM team look it
> over to ensure that Power also has some benefit here. Please ask him
> to review this.
> 
>> - Use pointer references instead of array[index] references.  This optimization makes it easier for compiler to generate good code, especially on architectures that don't have base+index addressing.
> 
> Please split this into another patch.
> 
> We should evaluate this patch on its own merits.
> 
>> - Use straight tolower() for strcasestr (complexity of isupper() is about the same as tolower()).
> 
> Isn't this another micro-optimization that could be broken out into
> another patch?

I've split the patch into 4 logical pieces, I will post them in separate sub-threads.  [I know, I should've done it in the first place.]

Benchmark results for Core2 (i.e., no SSE42):

BZ11607 testcase; seconds, smaller is better (Core2 -m32):

baseline			64.82
use-optimized-first-character-	51.39
detect-eol-on-the-fly-in-strst	24.14
use-pointers-for-traversing-ar	19.40
micro-optimize-critical-path	19.45

BZ11607 testcase; seconds, smaller is better (Core2 -m64):
use-__syscall_slong_t-in-bits-	62.59
use-optimized-first-character-	51.37
detect-eol-on-the-fly-in-strst	11.25
use-pointers-for-traversing-ar	14.06
micro-optimize-critical-path	13.70

Judging from the bug report, GLIBC 2.7's strstr should be at about 8 seconds, but I don't have GLIBC 2.7 handy to test that.  When I implemented the above optimizations for MIPS I saw "new" strstr with patches performing faster than GLIBC 2.7 implementation.  I will close the bug if don't hear to the contrary.

Libosip benchmark; messages/sec, bigger is better (Core2 -m32)
baseline			51524
use-optimized-first-character-	55062	+6.9%
detect-eol-on-the-fly-in-strst	65743	+19.4%
use-pointers-for-traversing-ar	70281	+6.9%
micro-optimize-critical-path	70309	+0.0%

Libosip benchmark; messages/sec, bigger is better (Core2 -m64)
baseline			58081
use-optimized-first-character-	62456	+7.5%
detect-eol-on-the-fly-in-strst	85983	+37.7%
use-pointers-for-traversing-ar	83326	-3.1%
micro-optimize-critical-path	81377	-2.3%

The benchmarks show that use-pointer pointer patch improves 32-bit x86 by 7% - 25% and regresses 64-bit x86 by the same amount.  The improvement for 32-bit x86 comes most likely from decreased register pressure inside the loops -- that's what I saw for MIPS.  It seems that current GCC mainline gets unlucky with compiling pointer references for x86_64.  I still think the use-pointers patch should be applied as it adds an optimization that is more likely to improve performance for most targets rather than to regress it.

Another evidence of GCC mainline getting unlucky compiling for x86_64 is the micro-optimize-critical-path patch.  The patch is obviously beneficial as it removes an operation from the critical path, yet it causes a 2% slowdown for the libosip benchmark.

>> 
>> diff --git a/string/bug-strcasestr1.c b/string/bug-strcasestr1.c
>> new file mode 100644
>> index 0000000..16c77c3
>> --- /dev/null
>> +++ b/string/bug-strcasestr1.c
>> @@ -0,0 +1,21 @@
> 
> Please add a license disclaimer to this code.

Would you please give a pointer?  I don't see tests having license notices.

Thank you,

--
Maxim Kuvyrkov
CodeSourcery / Mentor Graphics

References:
- [PATCH] Optimize strstr, strcasestr and memmem
  - From: Maxim Kuvyrkov
- Re: [PATCH] Optimize strstr, strcasestr and memmem
  - From: Carlos O'Donell

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]