This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH] powerpc: strcasestr optimization

From: OndÅej BÃlka <neleai at seznam dot cz>
To: Rajalakshmi Srinivasaraghavan <raji at linux dot vnet dot ibm dot com>
Cc: Roland McGrath <roland at hack dot frob dot com>, munroesj at linux dot vnet dot ibm dot com, GNU C Library <libc-alpha at sourceware dot org>, Steve Munroe <sjmunroe at us dot ibm dot com>
Date: Tue, 16 Jun 2015 09:21:22 +0200
Subject: Re: [PATCH] powerpc: strcasestr optimization
Authentication-results: sourceware.org; auth=none
References: <1433169407 dot 10235 dot 5 dot camel at sjmunroe-ThinkPad-W500> <20150601162215 dot GA8955 at domone> <1433270823 dot 14958 dot 15 dot camel at sjmunroe-ThinkPad-W500> <20150602210000 dot GB20004 at domone> <1433882432 dot 21101 dot 71 dot camel at sjmunroe-ThinkPad-W500> <20150609210555 dot 3A7CA2C3BDC at topped-with-meat dot com> <1433885184 dot 21101 dot 83 dot camel at sjmunroe-ThinkPad-W500> <20150609214202 dot 5D0612C3BE3 at topped-with-meat dot com> <20150610104842 dot GA4871 at domone> <55788320 dot 5090405 at linux dot vnet dot ibm dot com>

On Thu, Jun 11, 2015 at 12:04:08AM +0530, Rajalakshmi Srinivasaraghavan wrote:
> 
> 
> On 06/10/2015 04:18 PM, OndÅej BÃlka wrote:
> >
> >  The onus for making all this happen is on the person who wants to change the code from its status quo.
> >
> >That would mean that its Rajalakshmi and responsibility to
> >write benchmark that shows its performance improvement. I stated my
> >objections explicitly described which inputs to use.
> >
> >Problem is that Steven failed to do his duty as a powerpc maintainer
> >which is to read and review change, check benchmark inputs and that
> >they are adequate. I could supply really bad implementation of strstr
> >which first checks special case that needle and haystack are periodic
> >abcdef sequences and returns result expected from benchmark.
> >
> >I did just simple request that every freshmen should know and Steven
> >should as powerpc maintainer check how its handled. Its that naive
> >algorithm and proposed algorithm are slow on inputs of form
> >
> >strstr("aaaaa....aaaa","aaaa...aaab");
> >
> >As from your previous mail its their responsibility to add that case for
> >benchmark. A 'solution' on v2 would be to switch into two-way algorithm
> >when size is bigger than 2048. That isn't solution as you still have
> >problem with size 2047. I seen somewhere comment that its security
> >problem when you have website with search function on user comments and
> >implementation internally uses strstr then attacker could just post
> >comment with aaaaa...a and search few times for aa...ab.
> >
> 2048 number is chosen by checking various length inputs and
> comparing the performance. So for 2047 case, there wont be worst
> performance.
>
What are you talking about? I am saying that aaaa haystack with aaa...ab
 needle of size 2047 is worst case for your algorithm performance. 

 
> >Roland as its really basic issue I think that its proposer
> >responsibility to check. Also for strstr thread Carlos said about strstr
> >quadratic case:
> >
> >>We should add gnulib's test to the microbenchmarks.
> >
> >
> >Then as I commented that algorithm is flawed its from my experience,
> >Raji's description was:
> >
> >For case 1, I always compare 8 characters at a time using cmpb instruction.
> >instruction.
> >-> perform basic strlen, strnlen, checks. Move haystack based on strchr result.
> >strchr result.
> >-> Load first 8 bytes(taking care of aligned load)from haystack and needle and compare them.
> >needle and compare them.
> >If they are same proceed to next 8 bytes of haystack and needle.
> >Else load next doubleword from haystack.Form a doubleword with last
> >seven characters from first load and first character from
> >second load and compare again with needle. Proceed in a similar way.
> >This gives good improvement.
> >
> Lets not mix the review comments of strstr and strcasestr.(as the
> algorithm is different)

>From webster dictionary

"Full Definition of ALGORITHM :  
a procedure for solving a mathematical problem (as of finding the
greatest common divisor) in a finite number of steps that frequently
involves repetition of an operation; broadly :  a step-by-step procedure
for solving a problem or accomplishing some end especially by a computer"

Changing techical few details like applying parallel tolower does not
make new algorithm.

Your strstr and strcasestr are really same algorithm. 

And my point still applies, as you could change generic strcasestr from

while(s=strchr(s,n[0])) 

to

if (bijective_locale)
  {
    char fn[3];
    fn[0]=tolower(n[0]);
    fn[1]=toupper(n[0]);
    fn[2]=0;
    while(s=strpbrk(s,fn)) 

and add optimized strpbrk(x,"xy") path. That as strpbrk would be at most
twice slower than strchr and match is twice likely would be faster than
what you gain and easier to implement.

> >My comment was that it wastes time most of time as in common case these
> >already differ in first byte so all these shifts were completely
> >unnecessary, a simple optimization of naive algorithm using
> >
> >while(s=strchr(s,n[0]))
> If I use while(s=strchr(s,n[0]) as mentioned in
> https://sourceware.org/ml/libc-alpha/2015-05/msg00060.html
> the quadratic testcase gives worst peformance. Attached the testcase.

Thats irrelevant testcase. I asked two things. First is show how big
regression you get on pathological inputs. So submit data when you use
your testcase with m=2047 and strstr executed 10000 times in loop with
current algorithm and with your proposed one. That is what I ask.

Second is about average case perforance. There worst case is irrelevant
as you use buy-or-rent techniques to handle it. I sumbitted following
patch with easy way to handle that.

[PATCH] Improve generic strstr performance.

its currently superseeded by generic vector bigraph search but it should
beat your strstr.

> >
> >would be faster. Steven instead of doing his duty as powerpc maintainer
> >and checking result of benchtest that I send instead tried to bullshit
> >himself by telling that powerpc architecture is too special. Then he
> >asked me do do benchmark instead as stalling tactics.
> >
> >If he did his duty of reading benchmark results there is already
> >evidence. First to notice is there are regressions in original
> >benchmark, like following:
> >
> >Length   30/2, alignment  0/ 9, found:  20.7812 32.2656 13.125  19.7188
> >Length   30/2, alignment  0/ 9, fail:   32.4688 33.9219 36.75   31.4688
> >
> I agree there are few regression cases(79 out of 13442 cases) in
> 'proposed_benchresults.txt' attached in
> https://sourceware.org/ml/libc-alpha/2015-05/msg00276.html.
> This is minimal when compared to the overall performance gain.
> 
What overall performance gain? Did you try to find application and
measure performance difference before and after? Big sizes are not very
relevant in practice. You look behind 32'th byte of haystack only in
1.5% of calls so 10% regression at these sizes is much worse that 30%
gain on 256 byte range. In my profiler a simple collection shows
following


calls 13561
success probability   3.0%
average needle size:    8.9    ns <= 0:   0.4% ns <= 4:   0.8% ns <= 8: 5.9% ns <= 16:  99.9% ns <= 24: 100.0% ns <= 32: 100.0% ns <= 48: 100.0%
ns <= 64: 100.0% 
average n:   10.6    n <= 0:   0.4% n <= 4:   1.3% n <= 8:   1.3% n <= 16:  97.0% n <= 24:  97.8% n <= 32:  98.5% n <= 48: 99.9% n <= 64:  99.9% 
s aligned to 4 bytes:  98.6%  8 bytes:  98.6% 16 bytes:  96.0% 


> >That he should comment, as strstr on average end on 10.6 byte after
> >haystack start its quite relevant.
> >
> >But if he read proposed benchmark where needle and haystack are randomly
> >generated from list of 128 characters then he would easily see it.
> I have used your proposed benchtest, and it always gives improvement
> for strcasestr.Please check the file 'newresults' in
> https://sourceware.org/ml/libc-alpha/2015-06/msg00032.html
> >
> >Here is trimmed version to see pattern. Note that first there are lot of
> >times where performance is less than 13. Thats case where strchr didn't
> >find first character and we just return(generic case now does
> >unnecessary work leading to worse numbers but thats addressed on patch that I send with strchr loop.)
> >
> >Its natural to ask what would happen if you called strchr again. In 120
> >bytes it would probably didn't found second character and we would
> >finish in 26 cycles at most not 111.5
> >
> >I expect that average loop performance would be 13.4375/135 = 0.1 cycles
> >per byte not cycle per byte that this algorithm exhibits after he loses
> >performance gains from first correct strchr call.
> >
> >                         simple_strstr   stupid_strstr   __strstr_power7
> >__strstr_ppc
> >
> >Length    4/2, alignment  0/ 3, found:  9.35938 8.09375 7.90625 8.5
> >Length   12/2, alignment  3/ 9, fail:   9.23438 16.5625 7.90625 9.20312
> >Length   24/2, alignment  3/ 9, fail:   9.54688 26.2812 8.32812 10.1875
> >Length   44/4, alignment  9/ 3, found:  13.4688 40.9219 9.59375 13.9688
> >Length   44/4, alignment  9/ 3, fail:   13.4688 40.9531 50.6875 13.9844
> >Length   52/4, alignment  3/ 9, found:  13.5312 47.2812 9.5625  14.7344
> >Length   52/4, alignment  3/ 9, fail:   45.9688 53.625  35.9688 47.5938
> >Length   75/5, alignment 15/ 9, fail:   73.75   67.9375 64.2188 75.9062
> >Length   75/5, alignment 15/15, found:  15.2188 65.9062 10.7344 17.8438
> >Length   75/5, alignment 15/15, fail:   61.5469 68.6562 50.375  63.0312
> >Length  120/8, alignment 15/ 3, fail:   113.281 111.5   103.5   114.406
> >Length  120/8, alignment 15/ 9, found:  21.25   99.1406 12.8438 28.1406
> >Length  135/9, alignment 15/15, found:  23.125  110.734 13.4375 30.7969
> >Length  135/9, alignment 15/15, fail:   147.484 113.781 110.516 165.984
> >Length  168/12, alignment  3/15, found: 31.2344 134.172 13.6719 39.9375
> >Length  168/12, alignment  3/15, fail:  170.766 137.609 131.859 179.859
> >Length  168/12, alignment  9/ 0, found: 100.562 137.922 76.6562 112.578
> >Length  168/12, alignment  9/ 0, fail:  168.703 144.078 144.828 181.203
> >Length  168/12, alignment  9/ 3, found: 129.938 137.703 109.828 136.266
> >Length  168/12, alignment  9/ 3, fail:  178.734 137.875 165.469 183.609
> >Length  465/31, alignment 15/ 0, fail:  73.0469 363.375 32.2969 87.0625
> >Length  465/31, alignment 15/ 3, found: 507.656 373.703 429.797 543.625
> >Length  465/31, alignment 15/ 3, fail:  512.922 373.188 437.047 556.844
> >Length  465/31, alignment 15/ 9, found: 443.969 379.359 392.828 464.75
> >Length  465/31, alignment 15/ 9, fail:  484.312 376.406 423.703 513.672
> >Length  465/31, alignment 15/15, found: 326.859 376     256.625 350.422
> >Length  465/31, alignment 15/15, fail:  512.641 376.203 439.234 545.906
> >Length 131071/16, alignment  0/ 0, found:       116583  108378  122125
> >127809
> >Length 131071/16, alignment  0/ 0, fail:        118298  108639  122208
> >128700
> >
> >

Follow-Ups:
- Re: [PATCH] powerpc: strcasestr optimization
  - From: Rajalakshmi Srinivasaraghavan

References:
- Re: [PATCH] powerpc: strcasestr optimization
  - From: Steven Munroe
- Re: [PATCH] powerpc: strcasestr optimization
  - From: OndÅej BÃlka
- Re: [PATCH] powerpc: strcasestr optimization
  - From: Steven Munroe
- Re: [PATCH] powerpc: strcasestr optimization
  - From: OndÅej BÃlka
- Re: [PATCH] powerpc: strcasestr optimization
  - From: Steven Munroe
- Re: [PATCH] powerpc: strcasestr optimization
  - From: Roland McGrath
- Re: [PATCH] powerpc: strcasestr optimization
  - From: Steven Munroe
- Re: [PATCH] powerpc: strcasestr optimization
  - From: Roland McGrath
- Re: [PATCH] powerpc: strcasestr optimization
  - From: OndÅej BÃlka
- Re: [PATCH] powerpc: strcasestr optimization
  - From: Rajalakshmi Srinivasaraghavan

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]