This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH v3] powerpc: strstr optimization

From: Steven Munroe <munroesj at linux dot vnet dot ibm dot comcom>
To: OndÅej BÃlka <neleai at seznam dot cz>
Cc: Joseph Myers <joseph at codesourcery dot com>, "Carlos O'Donell" <carlos at redhat dot com>, Tulio Magno Quites Machado Filho <tuliom at linux dot vnet dot ibm dot com>, GNU C Library <libc-alpha at sourceware dot org>, Steve Munroe <sjmunroe at us dot ibm dot com>, Rajalakshmi Srinivasaraghavan <raji at linux dot vnet dot ibm dot com>
Date: Sun, 26 Jul 2015 20:26:06 -0500
Subject: Re: [PATCH v3] powerpc: strstr optimization
Authentication-results: sourceware.org; auth=none
References: <558A5642 dot 5020107 at linux dot vnet dot ibm dot com> <558A5761 dot 2000409 at linux dot vnet dot ibm dot com> <87oajpm8nc dot fsf at totoro dot br dot ibm dot com> <871tgijuri dot fsf at linux dot vnet dot ibm dot com> <55A6FE3F dot 6090701 at redhat dot com> <55A70B70 dot 6090607 at redhat dot com> <20150716195538 dot GA5140 at domone> <55A8110C dot 7000209 at redhat dot com> <alpine dot DEB dot 2 dot 10 dot 1507221607370 dot 21570 at digraph dot polyomino dot org dot uk> <20150725082025 dot GA5043 at domone>
Reply-to: munroesj at linux dot vnet dot ibm dot com

On Sat, 2015-07-25 at 10:20 +0200, OndÅej BÃlka wrote:
> On Wed, Jul 22, 2015 at 04:12:47PM +0000, Joseph Myers wrote:
> > On Thu, 16 Jul 2015, Carlos O'Donell wrote:
> > 
> > > On 07/16/2015 03:55 PM, OndÅej BÃlka wrote:
> > > > On Wed, Jul 15, 2015 at 09:40:00PM -0400, Carlos O'Donell wrote:
> > > >> On 07/15/2015 08:43 PM, Carlos O'Donell wrote:
> > > >>>> May I proceed with this commit?
> > > >>>
> > > >>> Yes, please commit this for 2.22.
> > > >>
> > > >> For the record I trust IBM to make sure these patches make incremental
> > > >> improvements in performance even if they are not the best possible
> > > >> performance as pointed out by Ondrej Bilka.
> > > >>
> > > > Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
> > > > all. I did that as test how much we could test IBM to verify patches. 
> > > > 
> > > > I pointed out that it could have possibly quadratic behaviour which
> > > > still does. So please don't accept unreviewed patches next time.
> > > 
> > > They showed cases for which the code does go faster and objectively
> > > so using the microbenchmark, and that's a win for now. Please continue 
> > > to work with IBM to remove the quadratic worst case.
> > > 
> > > Tulio, You will need to work out why you have quadratic worst case.
> > > It's certainly something we try to avoid. Did you make a particular
> > > decision not to avoid it?
> > 
> > If there's a quadratic worst case newly introduced for 2.22, I'd consider 
> > that a security hole (denial of service) that needs to block the release 
> > of 2.22 until it's fixed (possibly by removing the implementation in 
> > question).
> > 
> No my objection was that this should be reverted as patch that wasn't
> reviewed. My main reason to believe are numerous issues that every
> competent reviewer should notice. First red flag was Tulio telling its
> ok for me, then telling that he received message there is bug in strstr.
> 

This is ridiculous. most of this personal and unprofessional attack
below is based on incorrect assumptions (of things you do not actually
know) or taken out of context and twisted to fit your narrative.

You really need to take a deep breath and think before you say anything
else.


> That doesn't inspire much confidence.
> 
> First as quadratic behaviour. I knew that its bounded by 2048 but as
> Tulio told that there is no work done I was silent. As he told that
> there is no work on that I realized that he either didn't read patch or
> forgot that he reviewed part which calls strlen, strnlen and jumps to
> generic strstr.
> 
> I and you read code and find 2048 constant weird. As reviewer I asked to 
> clarify it but author just send unrelated benchmarks, never one that
> would convince us. It is as on benchmark that I previously send on thread 
> I show that its 100 times slower for that needle as I claimed. 
> If Tulio was familiar with code he would suggest to decrease that constant to 32.
> 
> As quadratic its mostly shortcut as I don't want to repeat it. Its
> beyond my understanding when one believes that in practice O(mn) algorithm 
> with m=2048 could beat O(n) one. I raised that in all versions of
> strstr/strcasestr.
> 
> Thats just simple issue, I raised that months ago so they had enough
> time.
> 
> Then as matter of performance in general case I found several obvious
> problems so why Tulio didn't? First is that strlen/strnlen is called
> unnecessarily when first strchr call returns null. Thats hot path.
> Second is that crosspage check is unnecessary as you could search for
> ending character of needle.
> 
> These are arguments why I don't think it was reviewed.
> 
> Joseph, what is policy about assembly implementations when there is
> better generic implementation? This assembly is slower than when do
> simple optimization of C strstr, with patch. I repeatedly asked to
> check it but never got reply. 
> 
> [PATCH] Improve generic strstr performance.
> 
> I submitted C implementation that beats assembly strstr previously in
> thread.
> 
> To answer objections that benchmarks are inaccurate as strings in
> practice are not random strings I used dryrun as I said earlier. I
> recorded strstr calls mainly in gdb program.
> 
> I saved profile trace here for replicability.
> http://kam.mff.cuni.cz/~ondra/dryrunstrstrpower.tar.bz2
> 
> With assembly I get following:
> 
> [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> 
> 
> replaying gdb took 26567826
>  took 26699020
> [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> 
> 
> replaying gdb took 26608298
>  took 26698638
> [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> 
> 
> replaying gdb took 26752268
>  took 26876982
> 
> 
> replaying bash took 25841770
> 
> But when I replace that with generic improvement it improves perfromance
> in practice.
> 
> [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> 
> 
> replaying gdb took 25096738
>  took 25187742
> [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> 
> 
> replaying gdb took 25182840
>  took 25255354
> [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> 
> 
> replaying gdb took 25060592
>  took 25165898
> 
> 
> 
> 
> Main reason that I explained before is that it goes in considerable
> lengths to optimizes cold path while not optimizing hot one. 
> In random benchmark probability that you match k characters from needle 
> decreases exponentially (Similar behaviour is observed in practice.) 
> 
> This patch for needles less than 8 bytes does rougthly following, larger
> needles have more complicated code to check 8 bytes of needle at time.
> 
> nsize = strlen(needle)
> if (strnlen(haystack,nsize) < nsize)
>   return NULL;
> 
> if (nsize <= 9)
>   {
>     needle_start = *((uint64_t*)needle+1);
>     mask = (~0) >> (9 - nsize);
>     while (1)
>       {
>         haystack = strchr (haystack, needle[0]);
>         if (*((uint64_t*)haystack) & mask == needle_start) 
>           return haystack;
>         haystack++;
>       }
> 
> Problem is that optimizations like 
> 
>    if (*((uint64_t*)(haystack+1)) & mask == needle_start) 
>           return haystack;
> 
> Do more harm than good. Using simple byte check is faster in practice as
> probably mismatch occurs in second character and overhead of testing
> more characters is bigger than branch misprediction penalty. When I ran
> profile trace then average number of digraphs and trigraphs is
> following:
> 
> digraph count:  0.0126 trigraph count:  0.0024
> 
> On other hand it does nothing with hot strchr calls. If performance is
> concern these should be at least inlined, or iterate over first
> character mask.
> 
> I am only telling from my experience what optimizations will help.
> Optimizations in this patch will not.
>

Follow-Ups:
- Re: [PATCH v3] powerpc: strstr optimization
  - From: OndÅej BÃlka

References:
- Re: [PATCH v3] powerpc: strstr optimization
  - From: Tulio Magno Quites Machado Filho
- Re: [PATCH v3] powerpc: strstr optimization
  - From: Tulio Magno Quites Machado Filho
- Re: [PATCH v3] powerpc: strstr optimization
  - From: Carlos O'Donell
- Re: [PATCH v3] powerpc: strstr optimization
  - From: Carlos O'Donell
- Re: [PATCH v3] powerpc: strstr optimization
  - From: OndÅej BÃlka
- Re: [PATCH v3] powerpc: strstr optimization
  - From: Carlos O'Donell
- Re: [PATCH v3] powerpc: strstr optimization
  - From: Joseph Myers
- Re: [PATCH v3] powerpc: strstr optimization
  - From: OndÅej BÃlka

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]