This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH v3] powerpc: strstr optimization


On Sun, Jul 26, 2015 at 08:26:06PM -0500, Steven Munroe wrote:
> On Sat, 2015-07-25 at 10:20 +0200, OndÅej BÃlka wrote:
> > No my objection was that this should be reverted as patch that wasn't
> > reviewed. My main reason to believe are numerous issues that every
> > competent reviewer should notice. First red flag was Tulio telling its
> > ok for me, then telling that he received message there is bug in strstr.
> > 
> 
> This is ridiculous. most of this personal and unprofessional attack
> below is based on incorrect assumptions (of things you do not actually
> know) or taken out of context and twisted to fit your narrative.
> 
> You really need to take a deep breath and think before you say anything
> else.
>

Could you stop doing ad hominem attacks please? I noticed that you
resort to these when I show that facts speak otherwise.

As this being out of context I expected that you will of course deny
everything and say something like that. So could you answer following
three questions?

1. Reviews, it would be better to always write reviews publictly. But as
you still have them could you send reviews here to clarify.

2. Quadratic behaviour. It should be clear for any expert that with 2047
byte needle you need check all these bytes for each byte in haystack. If
you optimisticaly check 8 bytes per cycle then it would take 256 cycles
times haystack size. In practice its slower as its equivalent to strcmp
which on power7 has following timing:
Length 1024, alignment 14/ 7:   593.5   651.016 318.844 447.203

So why you didn't insist to fix that by changing constant to something
reasonable?

3. Why despite your frequent reviews a implementation is in average case
still slower than simple bugfix of generic strstr.

As for that I do incorrect assumptions and take things out of context
its quite easy to make vague accussations like this. So be more
specific, what assumptions and where is your evidence that assumption 
is incorrect?

 
> 
> > That doesn't inspire much confidence.
> > 
> > First as quadratic behaviour. I knew that its bounded by 2048 but as
> > Tulio told that there is no work done I was silent. As he told that
> > there is no work on that I realized that he either didn't read patch or
> > forgot that he reviewed part which calls strlen, strnlen and jumps to
> > generic strstr.
> > 
> > I and you read code and find 2048 constant weird. As reviewer I asked to 
> > clarify it but author just send unrelated benchmarks, never one that
> > would convince us. It is as on benchmark that I previously send on thread 
> > I show that its 100 times slower for that needle as I claimed. 
> > If Tulio was familiar with code he would suggest to decrease that constant to 32.
> > 
> > As quadratic its mostly shortcut as I don't want to repeat it. Its
> > beyond my understanding when one believes that in practice O(mn) algorithm 
> > with m=2048 could beat O(n) one. I raised that in all versions of
> > strstr/strcasestr.
> > 
> > Thats just simple issue, I raised that months ago so they had enough
> > time.
> > 
> > Then as matter of performance in general case I found several obvious
> > problems so why Tulio didn't? First is that strlen/strnlen is called
> > unnecessarily when first strchr call returns null. Thats hot path.
> > Second is that crosspage check is unnecessary as you could search for
> > ending character of needle.
> > 
> > These are arguments why I don't think it was reviewed.
> > 
> > Joseph, what is policy about assembly implementations when there is
> > better generic implementation? This assembly is slower than when do
> > simple optimization of C strstr, with patch. I repeatedly asked to
> > check it but never got reply. 
> > 
> > [PATCH] Improve generic strstr performance.
> > 
> > I submitted C implementation that beats assembly strstr previously in
> > thread.
> > 
> > To answer objections that benchmarks are inaccurate as strings in
> > practice are not random strings I used dryrun as I said earlier. I
> > recorded strstr calls mainly in gdb program.
> > 
> > I saved profile trace here for replicability.
> > http://kam.mff.cuni.cz/~ondra/dryrunstrstrpower.tar.bz2
> > 
> > With assembly I get following:
> > 
> > [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> > 
> > 
> > replaying gdb took 26567826
> >  took 26699020
> > [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> > 
> > 
> > replaying gdb took 26608298
> >  took 26698638
> > [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> > 
> > 
> > replaying gdb took 26752268
> >  took 26876982
> > 
> > 
> > replaying bash took 25841770
> > 
> > But when I replace that with generic improvement it improves perfromance
> > in practice.
> > 
> > [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> > 
> > 
> > replaying gdb took 25096738
> >  took 25187742
> > [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> > 
> > 
> > replaying gdb took 25182840
> >  took 25255354
> > [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> > 
> > 
> > replaying gdb took 25060592
> >  took 25165898
> > 
> > 
> > 
> > 
> > Main reason that I explained before is that it goes in considerable
> > lengths to optimizes cold path while not optimizing hot one. 
> > In random benchmark probability that you match k characters from needle 
> > decreases exponentially (Similar behaviour is observed in practice.) 
> > 
> > This patch for needles less than 8 bytes does rougthly following, larger
> > needles have more complicated code to check 8 bytes of needle at time.
> > 
> > nsize = strlen(needle)
> > if (strnlen(haystack,nsize) < nsize)
> >   return NULL;
> > 
> > if (nsize <= 9)
> >   {
> >     needle_start = *((uint64_t*)needle+1);
> >     mask = (~0) >> (9 - nsize);
> >     while (1)
> >       {
> >         haystack = strchr (haystack, needle[0]);
> >         if (*((uint64_t*)haystack) & mask == needle_start) 
> >           return haystack;
> >         haystack++;
> >       }
> > 
> > Problem is that optimizations like 
> > 
> >    if (*((uint64_t*)(haystack+1)) & mask == needle_start) 
> >           return haystack;
> > 
> > Do more harm than good. Using simple byte check is faster in practice as
> > probably mismatch occurs in second character and overhead of testing
> > more characters is bigger than branch misprediction penalty. When I ran
> > profile trace then average number of digraphs and trigraphs is
> > following:
> > 
> > digraph count:  0.0126 trigraph count:  0.0024
> > 
> > On other hand it does nothing with hot strchr calls. If performance is
> > concern these should be at least inlined, or iterate over first
> > character mask.
> > 
> > I am only telling from my experience what optimizations will help.
> > Optimizations in this patch will not.
> > 


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]