This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH v3] powerpc: strstr optimization


On Wed, Jul 22, 2015 at 04:12:47PM +0000, Joseph Myers wrote:
> On Thu, 16 Jul 2015, Carlos O'Donell wrote:
> 
> > > On 07/16/2015 03:55 PM, Ondřej Bílka wrote:
> > > On Wed, Jul 15, 2015 at 09:40:00PM -0400, Carlos O'Donell wrote:
> > >> On 07/15/2015 08:43 PM, Carlos O'Donell wrote:
> > >>>> May I proceed with this commit?
> > >>>
> > >>> Yes, please commit this for 2.22.
> > >>
> > >> For the record I trust IBM to make sure these patches make incremental
> > >> improvements in performance even if they are not the best possible
> > >> performance as pointed out by Ondrej Bilka.
> > >>
> > > Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
> > > all. I did that as a test of how much we could trust IBM to verify patches.
> > > 
> > > I pointed out that it could have quadratic behaviour, which it
> > > still does. So please don't accept unreviewed patches next time.
> > 
> > They showed cases for which the code does go faster and objectively
> > so using the microbenchmark, and that's a win for now. Please continue 
> > to work with IBM to remove the quadratic worst case.
> > 
> > Tulio, You will need to work out why you have quadratic worst case.
> > It's certainly something we try to avoid. Did you make a particular
> > decision not to avoid it?
> 
> If there's a quadratic worst case newly introduced for 2.22, I'd consider 
> that a security hole (denial of service) that needs to block the release 
> of 2.22 until it's fixed (possibly by removing the implementation in 
> question).
> 
No, my objection was that this should be reverted as a patch that wasn't
reviewed. My main reason for believing that is the numerous issues that any
competent reviewer should have noticed. The first red flag was Tulio saying
it looked OK to him, then saying that he had received a message that there
is a bug in strstr.

That doesn't inspire much confidence.

First, the quadratic behaviour. I knew that it is bounded by the 2048
constant, so I stayed silent at first. But when Tulio said that no work
had been done on it, I realized that he either hadn't read the patch or
had forgotten that he reviewed the part which calls strlen and strnlen
and jumps to the generic strstr.

You and I both read the code and found the 2048 constant weird. As a
reviewer I asked to clarify it, but the author just sent unrelated
benchmarks, never one that would convince us. On the benchmark that I sent
previously in this thread, I showed that it is 100 times slower for such a
needle, as I claimed. If Tulio were familiar with the code, he would have
suggested decreasing that constant to 32.

As for the quadratic behaviour, this is mostly a shortcut, as I don't want
to repeat myself. It is beyond my understanding how one can believe that in
practice an O(mn) algorithm with m = 2048 could beat an O(n) one. I raised
this for all versions of strstr/strcasestr.

That is just the simple issue; I raised it months ago, so they had enough
time.

Then, as a matter of performance in the general case, I found several
obvious problems, so why didn't Tulio? First, strlen/strnlen is called
unnecessarily when the first strchr call returns NULL. That is the hot
path. Second, the crosspage check is unnecessary, as you could search for
the ending character of the needle instead.
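The first point can be sketched in a few lines. This is my own illustration
of the suggested ordering, not the submitted patch: calling strchr before
any length computation means a haystack that never contains needle[0] pays
nothing for strlen/strnlen. The function name `strstr_hotpath` and the
naive strncmp fallback are assumptions for the sketch.

```c
#include <string.h>

/* Illustrative sketch: do the cheap strchr rejection first, and only
   pay for strlen/strnlen once a candidate position exists.  */
static char *
strstr_hotpath (const char *haystack, const char *needle)
{
  if (needle[0] == '\0')
    return (char *) haystack;

  /* Hot path: a miss on the first character returns immediately,
     without any length computation.  */
  haystack = strchr (haystack, needle[0]);
  if (haystack == NULL)
    return NULL;

  /* Cold path: now compute lengths and reject short haystacks.  */
  size_t nsize = strlen (needle);
  if (strnlen (haystack, nsize) < nsize)
    return NULL;

  /* Naive fallback scan, standing in for the real search loop.  */
  for (; haystack != NULL; haystack = strchr (haystack + 1, needle[0]))
    if (strncmp (haystack, needle, nsize) == 0)
      return (char *) haystack;
  return NULL;
}
```

The point is only the ordering of the calls: the patch under discussion
does the strlen/strnlen work up front, on every call, match or not.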

These are my arguments for why I don't think it was reviewed.

Joseph, what is the policy on assembly implementations when there is a
better generic implementation? This assembly is slower than a simple
optimization of the C strstr, done in the patch below. I repeatedly asked
for it to be checked but never got a reply.

[PATCH] Improve generic strstr performance.

I previously submitted in this thread a C implementation that beats the
assembly strstr.

To answer the objection that benchmarks are inaccurate because strings in
practice are not random, I used dryrun, as I said earlier. I recorded
strstr calls, mainly in the gdb program.

I saved the profile trace here for replicability:
http://kam.mff.cuni.cz/~ondra/dryrunstrstrpower.tar.bz2

With the assembly implementation I get the following:

[neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u


replaying gdb took 26567826
 took 26699020
[neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u


replaying gdb took 26608298
 took 26698638
[neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u


replaying gdb took 26752268
 took 26876982


replaying bash took 25841770

But when I replace it with the generic improvement, performance in
practice improves:

[neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u


replaying gdb took 25096738
 took 25187742
[neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u


replaying gdb took 25182840
 took 25255354
[neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u


replaying gdb took 25060592
 took 25165898




The main reason, which I explained before, is that the patch goes to
considerable lengths to optimize the cold path while not optimizing the
hot one. In a random benchmark, the probability that you match k
characters of the needle decreases exponentially (similar behaviour is
observed in practice).

For needles of fewer than 8 bytes, this patch does roughly the following;
larger needles have more complicated code that checks 8 bytes of the
needle at a time.

nsize = strlen (needle);
if (strnlen (haystack, nsize) < nsize)
  return NULL;

if (nsize <= 9)
  {
    /* 8 bytes of the needle after its first character.  */
    needle_start = *(uint64_t *) (needle + 1);
    /* Keep only the nsize - 1 bytes that belong to the needle.  */
    mask = (~0ULL) >> (8 * (9 - nsize));
    while (1)
      {
        haystack = strchr (haystack, needle[0]);
        if (haystack == NULL)
          return NULL;
        if ((*(uint64_t *) (haystack + 1) & mask)
            == (needle_start & mask))
          return haystack;
        haystack++;
      }
  }

The problem is that optimizations like

   if ((*(uint64_t *) (haystack + 1) & mask) == needle_start)
          return haystack;

do more harm than good. A simple byte check is faster in practice, because
the mismatch typically occurs already at the second character, and the
overhead of testing more characters is bigger than the branch
misprediction penalty. When I ran the profile trace, the average number of
digraphs and trigraphs per call was the following:

digraph count:  0.0126 trigraph count:  0.0024

On the other hand, the patch does nothing about the hot strchr calls. If
performance is a concern, these should at least be inlined, or the loop
should iterate over a mask of first-character positions.
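By a first-character mask I mean something like the following sketch (my
own illustration, not glibc code): locate all occurrences of the first
needle byte in an 8-byte chunk with the classic SWAR zero-byte trick, then
iterate over the resulting bit mask instead of paying a strchr call per
candidate. The function names are hypothetical, and the index arithmetic
assumes little-endian byte order.

```c
#include <stdint.h>
#include <string.h>

/* Return a mask with the high bit set in each byte of the 8-byte chunk
   at P that equals C (SWAR zero-byte detection; the lowest set bit is
   always the first occurrence).  */
static uint64_t
first_char_mask (const char *p, unsigned char c)
{
  uint64_t word, x;
  memcpy (&word, p, sizeof word);             /* unaligned-safe load */
  x = word ^ (0x0101010101010101ULL * c);     /* zero byte where p[i] == c */
  return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
}

/* Index of the first occurrence of C in the 8-byte chunk, or -1.
   Assumes little-endian byte order.  */
static int
first_char_index (const char *p, unsigned char c)
{
  uint64_t mask = first_char_mask (p, c);
  return mask ? __builtin_ctzll (mask) / 8 : -1;
}
```

Iterating the remaining candidates is then a matter of clearing the lowest
bit (`mask &= mask - 1`) rather than restarting a strchr call, which is
exactly the kind of hot-path saving I am talking about.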

I am only saying, from my experience, which optimizations will help. The
optimizations in this patch will not.

