[PATCH] powerpc: strcasestr optimization

Ondřej Bílka neleai@seznam.cz
Mon Jun 1 17:47:00 GMT 2015


On Mon, Jun 01, 2015 at 09:36:47AM -0500, Steven Munroe wrote:
> On Mon, 2015-06-01 at 14:28 +0200, Ondřej Bílka wrote:
> > On Mon, Jun 01, 2015 at 04:11:28PM +0530, Rajalakshmi Srinivasaraghavan wrote:
> > > 
> > > This patch optimizes strcasestr function for power >= 7 systems.
> > > This patch uses optimized strlen and strnlen for calculating
> > > string length and the average improvement of this optimization is ~40%.
> > > This patch is tested on powerpc64 and powerpc64le.
> > > Attached the benchresults with this new patch.
> > > 
> > Thats not enough. As strcasestr that I submited is around three times
> > slower your implementation would likely be regression over generic one.
> > 
> > A problem here is that you use moronic algorithm. Fix algorithm first
> > before trying to optimize it.
> > 
> 
> This is not very helpful. You are demanding changes without clear
> explanation and justification.
> 
> What is wrong with Raja's algorithm? What is insufficient in the
> benchmark data she has provided? And why do you think your specific
> design applies to PowerISA and POWER7/POWER8 micro-architecture.
> 
> What data do you have that justified this objection?

I replied on strstr patch thread on why what she submitted is
performance regression. So I will repeat arguments from other thread
which still apply.

First was problem with quadratic behaviour. She tried to fix it but it
isn't a fix at all. Just benchmark 

strcasestr ("aaa...(4000 times)...aaa", "aaa...(2000 times)...aab")

That call would take around hundred times than before which is
unacceptable.

If we ignore that red flag second problem was that benchmark she used is
bogus. It test with periodic haystacks, needle is copy of first bytes of
haystack with last byte set to something else.

1234567890abcxyz1234567890abcxyz1234567890abcxyz...

These are far from real inputs. Periodicity could introduce artefacts in
algorithms (like that its worst case for my digraph search which relies
on fact that in most strings characters are relatively independent.)
My worst fear that some implementation would be ten times faster than in
practice. There could be branch that is normally unpredictable. But as
input is regular it could be alternatively taken and not taken which cpu
recognizes and will predict it perfectly.

This isn't main problem which is that result of benchmark depends on
parameter and each submitter could make his benchmark best by selecting
appropriate value of that parameter. This could apply by selecting
period to string n or if we fix previous paragraph by using random
strings by selecting probability of character.

Performance would differ by order of magnitude depending on
choice of that n. A simple comparison for strstr would be compare
two-way algorithm with naive algorithm that looks like

while (s = strchr(s, n[0]))
  check_rest

If you set n=2 or random strings with only a or b then strchr would
cause big overhead per call and it could be five times slower than
two-way algorithm whose performance is mostly constant.

On other hand with n=255 or using all values a strchr would quickly skip
256 bytes on average and be say 5 times faster than two-way algorithm.

If you draw performance characteristic they meet at same point where
they run at same speed. If we assume that random strings model real ones
well how can you be sure that you selected n constiderably less than
encountered in real programs and it shows wrong implementation as best?

Just use same patch like I send with ((unsigned) rand())%16 + 1 and you
will see completely different numbers in benchmark.

I didn't touched that benchmarks couldn't produce a correct result due
additional factors, like that in practice most inputs are small so a
branch prediction in implementation header matters a lot. But this
benchmark doesn't measure it as it calls functions with same arguments
over and over again which makes these branches.

That briefly covers what was wrong with benchmark.

As microarchitecture one doesn't need to know details to see that
something is fundamentally wrong. A algorithm here is essentially
byte-by-byte check which essentially does following

uint64_t haystack_window, needle_window;
while (!end)
  {
    haystack_window = haystack_window << 8 | tolower (haystack[i])
    if (haystack_window == needle_window)
      match_rest
    i++;
  }

A problem of that is that operations there are useless most of time. For
original benchmark 15 times out of 16 you don't have to use shifts and
other instructions to compare needle with haystack. At that position
they differ on first byte. So her algorithm is worse than naive doing
following with same optimizations as in previous case.

while (!end)              
  {                       
    if (tolower(haystack[i]) ==  tolower(needle[0]))
      match_rest
    i++;
  }

As that was covered my final comment was that it isn't substantial
speedup. My implemantation is generic and gives large boost 
so why should be powerpc different.

As I don't have powerpc access now apply my patches

[PATCH v5] Generic string skeleton
[PATCH v5 4*] Generic string search functions (strstr, strcasestr, memmem)

and run (preferably fixed) benchmark with these. As gains that I see on
x64 are bigger than ones gained by this assembly you will likely see
that generic implementation is indeed better and it would be pointless
to try review that only to remove it shortly after adding to improve
performance.



More information about the Libc-alpha mailing list