Re: [PATCH] Use Quicksearch in strstr


On Tue, Nov 06, 2018 at 06:26:57PM +0000, Wilco Dijkstra wrote:
> Rich Felker wrote:
> 
> > Real world performance includes performance under the control of a
> > hostile input source, under which naive algorithms will perform
> > hundreds of times slower. This "quicksearch" is incredibly slow on
> > a{253}ba, and it's as slow as the most naive possible strstr whenever
> > the last character is abundant in the haystack.
> 
> That's not true at all - in reality there is hardly any difference in worst case 
> performance. Comparing a{253}ba on Quicksearch vs the ba{252}b on 
> Two-way using GLIBC 2.27 (original version of Two-way) shows Two-way
> is only about 16% faster. If the worst-case performance of Two-way was
> deemed acceptable, then the worst-case of Quicksearch must be too.

I don't know what figures you're talking about, but the worst-case for
"quicksearch" with a bound of 256 bytes is easily over 100x slower
than the worst case of two-way.
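
For concreteness, the pathological pair is trivial to construct; here
is a rough driver (sizes illustrative):

#include <stdio.h>
#include <string.h>

int
main (void)
{
  /* Haystack: a long run of 'a'.  The needle's last character is
     abundant here, so the bad-character shift is always tiny.  */
  static char haystack[100000];
  memset (haystack, 'a', sizeof haystack - 1);

  /* Needle: a{253}ba.  It never matches, and every alignment forces
     ~253 compares before a shift of 1.  */
  static char needle[256];
  memset (needle, 'a', 253);
  needle[253] = 'b';
  needle[254] = 'a';

  /* A quicksearch-style strstr does on the order of
     strlen (haystack) * strlen (needle) comparisons here;
     Two-way stays linear.  */
  puts (strstr (haystack, needle) ? "found" : "not found");
  return 0;
}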

> >> good search algorithms are O(nm) and they beat any O(n) algorithm by
> >> huge factors on actual inputs. That should teach you that a concept like
> >> the O notation does not guarantee good performance.
> >
> > By a small factor at best, not a huge factor.
> 
> Getting another 2-3x speedup after doubling performance is a huge factor.
> The previous implementation was doing effectively nothing useful 75-85%
> of the time...

Then fix implementation bugs. Don't replace good algorithms with awful
ones. For any given needle, it's definitely possible to make it such
that...

> >> One comparison per byte is slow indeed. Modern search algorithms
> >> are sublinear and achieve O(n/m) comparisons. You can see this when
> >> benchmarking - performance actually increases with longer needles.
> >
> > Two-way of course achieves this sublinear performance for common
> > cases, but it's only possible for a very limited set, not in general,
> > and does not scale. In particular, once the length of the needle
> > exceeds the size of the alphabet, it can't continue to scale this way.
> 
> Indeed, Two-way is not sublinear as often as Quicksearch, and that's an
> inherent drawback of the algorithm.

It's sublinear in a superset of the cases where "quicksearch" is
sublinear: whenever you get to use the bad character table (which works
the same for both algorithms), *plus* when a mismatch in the left
factor lets you advance by a whole period. This latter class of cases
mostly helps when the period is large and the right factor is short,
but that's an important class of real-world cases (most needles are
non-periodic, i.e. period = strlen(needle), and the right factor is
very short). In some cases, the maximal-suffix algorithm selects a
gratuitously long right factor; it may be possible to optimize this
selection further.
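
For reference, the bad character table both algorithms share is
roughly the classic Horspool one; a sketch, not glibc's actual code:

#include <stddef.h>

#define ALPHABET_SIZE 256

/* shift[c] is how far the window may slide when the byte under its
   last position is c.  A byte that never occurs in the needle allows
   a full length-m skip; that's where the sublinear behaviour of both
   algorithms comes from.  */
static void
build_bad_char_table (const unsigned char *needle, size_t m,
                      size_t shift[ALPHABET_SIZE])
{
  for (size_t c = 0; c < ALPHABET_SIZE; c++)
    shift[c] = m;
  for (size_t i = 0; i + 1 < m; i++)
    shift[needle[i]] = m - 1 - i;
}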

> The alphabet size is not relevant, it's

I wasn't precise in stating what I meant here, but it's not important.

> about the occurrence frequencies of characters in the input - a bad character
> table happily jumps ahead by a million if a character does not appear in a
> huge needle. It's just unlikely to happen frequently on typical inputs. So for
> large inputs you don't see performance increasing further; it levels off.

Worst-case is never about what's likely to happen in the haystack but
about what can happen.

> > Two-way is essentially the most modern algorithm there is. Have you
> > read any of the literature on this topic?
> 
> Two-way is almost three decades old now. It's not modern in any sense of the
> word. I've looked at lots of papers on modern search algorithms; practically
> all are about real-world performance (on English text, DNA and gene sequences)
> and are O(nm) - few papers consider real-time or constant-space searching.

For DNA, with its tiny four-letter alphabet, quadratic time is even worse.

> >> common cases. There is only very specific small set of inputs which can
> >> trigger worst-case behaviour, and those are never typical inputs. And it's
> >> easy to fall back to a different algorithm in such a case.
> >
> > Citation needed. If it's easy to detect them, prove it. You'll find
> > it's not. You can of course do an introsearch that counts the number
> > of times it backtracks and switches to two-way, but by that time you
> > will have spent many times the setup cost of two-way and it will be a
> > lot slower.
> 
> It's trivial by keeping track of how many comparisons you do vs how much
> progress you make, e.g. comparisons_done > N * (cur_pos - start_pos) for
> a fixed N. But you can make it even simpler and do what I did - needles
> longer than 256 bytes are rare and so their performance is less important.

Quicksearch performs *utterly abysmally* on plenty of needles shorter
than 256 or even shorter than 50. You cannot use it based on the
length of the needle except possibly for very short needles (maybe
<16), and even then you will do worse on a lot of real-world cases.
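
Spelled out, the check being proposed looks something like this (the
naive loop, the constant N, and the two_way_strstr fallback are all
hypothetical):

#include <stddef.h>
#include <string.h>

/* Hypothetical linear-time fallback, e.g. the existing Two-way.  */
extern char *two_way_strstr (const char *haystack, const char *needle);

char *
intro_strstr (const char *haystack, const char *needle)
{
  size_t m = strlen (needle);
  size_t comparisons = 0;
  const size_t N = 8;  /* arbitrary tuning constant */

  if (m == 0)
    return (char *) haystack;

  for (size_t pos = 0; haystack[pos] != '\0'; pos++)
    {
      size_t i = 0;
      while (needle[i] != '\0' && haystack[pos + i] == needle[i])
        {
          comparisons++;
          i++;
        }
      comparisons++;  /* count the final compare of this alignment */
      if (needle[i] == '\0')
        return (char *) (haystack + pos);
      if (comparisons > N * (pos + 1))
        /* Quadratic blow-up detected.  Note that by this point we've
           already spent several times Two-way's setup cost, which is
           exactly the objection above.  */
        return two_way_strstr (haystack + pos, needle);
    }
  return NULL;
}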

> >> Actually it's trivial to find the worst case for each algorithm, you just ensure
> >> you hit the slowest path all the time.
> >
> > This is not a question of finding the worst case. It's a question of
> > efficiently evaluating if a given case is one of the bad ones.
> 
> Why on earth would you want to do that dynamically? We can compare worst
> cases offline and choose what algorithm to run for particular needle/haystack
> sizes. And that is sufficient to ensure linear time in the worst case.

You can do this if you're implementing a custom search for your
application. You cannot do this for the libc strstr because it does
not know what the application is. It needs to be able to perform
non-pathologically for any inputs.

> >> > Calling blatant undefined behavior "a theoretical issue" is not
> >> > something we should be doing in 2018.
> >> 
> >> It's 100% theoretical - or can you show an example that could fail?
> >
> > With UBSan and LTO it should trap and abort.
> 
> I mean a real-world failure on an actual supported target, not something
> that's not actually possible.

This is a real-world near-future target, and introducing gratuitous
breakage for it now is utterly backwards, especially when it's so easy
to avoid (skipping the -- ahead of time and just adding a -1 to each
[]).
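
Concretely (illustrative, not the patch's actual names):

#include <stddef.h>

static unsigned char table[256];

/* UB variant, as in the patch: form a pointer one element before the
   array so that 1-based indices can be used directly.

     unsigned char *shift = table;
     shift--;            <- undefined: points before the array
     ... shift[i] ...       (i in 1..256)

   Well-defined equivalent: skip the decrement and fold the -1 into
   each subscript.  */
static unsigned char
lookup (size_t i)  /* i in 1..256 */
{
  return table[i - 1];
}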

Rich

