This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Use Quicksearch in strstr


On Tue, Nov 06, 2018 at 01:02:27PM +0000, Wilco Dijkstra wrote:
> Rich Felker wrote:
> 
> > This "quicksearch" is fundamentally a bad algorithm. If it looks fast
> > in some cases, it's just a matter of getting lucky, and most likely
> > due to bad tuning or missed optimizations in the implementation of the
> > algorithm you're comparing it against. It's fundamentally a naive
> > O(nm) search that will be this bad in all cases except where the
> > bad-character table happens to let you advance a lot.
> 
> I disagree. What's important is real world performance. Practically all

Real world performance includes performance under the control of a
hostile input source, under which naive algorithms will perform
hundreds of times slower. This "quicksearch" is incredibly slow on
a{253}ba, and it's as slow as the most naive possible strstr whenever
the last character is abundant in the haystack.

> good search algorithms are O(nm) and they beat any O(n) algorithm by
> huge factors on actual inputs. That should teach you that a concept like
> the O notation does not guarantee good performance.

By a small factor at best, not a huge factor.

> > Do you have a basis for calling this "slow"? It should be roughly one
> > comparison per byte. "Only 3x faster" doesn't sound slow to me.
> 
> One comparison per byte is slow indeed. Modern search algorithms
> are sublinear and achieve O(n/m) comparisons. You can see this when
> benchmarking - performance actually increases with longer needles.

Two-way of course achieves this sublinear performance for common
cases, but sublinear behavior is only possible for a limited set of
inputs, not in general, and it does not scale. In particular, once the
length of the needle exceeds the size of the alphabet, the search
can't continue to scale this way.

> >> I don't think there is any gain slowing down common cases in order to
> >> reduce the worst-case difference (assuming that is feasible).
> >
> > This argument gets made about strstr every couple years, and every
> > time it's wrong. It's not that "common cases" (as a general class) are
> > faster with any naive algorithm. Only very specific cases will ever be
> > faster. Most common cases are roughly the same speed as with a good
> > algorithm, and some common cases, plus a lot more uncommon ones, are
> > catastrophically slower with a bad algorithm.
> 
> Actually it's the correct argument. Common cases are indeed faster with
> modern algorithms - slow algorithms like Two-way can't speed up those

Two-way is essentially the most modern algorithm there is. Have you
read any of the literature on this topic?

> common cases. There is only a very specific, small set of inputs which can
> trigger worst-case behaviour, and those are never typical inputs. And it's
> easy to fall back to a different algorithm in such a case.

Citation needed. If it's easy to detect them, prove it. You'll find
it's not. You can of course do an introsearch that counts the number
of times it backtracks and switches to two-way, but by that time you
will have spent many times the setup cost of two-way and it will be a
lot slower.

> > To my knowledge there is no good test for a case that won't become
> > catastrophically slower with a bad algorithm. If there is, I would be
> > very surprised if the same test doesn't naturally translate into an
> > optimized setup for two-way.
> 
> Actually it's trivial to find the worst case for each algorithm, you just ensure
> you hit the slowest path all the time.

This is not a question of finding the worst case. It's a question of
efficiently evaluating if a given case is one of the bad ones.

> > Calling blatant undefined behavior "a theoretical issue" is not
> > something we should be doing in 2018.
> 
> It's 100% theoretical - or you can show an example that could fail?

With UBSan and LTO it should trap and abort.

> It's crazy to even bring this stuff up in the 21st century. Software that does
> not rely on undefined behaviour one way or another simply does not exist.
> So the question becomes, is that a fault with all our software or are the language
> standards simply decades behind common practice?

This kind of claim is not making your case.

Rich

