This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH] Use Quicksearch in strstr

From: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
To: Rich Felker <dalias at libc dot org>
Cc: 'GNU C Library' <libc-alpha at sourceware dot org>, nd <nd at arm dot com>, Alexander Monakov <amonakov at ispras dot ru>
Date: Tue, 6 Nov 2018 18:26:57 +0000
Subject: Re: [PATCH] Use Quicksearch in strstr
References: <DB5PR08MB1030E559B3818F52CA48A63583F30@DB5PR08MB1030.eurprd08.prod.outlook.com>,<20181029225421.GA21482@domone>,<DB5PR08MB1030CE58A51EDD8FE90D80DD83CD0@DB5PR08MB1030.eurprd08.prod.outlook.com>,<DB5PR08MB1030428E777D456220B6F09383CB0@DB5PR08MB1030.eurprd08.prod.outlook.com>

Rich Felker wrote:

> Real world performance includes performance under the control of a
> hostile input source, under which naive algorithms will perform
> hundreds of times slower. This "quicksearch" is incredibly slow on
> a{253}ba, and it's as slow as the most naive possible strstr whenever
> the last character is abundant in the haystack.

That's not true at all - in reality there is hardly any difference in worst case 
performance. Comparing a{253}ba on Quicksearch vs the ba{252}b on 
Two-way using GLIBC 2.27 (original version of Two-way) shows Two-way
is only about 16% faster. If the worst-case performance of Two-way was
deemed acceptable, then the worst-case of Quicksearch must be too.

>> good search algorithms are O(nm) and they beat any O(n) algorithm by
>> huge factors on actual inputs. That should teach you that a concept like
>> the O notation does not guarantee good performance.
>
> By a small factor at best, not a huge factor.

Getting another 2-3x speedup after doubling performance is a huge factor.
The previous implementation was doing effectively nothing useful 75-85%
of the time...

>> One comparison per byte is slow indeed. Modern search algorithms
>> are sublinear and achieve O(n/m) comparisons. You can see this when
>> benchmarking - performance actually increases with longer needles.
>
> Two-way of course achieves this sublinear performance for common
> cases, but it's only possible for a very limited set, not in general,
> and does not scale. In particular, once the length of the needle
> exceeds the size of the alphabet, it can't continue to scale this way.

Two-way is not sublinear as often as Quicksearch indeed, and that's an
inherent drawback of the algorithm. The alphabet size is not relevant, it's
about the occurence frequencies of characters in the input - a bad character
table happily jumps ahead by a million if a character does not appear in a
huge needle. It's just unlikely to happen frequently on typical inputs. So for
large inputs you don't see performance increasing further but level off.

> Two-way is essentially the most modern algorithm there is. Have you
> read any of the literature on this topic?

Two-way is almost 3 decades old now. It's not modern in any sense of the
word. I've looked at lots of papers of modern search algorithms, practically
all are about real-world performance (on English, DNA and gene sequences)
and are O(nm) - few papers consider real-time or constant-space searching.

>> common cases. There is only very specific small set of inputs which can
>> trigger worst-case behaviour, and those are never typical inputs. And it's
>> easy to fall back to a different algorithm in such a case.
>
> Citation needed. If it's easy to detect them, prove it. You'll find
> it's not. You can of course do an introsearch that counts the number
> of times it backtracks and switches to two-way, but by that time you
> will have spent many times the setup cost of two-way and it will be a
> lot slower.

It's trivial by keeping track of how many comparisons you do vs how much
progress you make, eg. comparisons_done > N * (cur_pos - start_pos) for
a fixed N. But you can make it even simpler and do what I did - needles
larger than 256 are rare and so their performance is less important.

>> Actually it's trivial to find the worst case for each algorithm, you just ensure
>> you hit the slowest path all the time.
>
> This is not a question of finding the worst case. It's a question of
> efficiently evaluating if a given case is one of the bad ones.

Why on earth would you want to do that dynamically? We can compare worst
cases offline and choose what algorithm to run for particular needle/haystack
sizes. And that is sufficient to ensure linear time in the worst case.

>> > Calling blatant undefined behavior "a theoretical issue" is not
>> > something we should be doing in 2018.
>> 
>> It's 100% theoretical - or you can show an example that could fail?
>
> With UBSan and LTO it should trap and abort.

I mean a real-world failure on an actual supported target, not something
that's not actually possible.

Wilco

Follow-Ups:
- Re: [PATCH] Use Quicksearch in strstr
  - From: Rich Felker
- Re: [PATCH] Use Quicksearch in strstr
  - From: Wilco Dijkstra

References:
- Re: [PATCH] Use Quicksearch in strstr
  - From: Wilco Dijkstra

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]