This is the mail archive of the mailing list for the glibc project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Use Quicksearch in strstr

Rich Felker wrote:
> On Tue, Nov 06, 2018 at 06:26:57PM +0000, Wilco Dijkstra wrote:

>> That's not true at all - in reality there is hardly any difference in worst case 
>> performance. Comparing a{253}ba on Quicksearch vs the ba{252}b on 
>> Two-way using GLIBC 2.27 (original version of Two-way) shows Two-way
>> is only about 16% faster. If the worst-case performance of Two-way was
>> deemed acceptable, then the worst-case of Quicksearch must be too.
> I don't know what figures you're talking about, but the worst-case for
> "quicksearch" with a bound of 256 bytes is easily over 100x slower
> than the worst case of two-way.

Not a chance. You do realize I am measuring this stuff for real rather than
just making up random numbers? Worst-case performance of both algorithms
is very similar for needles up to 256. It's not possible to get 10x difference, let
alone 100x.

>> Getting another 2-3x speedup after doubling performance is a huge factor.
>> The previous implementation was doing effectively nothing useful 75-85%
>> of the time...
> Then fix implementation bugs. Don't replace good algorithms with awful
> ones. For any given needle, it's definitely possible to make it such
> that,...

I already did that, remember? If you believe you can do better then post further
performance improvements.

>> > Two-way is essentially the most modern algorithm there is. Have you
>> > read any of the literature on this topic?
>> Two-way is almost 3 decades old now. It's not modern in any sense of the
>> word. I've looked at lots of papers of modern search algorithms, practically
>> all are about real-world performance (on English, DNA and gene sequences)
>> and are O(nm) - few papers consider real-time or constant-space searching.
> For DNA, quadratic-time is even worse.

The goal is to be fast on real data, not some made up worst case. Actual DNA
inputs don't hit bad quadratic cases. So researchers don't bother with the slower
linear algorithms.

> Quicksort performs *utterly abysmally* on plenty of needles shorter
> than 256 or even shorter than 50. You cannot use it based on the
> length of the needle except possibly for very short needles (maybe
> <16), and even then you will do worse on a lot of real-world cases.

Citation needed. You clearly haven't ever tried Quicksearch. It beats Two-way 
in every single case on real inputs. The length of the needle matters only for the
worst case, and by limiting the needle size, you avoid quadratic behaviour.

> You can do this if you're implementing a custom search for your
> application. You cannot do this for the libc strstr because it does
> not know what the application is. It needs to be able to perform
> non-pathologically for any inputs.

Limiting the needle size to avoid quadratic behaviour is perfectly fine for GLIBC
since the goal is to avoid denial of service via specially crafted inputs.

It's impossible for strstr to always have identical performance on all possible
inputs - practically every algorithm has fast and slow cases. Rabin-Karp is
perhaps an exception given a good hash function. It's also 10x faster than
Two-way slow cases, so it's a candidate to replace Two-way.

>> >> > Calling blatant undefined behavior "a theoretical issue" is not
>> >> > something we should be doing in 2018.
>> >> 
>> >> It's 100% theoretical - or you can show an example that could fail?
>> >
>> > With UBSan and LTO it should trap and abort.
>> I mean a real-world failure on an actual supported target, not something
>> that's not actually possible.
> This is a real-world near-future target, and introducing gratuitous
> breakage for it now is utterly backwards, especially when it's so easy
> to avoid (skipping the -- ahead of time and just adding a -1 to each
> []).

It seems unlikely it would ever work across dynamic library boundaries, but 
even if you replace the haystack input in strstr with a hardcoded array, UBSan
does not report a problem. The only way I could make it report anything at all
was to use hs = &s[-1].


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]