This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH] Use Quicksearch in strstr
- From: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
- To: Rich Felker <dalias at libc dot org>
- Cc: 'GNU C Library' <libc-alpha at sourceware dot org>, nd <nd at arm dot com>, Alexander Monakov <amonakov at ispras dot ru>
- Date: Tue, 6 Nov 2018 13:02:27 +0000
- Subject: Re: [PATCH] Use Quicksearch in strstr
- References: <DB5PR08MB1030E559B3818F52CA48A63583F30@DB5PR08MB1030.eurprd08.prod.outlook.com>,<20181029225421.GA21482@domone>,<DB5PR08MB1030CE58A51EDD8FE90D80DD83CD0@DB5PR08MB1030.eurprd08.prod.outlook.com>
Rich Felker wrote:
> This "quicksearch" is fundamentally a bad algorithm. If it looks fast
> in some cases, it's just a matter of getting lucky, and most likely
> due to bad tuning or missed optimizations in the implementation of the
> algorithm you're comparing it against. It's fundamentally a naive
> O(nm) search that will be this bad in all cases except where the
> bad-character table happens to let you advance a lot.
I disagree. What's important is real world performance. Practically all
good search algorithms are O(nm) and they beat any O(n) algorithm by
huge factors on actual inputs. That should teach you that a concept like
the O notation does not guarantee good performance.
> Do you have a basis for calling this "slow"? It should be roughly one
> comparison per byte. "Only 3x faster" doesn't sound slow to me.
One comparison per byte is slow indeed. Modern search algorithms
are sublinear and achieve O(n/m) comparisons. You can see this when
benchmarking - performance actually increases with longer needles.
>> I don't think there is any gain slowing down common cases in order to
>> reduce the worst-case difference (assuming that is feasibe).
> This argument gets made about strstr every couple years, and every
> time it's wrong. It's not that "common cases" (as a general class) are
> faster with any naive algorithm. Only very specific cases will ever be
> faster. Most common cases are roughly the same speed as with a good
> algorithm, and some common cases, plus a lot more uncommon ones, are
> catastrophically slower with a bad algorithm.
Actually it's the correct argument. Common cases are indeed faster with
modern algorithms - slow algorithms like Two-way can't speed up those
common cases. There is only very specific small set of inputs which can
trigger worst-case behaviour, and those are never typical inputs. And it's
easy to fall back to a different algorithm in such a case.
> To my knowledge there is no good test for a case that won't become
> catastrophically slower with a bad algorithm. If there is, I would be
> very surprised if the same test doesn't naturally translate into an
> optimized setup for two-way.
Actually it's trivial to find the worst case for each algorithm, you just ensure
you hit the slowest path all the time.
> Calling blatant undefined behavior "a theoretical issue" is not
> something we should be doing in 2018.
It's 100% theoretical - or you can show an example that could fail?
It's crazy to even bring this stuff up in the 21th century. Software that does
not rely on undefined behaviour one way or another simply does not exist.
So the question becomes, is that a fault with all our software or are the language
standards simply decades behind common practice?