Re: [PATCH] Use Quicksearch in strstr
- From: Rich Felker <dalias at libc dot org>
- To: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
- Cc: Alexander Monakov <amonakov at ispras dot ru>, 'GNU C Library' <libc-alpha at sourceware dot org>, nd <nd at arm dot com>
- Date: Thu, 1 Nov 2018 12:42:35 -0400
- Subject: Re: [PATCH] Use Quicksearch in strstr
- References: <DB5PR08MB1030E559B3818F52CA48A63583F30@DB5PR08MB1030.eurprd08.prod.outlook.com> <alpine.LNX.2.20.13.1810291936040.19896@monopod.intra.ispras.ru> <DB5PR08MB1030B4FB720644283130A84A83CC0@DB5PR08MB1030.eurprd08.prod.outlook.com>
On Tue, Oct 30, 2018 at 05:32:12PM +0000, Wilco Dijkstra wrote:
> Alexander Monakov wrote:
> > What does the worst-case change look like (I think it's searching for "a{251}ba"
> > in "aaa...aaa" if memcmp works left-to-right)? Does it make sense to move the
> > needle-length cutoff further below 254?
This "quicksearch" is fundamentally a bad algorithm. If it looks fast
in some cases, it's just a matter of getting lucky, and most likely
due to bad tuning or missed optimizations in the implementation of the
algorithm you're comparing it against. It's fundamentally a naive
O(nm) search that will be this bad in all cases except where the
bad-character table happens to let you advance a lot.
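For reference, here is a minimal sketch of the quicksearch (Sunday)
scheme as I understand it -- my own illustration, not the code from the
patch. The only sublinearity comes from the shift table: when the
haystack byte just past the window occurs near the end of the needle,
every shift is 1 and memcmp rescans the needle at each position, which
is exactly the O(nm) case.

#include <string.h>
#include <stddef.h>

/* Illustrative quicksearch/Sunday sketch, not the patch under review.  */
static const char *
quicksearch (const char *hs, size_t n, const char *ne, size_t m)
{
  size_t shift[256];

  /* Byte absent from the needle: skip the whole window plus one.  */
  for (size_t i = 0; i < 256; i++)
    shift[i] = m + 1;
  /* Otherwise shift by the distance from its last occurrence.  */
  for (size_t i = 0; i < m; i++)
    shift[(unsigned char) ne[i]] = m - i;

  for (size_t j = 0; j + m <= n; )
    {
      if (memcmp (hs + j, ne, m) == 0)  /* O(m) rescan on near-misses.  */
        return hs + j;
      if (j + m == n)
        break;
      /* Shift on the byte just past the window; this is 1 whenever
         that byte equals the last byte of the needle.  */
      j += shift[(unsigned char) hs[j + m]];
    }
  return NULL;
}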
> To be fair you have to compare the worst case of one algorithm vs the worst
> case of another since they are different... Two-way is slow when searching for
> ba{254}b in a long sequence of ba{253}ba{253}. Two-way is only about 3 times
> faster in this case despite having a theoretical factor of 256/2 advantage.
Do you have a basis for calling this "slow"? It should be roughly one
comparison per byte. "Only 3x faster" doesn't sound slow to me.
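Both inputs are easy to construct; here is a hypothetical driver (mine,
with sizes taken from the messages above, not code from the thread)
that builds them, so the claims can actually be timed against glibc's
strstr or against a naive scheme like the sketch above.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main (void)
{
  size_t n = 1 << 20;
  char *hs = malloc (n + 1);
  if (hs == NULL)
    return 1;

  /* Quicksearch killer: "a"x251 "ba" in "aaa...".  Every shift is 1
     and a left-to-right memcmp rescans ~251 bytes per position.  */
  memset (hs, 'a', n);
  hs[n] = '\0';
  char ne1[254];
  memset (ne1, 'a', 251);
  ne1[251] = 'b';
  ne1[252] = 'a';
  ne1[253] = '\0';
  printf ("quicksearch killer: %s\n",
          strstr (hs, ne1) ? "found" : "not found");

  /* Two-way's bad case from above: ba{254}b in repeated ba{253}.
     Roughly two comparisons per haystack byte, but still O(n).  */
  for (size_t i = 0; i < n; i++)
    hs[i] = (i % 254 == 0) ? 'b' : 'a';
  char ne2[257];
  ne2[0] = 'b';
  memset (ne2 + 1, 'a', 254);
  ne2[255] = 'b';
  ne2[256] = '\0';
  printf ("two-way worst case: %s\n",
          strstr (hs, ne2) ? "found" : "not found");

  free (hs);
  return 0;
}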
> I don't think there is any gain in slowing down common cases in order to
> reduce the worst-case difference (assuming that is feasible).
This argument gets made about strstr every couple of years, and every
time it's wrong. It's not that "common cases" (as a general class) are
faster with any naive algorithm. Only very specific cases will ever be
faster. Most common cases are roughly the same speed as with a good
algorithm, and some common cases, plus a lot more uncommon ones, are
catastrophically slower with a bad algorithm.
To my knowledge there is no good test for a case that won't become
catastrophically slower with a bad algorithm. If there is, I would be
very surprised if the same test doesn't naturally translate into an
optimized setup for two-way.
> > +  do
> > +    {
> > +      hs--;
>
> > Doesn't this move hs out of bounds and thus invoke undefined behavior?
>
> That's a theoretical issue given *(a+(i-1)), *((a-1)+i) and *((a+i)-1) behave
> identically.
Calling blatant undefined behavior "a theoretical issue" is not
something we should be doing in 2018.
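To spell this out (a sketch of my own, not the patch code): C makes
merely computing a pointer before the start of an object undefined
(C11 6.5.6p8), whether or not it is ever dereferenced, so those three
spellings are only "identical" until a compiler exploits the
difference.

#include <stddef.h>

/* Undefined variant: when no match exists (or len == 0), the loop
   forms buf - 1, which is UB even though it is never dereferenced.  */
const char *
scan_back_ub (const char *buf, size_t len, char c)
{
  for (const char *p = buf + len - 1; p >= buf; p--)
    if (*p == c)
      return p;
  return NULL;
}

/* Well-defined variant: the cursor stays within [buf, buf + len]
   and is decremented only after the bounds check.  */
const char *
scan_back_ok (const char *buf, size_t len, char c)
{
  for (const char *p = buf + len; p != buf; )
    {
      p--;
      if (*p == c)
        return p;
    }
  return NULL;
}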
Rich