This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Use Quicksearch in strstr


On Thu, Nov 01, 2018 at 12:42:35PM -0400, Rich Felker wrote:
> On Tue, Oct 30, 2018 at 05:32:12PM +0000, Wilco Dijkstra wrote:
> > Alexander Monakov wrote:
> > > How does the worst case change look like (I think it's searching for "a{251}ba"
> > > in "aaa...aaa" if memcmp works left-to-right)?  Does it make sense to move
> > > needle length cutoff further below 254?
> 
> This "quicksearch" is fundamentally a bad algorithm. If it looks fast
> in some cases, it's just a matter of getting lucky, and most likely
> due to bad tuning or missed optimizations in the implementation of the
> algorithm you're comparing it against. It's fundamentally a naive
> O(nm) search that will be this bad in all cases except where the
> bad-character table happens to let you advance a lot.
> 
> > To be fair you have to compare the worst case of one algorithm vs the worst
> > case of another since they are different... Two-way is slow on searching for
> > ba{254}b in a long sequence of ba{253}ba{253}. Two-way is only about 3 times
> > faster on this case despite having a theoretical factor of 256/2 advantage.
> 
> Do you have a basis for calling this "slow"? It should be roughly one
> comparison per byte. "Only 3x faster" doesn't sound slow to me.

I think there may actually be a bug (just suboptimality) in the
two-way implementation related to this case. It's a periodic needle
that uses 'memory', and glibc has (abbreviated):

	if (memory && shift < period)
		shift = needle_len - period;

>From what I recall (having implemented the same), this is to ensure
that it doesn't advance too little, requiring re-comparing a segment
already compared as the right-hand factor. However, it's possible that
shift is already larger than needle_len - period, and in that case I
see no reason to use the memory rather than advancing by the longer
amount.

It's certainly not true in general that the shift of a given byte is
bounded by needle_len - period. For example, zabc...xyz factors as
[z][abc...xyz] where 'a' has a shift of 25, and needle_len - period is
1. Of course in this case, after the truncated shift by 1, the memory
will be gone, and the next shift won't be truncated, so it might not
matter in the total run time.

Anyone else interested in looking at this to confirm if my findings
make sense?

Rich


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]