This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Use Quicksearch in strstr


On Thu, Nov 01, 2018 at 03:27:38PM -0400, Rich Felker wrote:
> On Thu, Nov 01, 2018 at 12:42:35PM -0400, Rich Felker wrote:
> > On Tue, Oct 30, 2018 at 05:32:12PM +0000, Wilco Dijkstra wrote:
> > > Alexander Monakov wrote:
> > > > How does the worst case change look like (I think it's searching for "a{251}ba"
> > > > in "aaa...aaa" if memcmp works left-to-right)?  Does it make sense to move
> > > > needle length cutoff further below 254?
> > 
> > This "quicksearch" is fundamentally a bad algorithm. If it looks fast
> > in some cases, it's just a matter of getting lucky, and most likely
> > due to bad tuning or missed optimizations in the implementation of the
> > algorithm you're comparing it against. It's fundamentally a naive
> > O(nm) search that will be this bad in all cases except where the
> > bad-character table happens to let you advance a lot.
> > 
> > > To be fair you have to compare the worst case of one algorithm vs the worst
> > > case of another since they are different... Two-way is slow on searching for
> > > ba{254}b in a long sequence of ba{253}ba{253}. Two-way is only about 3 times
> > > faster on this case despite having a theoretical factor of 256/2 advantage.
> > 
> > Do you have a basis for calling this "slow"? It should be roughly one
> > comparison per byte. "Only 3x faster" doesn't sound slow to me.
> 
> I think there may actually be a bug (just suboptimality) in the
> two-way implementation related to this case. It's a periodic needle
> that uses 'memory', and glibc has (abbreviated):
> 
> 	if (memory && shift < period)
> 		shift = needle_len - period;
> 
> From what I recall (having implemented the same), this is to ensure
> that it doesn't advance too little, requiring re-comparing a segment
> already compared as the right-hand factor. However, it's possible that
> shift is already larger than needle_len - period, and in that case I
> see no reason to use the memory rather than advancing by the longer
> amount.
> 
> It's certainly not true in general that the shift of a given byte is
> bounded by needle_len - period. For example, zabc...xyz factors as
> [z][abc...xyz] where 'a' has a shift of 25, and needle_len - period is
> 1. Of course in this case, after the truncated shift by 1, the memory
> will be gone, and the next shift won't be truncated, so it might not
> matter in the total run time.
> 
> Anyone else interested in looking at this to confirm if my findings
> make sense?

I'm pretty sure this is a logic bug. Except for bytes that don't
appear in the needle (for which shift==needle_len), shift < period is
automatically true. The needle being periodic means that each byte
which appears in it appears within the last 'period' positions of it.

I suspect what was actually intended was:

	if (memory && shift < needle_len - period)
		shift = needle_len - period;

which can be simplified to:

	if (shift < memory)
		shift = memory;

Is this okay?

Rich


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]