This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH][BZ #12100] strstr with unaligned loads.
- From: Rich Felker <dalias at aerifal dot cx>
- To: OndÅej BÃlka <neleai at seznam dot cz>
- Cc: libc-alpha at sourceware dot org
- Date: Tue, 27 Aug 2013 04:45:00 -0400
- Subject: Re: [PATCH][BZ #12100] strstr with unaligned loads.
- Authentication-results: sourceware.org; auth=none
- References: <20130825090758 dot GA23686 at domone dot kolej dot mff dot cuni dot cz> <20130825155406 dot GV20515 at brightrain dot aerifal dot cx> <20130826070925 dot GA4292 at domone dot kolej dot mff dot cuni dot cz> <20130826162446 dot GZ20515 at brightrain dot aerifal dot cx> <20130827020414 dot GA8611 at domone dot kolej dot mff dot cuni dot cz> <20130827022042 dot GH20515 at brightrain dot aerifal dot cx> <20130827083048 dot GA12761 at domone dot kolej dot mff dot cuni dot cz>
On Tue, Aug 27, 2013 at 10:30:48AM +0200, OndÅej BÃlka wrote:
> > winning on random testcases is meaningless since it has quadratic
> > performance in the worst-case.
> > For strstr random input is always the
> > easy case.
>
> How good is it on binary alphabeth?
Two-way is very good on binary alphabet, and on 4-letter alphabet
(e.g. DNA). The bad character table optimization is rather useless on
small alphabets, of course, but that's not a core part of the
algorithm, just an optimization that helps achieve sublinear
performance in many real-world usages (while maintaining worst-case
linear performance).
Note that having an optimized search for the next occurrance of the
first character or two is also rather useless in a binary alphabet...
> > > > presumably because musl always uses the bad-character table). There
> > > > may be additional logic considerations to integrating any optimized
> > > > "next pair" search with the bad character table; I have not checked
> > > > this.
> > > >
> > > Bad character is also heuristic that is not needed when pair heuristic
> > > is used as it would only slow it down.
> >
> > Bad character table changes the typical runtime from O(n) to O(n/m).
> > This is HUGE.
>
> You are making basic mistake of arguing that some algorithm is better
> in best case.
My above comments were comparing two-way with bad character table to
plain two-way. Bad character table does not always help (it just helps
in many typical real-world cases), but it never hurts more than a
small O(m) (m = needle length) startup cost. In my implementation, I
combine this with the operation of computing the needle length. Of
course strlen is a faster O(m) operation by several times, so one
could argue that the bad character table still "costs too much", but
empirically I found my version of two-way (in musl) outperforming
glibc's version of two-way (not the SSE code in glibc) on many
reasonable inputs where the needle was 16-31 bytes.
> > > > On Sun, Aug 25, 2013 at 11:07:58AM +0200, OndÅej BÃlka wrote:
> > > > > This could be solved by standard buy or rent techique which I employ.
> > > > > For any position p there could be only p+64n+512 comparisons until I
> > > > > switch to two way algorithm which gives linear complexity.
> >
> > That's a reasonable solution, but I think it would be even better to
> > integrate your SSE tricks with two-way.
> >
> Its not worth doing from my perspective as factorization step cost
> around 20 cycles per needle byte. In that time I could often find three
> times that there is no match.
I think we'd need to see more empirical data to determine this. Your
switchover threshold is not exactly cheap. If the intent is just to
give good performance for "typical" (e.g. English text) uses of
strstr, and otherwise only avoid pathologically bad performance that
could be used for DoS, then your approach may be reasonable. On the
other hand, if you want to support arbitrary reasonable uses of strstr
(like DNA searches in an entire genome), it's probably not adequate.
But that's just my semi-educated guess. A real determination would
require empirical measurement.
Rich