This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH][BZ #12100] strstr with unaligned loads.
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Rich Felker <dalias at aerifal dot cx>
- Cc: libc-alpha at sourceware dot org
- Date: Tue, 27 Aug 2013 10:30:48 +0200
- Subject: Re: [PATCH][BZ #12100] strstr with unaligned loads.
- Authentication-results: sourceware.org; auth=none
- References: <20130825090758 dot GA23686 at domone dot kolej dot mff dot cuni dot cz> <20130825155406 dot GV20515 at brightrain dot aerifal dot cx> <20130826070925 dot GA4292 at domone dot kolej dot mff dot cuni dot cz> <20130826162446 dot GZ20515 at brightrain dot aerifal dot cx> <20130827020414 dot GA8611 at domone dot kolej dot mff dot cuni dot cz> <20130827022042 dot GH20515 at brightrain dot aerifal dot cx>
On Mon, Aug 26, 2013 at 10:20:42PM -0400, Rich Felker wrote:
> On Tue, Aug 27, 2013 at 04:04:14AM +0200, Ondřej Bílka wrote:
> > > Note that the above-quoted code is only for the short-needle case
> > > without the bad character table, which I also claim should be removed
> > > (empirically, glibc is much slower than musl on short needles,
> > False, on random testcases sse4.2 often beats musl by a 50% factor. I also
> > tested musl on noise and it is 20% slower due to factorization overhead.
> I'm assuming use of the two-way strstr code in glibc. The sse4.2 code
It's musl's two-way strstr, as was written here:
> > False, on random testcases sse4.2 often beats musl
> winning on random testcases is meaningless since it has quadratic
> performance in the worst-case.
> For strstr, random input is always the
> easy case.
How good is it on a binary alphabet?
> > > presumably because musl always uses the bad-character table). There
> > > may be additional logic considerations to integrating any optimized
> > > "next pair" search with the bad character table; I have not checked
> > > this.
> > >
> > Bad character is also a heuristic that is not needed when the pair heuristic
> > is used, as it would only slow it down.
> Bad character table changes the typical runtime from O(n) to O(n/m).
> This is HUGE.
You are making the basic mistake of arguing that some algorithm is better
in its best case. There is no guarantee that a user will ask for a needle
larger than 16 bytes, nor can you enlarge needles to 100 bytes. Most needles
in practice are at most 8 bytes, and getting anything longer is unlikely.
> For example, searching for a 100-byte needle in a
> 10000-byte haystack can often complete with just 100 comparisons.
You forgot that a 100-byte needle will fill much of the shift table;
for English text it could decrease the probability of a successful skip to 5% or less.
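For reference, the bad-character shift table under discussion can be sketched as a Horspool-style memmem. This is my illustration of the heuristic only, not glibc's or musl's actual code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Horspool-style sketch of the bad-character heuristic: on each probe
   the last byte of the current window picks the shift, so on typical
   text the search advances close to nn bytes per probe (the O(n/m)
   average-case behavior discussed above). */
static const char *bc_search(const char *hay, size_t hn,
                             const char *nee, size_t nn)
{
    size_t shift[256];
    for (size_t i = 0; i < 256; i++)
        shift[i] = nn;                    /* byte absent: skip whole needle */
    for (size_t i = 0; i + 1 < nn; i++)   /* last needle byte keeps shift nn */
        shift[(unsigned char)nee[i]] = nn - 1 - i;

    for (size_t pos = 0; pos + nn <= hn;
         pos += shift[(unsigned char)hay[pos + nn - 1]])
        if (memcmp(hay + pos, nee, nn) == 0)
            return hay + pos;
    return NULL;
}
```

Note the tradeoff at issue in the thread: the table pays off when the last window byte is usually absent from the needle, which is exactly what a long needle over a small alphabet (English text, binary data) destroys.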
> > There is a lower bound: you need to inspect each character to check
> > if it is zero.
> Fair enough, though this does not apply to memmem...
> You could also cheat and allow reading past the end of the string up
> to the next page boundary, but valgrind users would hate you for
> this... ;-)
> > Provided that pairs are unlikely, I could get within a
> > factor of 2 of that lower bound. Here I traded worse benchmark
> > performance for better real-world performance. For a benchmark-oriented
> > version I would search for triples, which would cut most of the misprediction.
> If you can update two-way for C strings to integrate the pair search
> with the end-of-string checking, I think you could have an unbeatable
> solution. Right now, glibc just has an unbeatably slow solution, and
> fake benchmarks (random input) making it look fast. Next time try
> benchmarking searches for a^Mb in a^Nb or something else hard.
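The adversarial a^M b in a^N b case is easy to make concrete: a naive quadratic scan, instrumented to count character comparisons, does roughly (N-M)*M work on that input. This is my illustration only; no libc implements strstr this way:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Naive quadratic search, counting character comparisons so the
   blowup on a^M b inside a^N b is visible. */
static size_t naive_compares(const char *hay, const char *nee)
{
    size_t hn = strlen(hay), nn = strlen(nee), cmps = 0;
    for (size_t i = 0; i + nn <= hn; i++) {
        size_t j = 0;
        while (j < nn) {
            cmps++;
            if (hay[i + j] != nee[j])
                break;
            j++;
        }
        if (j == nn)
            break;                 /* found a match */
    }
    return cmps;
}

/* builds a^(len-1) b: len-1 'a' bytes followed by one 'b' */
static char *ab_string(size_t len)
{
    char *s = malloc(len + 1);
    memset(s, 'a', len - 1);
    s[len - 1] = 'b';
    s[len] = '\0';
    return s;
}
```

Each failing position matches the whole a-prefix before the needle's trailing 'b' mismatches, so the cost is quadratic in the combined input size, which is what the per-position comparison budget (or two-way itself) is there to prevent.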
> > > > And most of the time the first character will not match, and in cases when it
> > > > does, the second won't most of the time.
> > >
> > > The case you have to worry about is when they do match every single
> > > time, and whether this increases the runtime from linear to quadratic.
> > >
> > Please read mails before replying to them; from the original mail:
> > > On Sun, Aug 25, 2013 at 11:07:58AM +0200, Ondřej Bílka wrote:
> > > > This could be solved by the standard buy-or-rent technique which I employ.
> > > > For any position p there can be only p+64n+512 comparisons until I
> > > > switch to the two-way algorithm, which gives linear complexity.
> That's a reasonable solution, but I think it would be even better to
> integrate your SSE tricks with two-way.
It's not worth doing from my perspective, as the factorization step costs
around 20 cycles per needle byte. In that time I could often determine three
times over that there is no match.
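A rough sketch of that buy-or-rent budget, with libc's strstr standing in for the two-way fallback (the function name, structure, and exact accounting here are illustrative, not the actual patch):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Run the cheap scan, but once total comparison work exceeds the
   linear bound p + 64*n + 512 quoted above, switch to a
   guaranteed-linear search so the worst case stays O(n). */
static const char *budgeted_strstr(const char *hay, const char *nee)
{
    size_t hn = strlen(hay), nn = strlen(nee);
    if (nn == 0)
        return hay;
    size_t work = 0;
    for (size_t pos = 0; pos + nn <= hn; pos++) {
        if (work > pos + 64 * nn + 512)      /* budget exhausted: switch */
            return strstr(hay + pos, nee);   /* stand-in for two-way */
        size_t j = 0;
        while (j < nn && hay[pos + j] == nee[j]) {
            j++;
            work++;
        }
        work++;                    /* the mismatching (or final) probe */
        if (j == nn)
            return hay + pos;
    }
    return NULL;
}
```

The point of the bound's shape is that work grows at most linearly in the progress made, so the cheap scan can only "rent" a constant factor of extra comparisons before it must "buy" the linear algorithm.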