This is the mail archive of the mailing list for the glibc project.
Re: [PATCH v2] Improve performance of strstr
On Wed, Apr 17, 2019 at 12:54:05AM +0000, Wilco Dijkstra wrote:
> Hi Rich,
> > On Tue, Apr 16, 2019 at 11:49:43PM +0000, Wilco Dijkstra wrote:
> > > Hi Rich,
> > > > > Ultimately the 3 key questions are:
> > > > >
> > > > > 1. Is it linear time even on huge inputs?
> > > > > 2. Is it faster than Two-way in the average case?
> > > > > 3. Is it within a small factor even in the worst case?
> > > > >
> > > > > The first 2 were never in doubt, and your result proves 3 as well.
> > > > His result was up to 40x slower, which refutes #3.
> > > No that was incorrectly comparing a fast case with a slow case. You can
> > > easily show 1000x difference between fast and slow cases, but what we're
> > > interested in is how the slow cases compare.
> > No, the 40x was comparing the same input -- that is, there's an input
> > for which your patch makes it 40x slower.
> The same input can be fast for one algorithm and slow on another. Is that
> hard to understand? Like I said it can be 1000 times. And it goes both ways
> so it doesn't matter at all.
> > You can find some inputs which are mildly slow for the current two-way
> > implementation (likely due to a mix of fundamental reasons and
> > implementation flaws), and normalized for size (a poor normalization
> > but whatever), searching for them takes significantly more than 1/40
> > the time that your code takes on its worst-case. This is really not a
> Yes it's very common to get bad factorizations, but every now and again it
> gets one that starts with an uncommon character and it ends up really fast.
> That huge disparity means it's a bad algorithm.
No it doesn't. It's unfortunate that the factorization depends on the
choice of order on the alphabet. This can be mitigated by ordering the
alphabet according to first or last appearance in the needle, and
indeed doing so tends to improve things. But even without that, the
difference between a "good" and "bad" choice is small. When it's not
small, the cause is inefficient code paths in the implementation, not
anything fundamental to the algorithm.
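To make the order-dependence concrete, here is a sketch (illustrative, not glibc's actual code) of the maximal-suffix computation at the heart of Two-way's critical factorization. Running it under the normal and the reversed byte order gives two candidate splits, and remapping the alphabet, e.g. by first appearance in the needle, changes which one wins:

```c
#include <stddef.h>

/* Maximal-suffix computation in the style of Two-way's critical
   factorization step (a sketch, not glibc's code).  Returns the start
   index of the maximal suffix of n[0..len) under the byte order
   (rev = 0) or the reversed byte order (rev = 1), and stores its
   period in *period.  Two-way runs this under both orders and splits
   the needle at the later of the two suffixes. */
static size_t max_suffix(const unsigned char *n, size_t len, int rev,
                         size_t *period)
{
    size_t ms = 0;  /* start of current maximal-suffix candidate */
    size_t j = 1;   /* start of the challenger suffix */
    size_t k = 0;   /* offset while comparing candidate vs. challenger */
    size_t p = 1;   /* period of the candidate */

    while (j + k < len) {
        unsigned char a = n[j + k], b = n[ms + k];
        if (a == b) {
            if (k + 1 == p) { j += p; k = 0; }  /* full period matched */
            else k++;
        } else if (rev ? a < b : a > b) {
            ms = j; j = ms + 1; k = 0; p = 1;   /* challenger wins */
        } else {
            j += k + 1; k = 0; p = j - ms;      /* candidate survives */
        }
    }
    *period = p;
    return ms;
}
```

For "banana" the normal order selects the suffix at index 2 ("nana") while the reversed order selects index 1, so the resulting factorization hinges entirely on the ordering.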
> Yes the implementation is insanely stupid with the large and small needle
> code reversed. For small needles you simply want to use a skip table since
> factorization can't do anything useful anyway.
Arguably, on an arch with vector registers, this is roughly correct
for glibc's large/small cutoff of 32 bytes. However, keep in mind that
against a worst-case input the skip table is useless. The way you make
an efficient strstr for needles that fit in registers is just rotating
in a byte at a time and comparing the whole register. I'm not sure you
can tack the skip table onto this in a way that works well.
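musl already does exactly this for short needles using ordinary integer registers; a sketch in the style of its fourbyte_strstr (names simplified):

```c
#include <stdint.h>

/* Rolling-register match for a 4-byte needle: keep the last 4 haystack
   bytes in a word, shift in one byte per step, and compare the whole
   word against the needle.  Assumes the haystack is at least 4 bytes
   long (the real code checks lengths before dispatching here); the
   same idea scales to vector registers for longer needles. */
static char *fourbyte_search(const unsigned char *h, const unsigned char *n)
{
    uint32_t nw = (uint32_t)n[0] << 24 | n[1] << 16 | n[2] << 8 | n[3];
    uint32_t hw = (uint32_t)h[0] << 24 | h[1] << 16 | h[2] << 8 | h[3];

    for (h += 3; *h && hw != nw; hw = hw << 8 | *++h)
        ;
    return *h ? (char *)h - 3 : 0;
}
```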
> For large needles you can get
> reasonable factorizations and so a skip table actually slows things down.
Unfortunately the memset to make the skip table is somewhat costly for
small needles, in terms of blowing cache lines even if not raw cycles
under an artificial benchmark. musl's version mitigates this with a
bit array marking the valid entries rather than initializing them all,
but it might be better (if you don't mind code size, which glibc
doesn't) to have a different version for needles <256 bytes in which
each skip entry is a single byte.
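A sketch of the bit-array idea (names illustrative, not musl's actual code): setup clears only the small mark array, and lookups fall back to a default shift for bytes that were never entered:

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define WORD_BITS (CHAR_BIT * sizeof(size_t))

/* Bad-character shift table that avoids initializing all 256 entries:
   a bit array records which entries are valid, so setup only clears a
   few words instead of blowing through the whole table's cache lines. */
struct shift_table {
    size_t shift[256];              /* left uninitialized on purpose */
    size_t mark[256 / WORD_BITS];   /* 4 words on LP64: cheap to clear */
};

static void shift_init(struct shift_table *t)
{
    memset(t->mark, 0, sizeof t->mark);
}

static void shift_set(struct shift_table *t, unsigned char c, size_t val)
{
    t->mark[c / WORD_BITS] |= (size_t)1 << (c % WORD_BITS);
    t->shift[c] = val;
}

/* dflt is the shift for bytes not in the needle (typically its length). */
static size_t shift_get(const struct shift_table *t, unsigned char c,
                        size_t dflt)
{
    if (t->mark[c / WORD_BITS] & ((size_t)1 << (c % WORD_BITS)))
        return t->shift[c];
    return dflt;
}
```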
> > meaningful comparison. What's meaningful is that some operations
> > become 40x slower for no good reason except your irrational refusal to
> > improve the implementation of the good algorithm rather than reverting
> > to something backwards.
> A 40x slowdown vs a rare lucky fast case for Two-way is not in any way
It's not a "lucky" fast case. It's the classic pattern of cases that
are awful for O(nm) algorithms. That a non-idiotic algorithm isn't
slow on this is not luck. It's the whole point.
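A toy counter (a hypothetical helper, not taken from either implementation) makes the classic bad case measurable: on a periodic haystack nearly every alignment matches up to the last needle byte, so naive search does about (n-m+1)*m comparisons:

```c
#include <stddef.h>
#include <string.h>

/* Count character comparisons performed by naive quadratic search.
   On periodic inputs this approaches (haystack - needle + 1) * needle,
   the worst case any O(nm) algorithm is vulnerable to; a linear-time
   algorithm never rescans, so it is fast here by construction. */
static size_t naive_comparisons(const char *h, const char *n)
{
    size_t hl = strlen(h), nl = strlen(n), count = 0;

    for (size_t i = 0; i + nl <= hl; i++) {
        for (size_t j = 0; j < nl; j++) {
            count++;
            if (h[i + j] != n[j])
                break;
        }
    }
    return count;
}
```

Searching eight 'a's for "aaab" already costs 5 * 4 = 20 comparisons, and the ratio grows linearly with the needle length.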
> meaningful. Being 11 times faster than the optimized Two-way on the
> average case certainly is.
It's not 11x faster, and the Two-way in glibc is not at all optimized;
it does lots of things in glaringly, gratuitously slow ways.