This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH v2] Improve performance of strstr
On 15/04/2019 21:15, Zack Weinberg wrote:
> On Mon, Apr 15, 2019 at 2:02 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>> Hi Rich,
>>>> Yes, without a reproducible example I can't see what your issue is. You
>>>> can't make it go quadratic because it simply isn't.
>>> Obviously it's not unbounded because you have a (very large) bound on
>>> the size, 256. I can make it do a 256-byte strcmp for nearly every
>>> byte of the input haystack. Maybe because of vectorization on some
>>> targets that's only 16x slower than the current code rather than 256x
>>> slower, but it's still a lot slower.
>> No you can't. It's impossible to make it do a full 256 byte memcmp every
>> character. And bad cases are not bad because of the time spent comparing
>> strings - they are bad because of mispredicted branches. So it's not possible
>> to compare bad cases without benchmarking actual examples on modern
> This discussion has been going in circles for quite some time now.
> Wilco, Rich, I think it would help a lot if you could BOTH write down
> some example needle and haystack pairs that you believe will
> demonstrate significantly improved performance with your preferred
> algorithm, and/or pathologically slow performance with your
> dispreferred algorithm. Even without specific numbers, that will give
> everyone something concrete to argue over, at least.
since the new algorithm tries to look at the last two chars first
i'd say a possible bad case for it is
needle = 250*"a" + "abbbaa"
haystack = (250*"a" + "bbbbaa") * 1000
(256 is the longest needle for the new algo, checking in a 256k
haystack should be large enough to see matching performance
instead of setup time)
i think this should be close to worst case, but i haven't done
a detailed analysis, the regression with the new algorithm is
16.0x slower on Cortex-A72
17.8x slower on Cortex-A57
(for a fair comparison, the worst-case of the new code should
be compared against the worst-case of the old code as well as
comparing over common inputs for the average case)
(the claimed average case improvement is 3.7x on Cortex-A72)
if we are afraid the worst-case will be used for denial of
service attack, then i think such slowdown is not enough to
cause problems (and requires the control of both haystack and
if we are afraid of hitting bad cases in practice, then the
heuristic tweaks in the new matching logic should reduce the
probability of bad cases with real needle/haystack. (but it
does not prevent them, so on interactive systems one might
occasionally experience larger latency than before)