This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH v2] Improve performance of strstr
On 16/04/2019 17:40, Szabolcs Nagy wrote:
> On 15/04/2019 21:15, Zack Weinberg wrote:
>> On Mon, Apr 15, 2019 at 2:02 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>>> Hi Rich,
>>>>> Yes, without a reproducible example I can't see what your issue is. You
>>>>> can't make it go quadratic because it simply isn't.
>>>> Obviously it's not unbounded because you have a (very large) bound on
>>>> the size, 256. I can make it do a 256-byte strcmp for nearly every
>>>> byte of the input haystack. Maybe because of vectorization on some
>>>> targets that's only 16x slower than the current code rather than 256x
>>>> slower, but it's still a lot slower.
>>> No you can't. It's impossible to make it do a full 256 byte memcmp every
>>> character. And bad cases are not bad because of the time spent comparing
>>> strings - they are bad because of mispredicted branches. So it's not possible
>>> to compare bad cases without benchmarking actual examples on modern
>> This discussion has been going in circles for quite some time now.
>> Wilco, Rich, I think it would help a lot if you could BOTH write down
>> some example needle and haystack pairs that you believe will
>> demonstrate significantly improved performance with your preferred
>> algorithm, and/or pathologically slow performance with your
>> dispreferred algorithm. Even without specific numbers, that will give
>> everyone something concrete to argue over, at least.
> since the new algorithm tries to look at the last two chars first
> i'd say a possible bad case for it is
> needle = 250*"a" + "abbbaa"
> haystack = (250*"a" + "bbbbaa") * 1000
> (256 is the longest needle for the new algo, checking in a 256k
> haystack should be large enough to see matching performance
> instead of setup time)
> i think this should be close to worst case, but i haven't done
> a detailed analysis, the regression with the new algorithm is
> 16.0x slower on Cortex-A72
> 17.8x slower on Cortex-A57
scratch that, with
needle = 248*"a" + "abbbbaaa"
haystack = (248*"a" + "bbbbbaaa") * 1000
37.8x slower on Cortex-A72
40.0x slower on Cortex-A57
i didn't expect such difference compared to the previous case,
i guess i would have to do a more detailed analysis.
> (for a fair comparison, the worst-case of the new code should
> be compared against the worst-case of the old code as well as
> comparing over common inputs for the average case)
> (the claimed average case improvement is 3.7x on Cortex-A72)
> if we are afraid the worst-case will be used for denial of
> service attack, then i think such slowdown is not enough to
> cause problems (and requires the control of both haystack and
> if we are afraid of hitting bad cases in practice, then the
> heuristic tweaks in the new matching logic should reduce the
> probability of bad cases with real needle/haystack. (but it
> does not prevent them, so on interactive systems one might
> occasionally experience larger latency than before)