This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH v2] Improve performance of strstr


On Wed, Apr 17, 2019 at 02:24:19PM +0000, Szabolcs Nagy wrote:
> On 16/04/2019 21:56, Rich Felker wrote:
> > On Tue, Apr 16, 2019 at 06:07:37PM +0000, Szabolcs Nagy wrote:
> >> On 16/04/2019 18:01, Szabolcs Nagy wrote:
> >>> On 16/04/2019 17:40, Szabolcs Nagy wrote:
> >>>> On 15/04/2019 21:15, Zack Weinberg wrote:
> >>>>> On Mon, Apr 15, 2019 at 2:02 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> >>>>>> Hi Rich,
> >>>>> ...
> >>>>>>>> Yes, without a reproducible example I can't see what your issue is. You
> >>>>>>>> can't make it go quadratic because it simply isn't.
> >>>>>>>
> >>>>>>> Obviously it's not unbounded because you have a (very large) bound on
> >>>>>>> the size, 256. I can make it do a 256-byte strcmp for nearly every
> >>>>>>> byte of the input haystack. Maybe because of vectorization on some
> >>>>>>> targets that's only 16x slower than the current code rather than 256x
> >>>>>>> slower, but it's still a lot slower.
> >>>>>>
> >>>>>> No, you can't. It's impossible to make it do a full 256-byte memcmp at
> >>>>>> every character. And bad cases are not bad because of the time spent
> >>>>>> comparing strings - they are bad because of mispredicted branches. So
> >>>>>> it's not possible to compare bad cases without benchmarking actual
> >>>>>> examples on modern CPUs.
> >>>>>
> >>>>> This discussion has been going in circles for quite some time now.
> >>>>>
> >>>>> Wilco, Rich, I think it would help a lot if you could BOTH write down
> >>>>> some example needle and haystack pairs that you believe will
> >>>>> demonstrate significantly improved performance with your preferred
> >>>>> algorithm, and/or pathologically slow performance with your
> >>>>> dispreferred algorithm.  Even without specific numbers, that will give
> >>>>> everyone something concrete to argue over, at least.
> >>>>
> >>>> since the new algorithm tries to look at the last two chars first,
> >>>> i'd say a possible bad case for it is
> >>>>
> >>>> needle   =  250*"a" + "abbbaa"
> >>>> haystack = (250*"a" + "bbbbaa") * 1000
> >>>>
> >>>> (256 is the longest needle for the new algo, checking in a 256k
> >>>> haystack should be large enough to see matching performance
> >>>> instead of setup time)
> >>>>
> >>>> i think this should be close to the worst case, but i haven't done
> >>>> a detailed analysis; the regression with the new algorithm is
> >>>>
> >>>> 16.0x slower on Cortex-A72
> >>>> 17.8x slower on Cortex-A57
> >>>
> >>> scratch that, with
> >>>
> >>> needle   =  248*"a" + "abbbbaaa"
> >>> haystack = (248*"a" + "bbbbbaaa") * 1000
> >>>
> >>> i get
> >>>
> >>> 37.8x slower on Cortex-A72
> >>> 40.0x slower on Cortex-A57
> >>
> >> this is a good case for twoway, so we need a twoway worst case too
> >> for a real comparison: comparing the worst for new vs worst for
> >> twoway i've seen so far, new is
> >>
> >> 4.5x slower on Cortex-A72
> >> 2.7x slower on Cortex-A57
> >>
> >> but there is no guarantee that my inputs are near the real worst
> >> cases, it's hard to tell (and clearly uarch matters too).
> >>
> >> (i wanted to avoid diving into the algorithm details to find
> >> worst cases)
> > 
> > There's been a series of mitigations analogous to someone 'fixing' an
> > SQLi vuln by filtering for SQL keywords or single-quote characters in
> > the input rather than actually fixing the bug, burying the issue and
> > making it harder to find the worst cases that still exist. That's why
> > I'm frustrated with this whole topic/proposal.
> 
> in this case a large slowdown is not a real vulnerability.

I agree the bounded magnitude means it's not a vulnerability; rather,
it's just a quality-of-implementation (QoI) regression. I made the
analogy with SQLi not to imply that it's a vuln, but to critique the
practice of burying a problem by making it hard to work out which
specific input triggers it, rather than actually fixing it.
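
For concreteness, here is a minimal reproducer for the construction
Szabolcs posted above (a sketch; the repetition count and the
clock()-based timing are only for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Adversarial inputs from the quoted message:
 *   needle   =  248*"a" + "abbbbaaa"
 *   haystack = (248*"a" + "bbbbbaaa") * 1000
 * Every 256-byte block of the haystack almost matches the needle,
 * forcing a long partial comparison before each mismatch. */
int main(void)
{
    enum { REPS = 1000, BLK = 256 };
    char needle[BLK + 1], block[BLK + 1];
    char *haystack = malloc((size_t)REPS * BLK + 1);
    if (!haystack) return 1;

    memset(needle, 'a', 248);
    strcpy(needle + 248, "abbbbaaa");
    memset(block, 'a', 248);
    strcpy(block + 248, "bbbbbaaa");

    for (int i = 0; i < REPS; i++)
        memcpy(haystack + (size_t)i * BLK, block, BLK);
    haystack[(size_t)REPS * BLK] = '\0';

    clock_t t0 = clock();
    char *p = strstr(haystack, needle);   /* no match by construction */
    clock_t t1 = clock();

    printf("match=%p time=%.3f ms\n", (void *)p,
           (t1 - t0) * 1000.0 / CLOCKS_PER_SEC);
    free(haystack);
    return 0;
}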

> i think even doing memcmp at every byte position is not a
> denial of service concern if the needle is <= 256 bytes.
> (maybe we should use a lower limit? <= 64?)
> 
> so only 'fixing' bad cases statistically is a valid approach.
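
For scale: a 256-byte memcmp at every position of a 1 MiB haystack is
about 256 * 2^20, roughly 2.7e8 byte comparisons, versus about 10^6
for a linear-time scan. That is bounded, so not DoS, but it is still
a ~256x constant-factor hit; a <= 64 limit would only shrink it 4x.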

If you recall, users were upset about glibc's math functions being
tens or hundreds of times slower on particular inputs because of
correct rounding. AIUI these were not inputs found by deliberate
attempts at DoS; they were just hit randomly in the real world.
Adding such serious input-dependent variance in performance is not a
good thing.

> > Given a bound on the needle size it's used for, this is fundamentally
> > a finite problem (albeit a pretty damn large one), so there likely are
> > "heuristic" solutions sufficiently complete to cover all cases.
> > However, even if one can be found, who wants to look at or validate
> > the argument/proof that it covers them all? Or run the fuzzers looking
> > for mistakes? Etc.
> 
> a mistake is not a disaster in this case; we just want to

Mistakes can easily turn into correctness bugs too. I'm not clear on
whether the current patch version has a bug or not, but it looks
plausible that it has false positives on hash collisions. When you
pile on heuristic mitigations like this, each one adds bug surface.
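
To illustrate the failure mode (a sketch with a toy fingerprint, not
the patch's actual code): if candidate positions are filtered by a
small hash of part of the needle, a hash match must only ever be a
filter, always followed by full verification, or collisions become
false matches:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy 8-bit fingerprint: XOR of the bytes. Many distinct strings
 * collide, which is exactly the hazard discussed above. */
static uint8_t fp(const unsigned char *s, size_t n)
{
    uint8_t h = 0;
    while (n--) h ^= *s++;
    return h;
}

/* Hypothetical filtered search. Returns the first candidate whose
 * fingerprint matches AND that survives the full memcmp. Dropping
 * the memcmp would report hash collisions as matches -- a
 * correctness bug, not just a slowdown. */
static const char *find(const char *hay, size_t hn,
                        const char *nee, size_t nn)
{
    if (nn == 0) return hay;
    if (nn > hn) return NULL;
    uint8_t want = fp((const unsigned char *)nee, nn);
    for (size_t i = 0; i + nn <= hn; i++) {
        if (fp((const unsigned char *)hay + i, nn) != want)
            continue;                       /* cheap filter */
        if (memcmp(hay + i, nee, nn) == 0)  /* mandatory verification */
            return hay + i;
    }
    return NULL;
}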

> avoid hitting 'memcmp per byte' behaviour on practical
> workloads.

Especially for memmem, all workloads are "practical".

> of course 'practical workload' is hard to reason about,
> which is why an algorithm that's provably better on all
> inputs would be preferred.
> 
> > The right way forward here, if there's actually any performance issue
> > to care about improving on, is to study why the current two-way
> > implementation is slower than expected and make changes that improve
> > performance in all cases.
> 
> i agree that would be nicer, but that's harder to do.

Wilco has already highlighted a few bad things it's doing that could
be improved without much work. After making those improvements, we
could then evaluate whether there are any cases where the proposed
quadratic version has a compelling advantage. I think what we'd find
is that, for the needle sizes where quadratic might win, it would be
better on average, and probably better for all inputs, to do a
rolling compare (like the current 2- to 4-byte versions in musl)
using a vector register (up to whatever vector size the machine has).
On machines without wide vector registers, redundant strcmp is going
to be really slow, and two-way is going to win even at small needle
sizes.
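
For reference, a sketch of the rolling compare idea at the 2-byte
size, modeled on musl's twobyte_strstr (the caller is assumed to have
already checked that both strings are at least 2 bytes long):

#include <stdint.h>

/* Keep the last two haystack bytes packed in a 16-bit value and
 * compare against the packed needle once per input byte: no
 * backtracking and no per-position memcmp. */
static char *twobyte_strstr(const unsigned char *h, const unsigned char *n)
{
    uint16_t nw = n[0] << 8 | n[1];
    uint16_t hw = h[0] << 8 | h[1];
    for (h++; *h && hw != nw; hw = hw << 8 | *++h);
    return *h ? (char *)h - 1 : 0;
}

A vector version would do the same with, e.g., 16 trailing bytes in a
vector register and one compare per input byte; past the vector
width, two-way takes over.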

Rich

