Bug 12100 - QoI regression: strstr() slowed from O(n) to O(n^2) on SSE4 machines
QoI regression: strstr() slowed from O(n) to O(n^2) on SSE4 machines
Status: NEW
Product: glibc
Classification: Unclassified
Component: libc
2.11
: P2 normal
: 2.17
Assigned To: Not yet assigned to anyone
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2010-10-06 18:10 UTC by Eric Blake
Modified: 2012-12-11 11:48 UTC (History)
8 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed: 2010-10-07 03:20:01


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Eric Blake 2010-10-06 18:10:46 UTC
In glibc 2.9, strstr() was rewritten from a quadratic to a linear algorithm.  However, in glibc 2.11, an SSE4 assembly version was added which speeds up searches over short needles, but re-introduces the quadratic behavior of long needles.

The best approach would be to bound the SSE4 brute-force assembly implementation to a maximum needle length (perhaps 8 or 32 bytes), and defer to the linear algorithm for long needles.  This has the benefits of avoiding the startup overhead inherent in the linear algorithm (the overhead is proportionately worse the shorter the needle is), while avoiding the quadratic scaling for large needles (which are statistically less common, but all the more important to avoid scaling effects).
Comment 1 Ulrich Drepper 2010-10-07 03:20:01 UTC
I don't like distinguishing based on the length of the needles.  But I might consider it as an interim solution.

Regardless, this is something for somebody who cares about these cases to write a patch for.  In real life this really isn't important.  Change the status when you come up with a patch.
Comment 2 Rich Felker 2012-05-18 03:10:45 UTC
If this bug still exists, it's valid, and should be fixed simply by removing the SSE4 "optimization".

It's also worth noting, since Drepper mentioned the ugliness of special-casing, that the special-casing of "short" (<32 byte) needles in the current two-way code simply hurts performance. It's intended to avoid the 1-2 kb memset-zero at startup, but even with this startup overhead cost, performance is still better with the "long needle" variant of the code. I did a lot of performance testing developing musl's strstr, and concluded that the optimal algorithm for very-short needles (<=4 bytes) is rolling perfect hash in a machine word, and that for longer needles, twoway with the bad character table always wins. I also avoided the 1-2 kb memset by keeping a 256-bit table of which entries in the 1-2 kb table are valid. This eliminates the startup overhead at the expense of a few cycles in the inner loop.
Comment 3 Carlos O'Donell 2012-06-28 13:46:25 UTC
We want this fixed in 2.17.

We expect that Maxim[1] or Ondrej's[2] patches will fix this.

[1] http://sourceware.org/ml/libc-alpha/2012-05/msg01228.html

[2] http://sourceware.org/ml/libc-alpha/2012-06/msg00027.html
Comment 4 Maxim Kuvyrkov 2012-06-28 22:29:51 UTC
As things stand, my patches ([1] quoted above) won't address the O(n^2) -> O(n) problem for SSE4 machines -- they fix BZ #11607 instead.

Ondrej's patches, on the other hand, are supposed to fix the O(n^2) problem for x86, but they haven't been posted yet.
Comment 5 Eric Blake 2012-06-28 22:41:29 UTC
(In reply to comment #3)
> We want this fixed in 2.17.
> 
> We expect that Maxim[1] or Ondrej's[2] patches will fix this.
> 
> [1] http://sourceware.org/ml/libc-alpha/2012-05/msg01228.html
> 
> [2] http://sourceware.org/ml/libc-alpha/2012-06/msg00027.html

Note that [2] was just a summary of Ondrej's benchmarking; searching for an actual patch of his turns up this link:

http://sourceware.org/ml/libc-help/2011-11/msg00011.html
Comment 6 Ondrej Bilka 2012-06-28 23:54:40 UTC
I will submit a patch after freeze ends. I made several performance
improvements from what I suggested.
On Thu, Jun 28, 2012 at 10:41:29PM +0000, eblake at redhat dot com wrote:
> http://sourceware.org/bugzilla/show_bug.cgi?id=12100
> 
> --- Comment #5 from Eric Blake <eblake at redhat dot com> 2012-06-28 22:41:29 UTC ---
> (In reply to comment #3)
> > We want this fixed in 2.17.
> > 
> > We expect that Maxim[1] or Ondrej's[2] patches will fix this.
> > 
> > [1] http://sourceware.org/ml/libc-alpha/2012-05/msg01228.html
> > 
> > [2] http://sourceware.org/ml/libc-alpha/2012-06/msg00027.html
> 
> Note that [2] was just a summary of Ondrej's benchmarking; searching for an
> actual patch of his turns up this link:
> 
> http://sourceware.org/ml/libc-help/2011-11/msg00011.html
> 
> -- 
> Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
> ------- You are receiving this mail because: -------
> You are on the CC list for the bug.
Comment 7 Eric Blake 2012-06-29 00:01:42 UTC
(In reply to comment #6)
> I will submit a patch after freeze ends. I made several performance
> improvements from what I suggested.

Can you please submit the patch now, so we can at least look at it, even if it is not polished?  I know last month the list said we might wait until after 2.17 to review it, but...

> > (In reply to comment #3)
> > > We want this fixed in 2.17.

...this statement makes me believe that people are starting to realize the severity of the regression, and want it fixed sooner.
Comment 8 Rich Felker 2012-06-29 00:09:10 UTC
The regression should be fixed immediately by disabling the new SSE code entirely and simply using the fast C implementation until someone can produce SSE code without the O(n^2) pathology. In reality, SSE will not help you for long needles (where 2way algo excels, and even approaches O(n/m) in most real-world cases) so the SSE code should probably only be invoked when the needle is sufficiently short.

As for the benchmarks, they seem to be from when the SSE code was first introduced, and serve as propaganda showing how fast it is in best-case situations. This is a rather useless benchmark. A real benchmark of strstr will test things like searching for {a}^n in {{a}^{n-1}b}^n{a}^n as well as more complicated periodicity cases that stress-test needle factorization. Testing the best-case is never a valid benchmark; you have to test worst-case.
Comment 9 Liubov Dmitrieva 2012-06-29 12:57:25 UTC
SSE42 strstr is quadratic at worst but still better in general cases than current 2way algo.

If new Ondrej's SSE algorithm or current 2way version with applied path [1] will be better for real life cases then SSE42 strstr should be eliminated.

But I agree with Ulrich that for strstr long needles is not important in real life. People use memmem instead.
Comment 10 Rich Felker 2012-06-29 13:06:46 UTC
"People will use memmem instead" is definitely false, for two reasons.

1. memmem is not part of the C language; it's a GNU function. Portable code will not be using it at all.

2. If the input data is a C string, using memmem requires first calling strlen on both the needle and haystack. If the haystack is very large but the needle is expected to be found near the beginning, this increases runtime by a huge margin.

As usual, Drepper was wrong. A quadratic implementation of strstr is just pathologically bad and is subject to DoS in real-world cases (think of a webserver doing text search on user-generated content (perhaps a forum) on behalf of clients; a single client's requests, without any DDoS network, can easily overwhelm the machine with a few carefully crafted needles and haystacks).
Comment 11 Liubov Dmitrieva 2012-06-29 15:34:07 UTC
Ok,

This patch removes quadratic SSE42 strstr for x86_64 and for x86_32

http://sourceware.org/ml/libc-alpha/2012-06/msg00788.html