This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Use Quicksearch in strstr


Hi,

Rich Felker wrote:
> Which profiles? The common cases (where significant-length prefix
> matches are usually rare) or the worst cases?

Both - the profiles vary between the different code paths, but I don't see
long matches for good or bad cases. I made the slow cases significantly
slower by using a haystack of random 'a' and 'b' chars. This causes lots of
branch mispredictions by alternating between the slow paths that search
for the first character.

> For needles that factor
> poorly (long right factor), the latter breaks down to essentially
> repeated disjoint memcmp's of the right factor against the whole
> haystack, in successive order, starting over at the beginning of the
> needle's right factor whenever there's a mismatch. That sounds highly
> "memcmp-bound" to me. If worst cases are something you care about
> improving vs the status quo, this is worth pursuing, but if not, okay.

Eventhough it does only 1 character per iteration, finding the first matching
character is more complex and slower, so I haven't seen a case where it is
the bottleneck.

> I agree that for common cases there are potentially much better
> improvements to be made elsewhere. strchr is of course a big help when
> the character to be searched is rare but useless (and pessimizes from
> call/setup overhead unless you can inline it) if it's frequent. It's
> probably better to search for the first up-to-N characters of the
> right factor, where N is a word/vector size, rotating bytes of the
> haystack in/out of a running word/vector for comparison (musl's strstr
> does this for needle sizes up to 4 already, and some data has shown
> it's usually better than two-way a bit further on 64-bit systems).

Yes searching for a fixed size substring would work better indeed since
you get far fewer false matches than when searching for a single char.
I don't think brute force word-based matching is fast enough. I tried a
variant using unaligned accesses which are much faster. However
Quicksearch still beats it. The key is you can use any fast algorithm given
the substring would be fixed-size, so linear time is guaranteed.

> Performing the critical factorization via the maximal suffix process
> described in the two-way paper is also suboptimal, and dependent on
> the actual order of the alphabet. For example hello factors as
> [hell][o] but hellk factors as [he][llk] despite them being
> structurally isomorphic under a renaming of the alphabet. In some
> cases, optimal factorizations can be determined by short-circuiting
> past the MS process. For example if the loop setting up the bad char
> table determines that the final character of the needle is the only
> occurrance of that character, [l-1][1] is a critical factorization and
> p=l. There are larger classes of this type of shortcut but I'm not
> sure what's efficient/reasonable-tradeoff to do. 

Indeed, the critical factorization doesn't appear optimal but I haven't
seen any research that tries to improve it. There must be better ways
of doing it since all you want is a short fixed-size substring with the
right properties. However finding an algorithm that does this efficiently
might not be trivial.

> Searching for a short
> right factor is appealing though because it means you get to skip by
> the whole period as soon as there's any mismatch on the left factor.

Indeed. And apart from highly repetitive needles it would allow for a
simple memcmp for matching the needle.

Wilco



Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]