PATCH] Improve performance of memmem

Tue Dec 18 20:32:00 GMT 2018

Hi Eric,

>> +    for (i = ne_len-1; i != 0; i--)
>
> No spacing around the operator?

I'll fix that. The hash2 macro could use some spaces too.

>> +/* Small and efficient memmem implementation.  */
>
> I'd strike the word 'efficient' in this comment, as...

> ...this is the naive quadratic implementation.  But it does certainly 
> qualify as the algorithm to use for someone wanting a compilation as 
> small as possible, and who doesn't care about speed (most likely because 
> their needles are unlikely to be large enough to notice the effects of 
> quadratic slowdowns).

However the interesting thing is that this naive implementation is actually
faster than the existing Two-Way implementation on all typical inputs: it is
> 50% faster for any needle up to 32 bytes on any haystack size. So it
would have been better if memmem used the naive implementation for
small needles. The quadratic behaviour only becomes problematic when
needles are like 4KBytes.

I also tried whether memchr/memcmp would be better but the performance
is very similar (and that was using fast assembler implementations which
are disabled when building for size). The only feasible improvement would
be to special case small needles.

> Why is the super-small needle cutoff at 2, rather than 4 or 8 by taking 
> advantage of vectorized searching?

The algorithm is faster for these cases except for very short haystacks
(the break-even is between 64-128 chars), so I've left it out for simplicity.
It may be feasible to improve small needle/small haystack cases using
unaligned loads or by adding a new (assembler) function which searches
for 2 characters, but those are possible future tweaks.

> +   up to size 256 use a novel modified Horspool algorithm.  It hashes pairs
> +   of characters to quickly skip past mismatches.
> +   By limiting the needle length to 256, the shift table can be reduced to 8
> +   bits per entry, lowering preprocessing overhead and minimizing cache effects.
> +   The limit also implies worst-case performance is linear.  On long needles a
> +   self-adapting filtering step is used to quickly determine mismatches.
> +   Needles larger than 256 characters use the linear-time Two-Way algorithm.  */

>> +  memset (shift, 0, sizeof (shift));
>
> I don't know if we have a preference for 'sizeof shift' rather than 
> 'sizeof (shift)'.

Of the 1677 occurences of sizeof in newlib, 984 are "sizeof (x)", 539 are
"sizeof(x)" (missing space) and 74 are sizeof x, so I guess parenthesis is
the preferred spelling.

Cheers,
Wilco