This is the mail archive of the
newlib@sourceware.org
mailing list for the newlib project.
Re: [PATCH] Improve performance of memmem
On 12/18/18 11:01 AM, Wilco Dijkstra wrote:
This patch significantly improves performance of memmem using a novel
modified Horspool algorithm. Needles up to size 256 use a bad-character
table indexed by hashed pairs of characters to quickly skip past mismatches.
Long needles use a self-adapting filtering step to avoid comparing the whole
needle repeatedly.
By limiting the needle length to 256, the shift table only requires 8 bits
per entry, lowering preprocessing overhead and minimizing cache effects.
This limit also implies worst-case performance is linear.
Small needles up to size 2 use a dedicated linear search. Very long needles
use the Two-Way algorithm (to avoid increasing stack size, inlining is now disabled).
The performance gain is 6.6 times on English text on AArch64 using random
needles with average size 8 (this is even faster than the recently improved strstr
algorithm, so I'll update that in the near future).
The size-optimized memmem has also been rewritten from scratch to get a
2.7x performance gain.
Tested against GLIBC testsuite and randomized tests.
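(For readers following along, the dedicated small-needle search mentioned above might look something like the sketch below. This is illustrative code under my own assumptions, not the patch itself; `memmem2` is an invented name. The idea is that a 2-byte needle needs no shift table: find the first byte with memchr, then check the byte after it.)

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of a dedicated linear search for a 2-byte needle:
   memchr locates candidate first bytes, a single compare checks the
   second.  Returns NULL if hs_len < 2, so hs[1] is always in bounds.  */
static void *
memmem2 (const void *haystack, size_t hs_len, const unsigned char ne[2])
{
  const unsigned char *hs = haystack;
  const unsigned char *end;

  if (hs_len < 2)
    return NULL;
  end = hs + hs_len - 1;   /* Last byte; valid starts are hs..end-1.  */

  while ((hs = memchr (hs, ne[0], end - hs)) != NULL)
    {
      if (hs[1] == ne[1])
	return (void *) hs;
      hs++;
    }
  return NULL;
}
```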
--
+
+#if defined(PREFER_SIZE_OVER_SPEED) || defined(__OPTIMIZE_SIZE__)
+
+/* Small and efficient memmem implementation. */
I'd strike the word 'efficient' in this comment, as...
+void *
+memmem (const void *haystack, size_t hs_len, const void *needle, size_t ne_len)
+{
+ const char *hs = haystack;
+ const char *ne = needle;
+
+ if (ne_len == 0)
+ return (void *)hs;
+ int i;
+ int c = ne[0];
+ const char *end = hs + hs_len - ne_len;
+
+ for ( ; hs <= end; hs++)
+ {
+ if (hs[0] != c)
+ continue;
+ for (i = ne_len-1; i != 0; i--)
No spacing around the operator?
+ if (hs[i] != ne[i])
+ break;
+ if (i == 0)
+ return (void *)hs;
...this is the naive quadratic implementation. But it certainly
qualifies as the algorithm to use for someone who wants as small a
compilation as possible and doesn't care about speed (most likely
because their needles are too short to notice the effects of quadratic
slowdown).
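(For intuition on that quadratic worst case: with self-similar inputs such as a run of 'a' bytes and a needle of 'a's ending in 'b', almost every start position re-scans nearly the whole needle. Here is a small instrumented sketch, not part of the patch, that counts the inner-loop character comparisons the naive scan performs. `naive_compare_count` is an invented name.)

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count character comparisons made by a naive forward scan.  With an
   all-'a' haystack of length n and a needle of m-1 'a's plus 'b',
   every start position compares all m needle bytes: (n-m+1)*m total.  */
static size_t
naive_compare_count (const char *hs, size_t hs_len,
		     const char *ne, size_t ne_len)
{
  size_t count = 0;
  for (size_t j = 0; j + ne_len <= hs_len; j++)
    {
      size_t i;
      for (i = 0; i < ne_len; i++)
	{
	  count++;
	  if (hs[j + i] != ne[i])
	    break;
	}
      if (i == ne_len)
	return count;          /* Match found at position j.  */
    }
  return count;
}
```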
+#else
-#if !defined(PREFER_SIZE_OVER_SPEED) && !defined(__OPTIMIZE_SIZE__)
# define RETURN_TYPE void *
# define AVAILABLE(h, h_l, j, n_l) ((j) <= (h_l) - (n_l))
# include "str-two-way.h"
-#endif
+#define hash2(p) ((size_t)((p)[0]-((p)[-1]<<3)) % sizeof (shift))
+
+
+/* Fast memmem algorithm with guaranteed linear-time performance.
+ Small needles up to size 2 use a dedicated linear search. Longer needles
Why is the super-small needle cutoff at 2, rather than 4 or 8 by taking
advantage of vectorized searching?
+ up to size 256 use a novel modified Horspool algorithm. It hashes pairs
+ of characters to quickly skip past mismatches.
+ By limiting the needle length to 256, the shift table can be reduced to 8
+ bits per entry, lowering preprocessing overhead and minimizing cache effects.
+ The limit also implies worst-case performance is linear. On long needles a
+ self-adapting filtering step is used to quickly determine mismatches.
+ Needles larger than 256 characters use the linear-time Two-Way algorithm. */
I can't promise that I tried to completely understand the algorithm, but
the approach makes sense.
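(To make the approach concrete for other readers: below is my own hedged sketch of a hashed-pair Horspool search for needles of length 2..256. It is not the patch's exact code and omits the long-needle filtering step; `HASH2` and `horspool2` are illustrative names, and the pair hash mirrors the quoted hash2 macro. The shift table records, per hash bucket, the highest needle index whose trailing character pair falls in that bucket; a zero entry means no needle pair hashes there, so the window can skip by nearly a whole needle. Hash collisions only shrink shifts or trigger an extra memcmp, so the search stays correct.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hash two adjacent bytes into a 256-entry table (mirrors hash2).  */
#define HASH2(p) (((size_t) (p)[0] - ((size_t) (p)[-1] << 3)) % 256)

/* Sketch of a hashed-pair Horspool search, ne_len in [2, 256].  */
static void *
horspool2 (const unsigned char *hs, size_t hs_len,
	   const unsigned char *ne, size_t ne_len)
{
  uint8_t shift[256];
  size_t m1 = ne_len - 1;

  if (hs_len < ne_len || ne_len < 2 || ne_len > 256)
    return NULL;

  /* shift[h] = highest index i (1..m1) whose pair ne[i-1],ne[i]
     hashes to h; 0 means no needle pair hashes to h.  */
  memset (shift, 0, sizeof (shift));
  for (size_t i = 1; i < m1; i++)
    shift[HASH2 (ne + i)] = i;
  /* Skip to use after the final pair matched but the needle didn't:
     distance back to the previous pair with the same hash.  */
  size_t shift1 = m1 - shift[HASH2 (ne + m1)];
  shift[HASH2 (ne + m1)] = m1;

  for (size_t j = 0; j <= hs_len - ne_len; )
    {
      size_t tmp = shift[HASH2 (hs + j + m1)];
      if (tmp == 0)
	{
	  j += m1;		/* Pair absent from needle: long skip.  */
	  continue;
	}
      if (tmp != m1)
	{
	  j += m1 - tmp;	/* Align candidate pair with window end.  */
	  continue;
	}
      if (memcmp (hs + j, ne, ne_len) == 0)
	return (void *) (hs + j);
      j += shift1;		/* Verified mismatch: safe skip >= 1.  */
    }
  return NULL;
}
```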
+ /* Use Two-Way algorithm for very long needles. */
+ if (__builtin_expect (ne_len > 256, 0))
+ return two_way_long_needle (hs, hs_len, ne, ne_len);
- /* Larger code size, but guaranteed linear performance. */
+ uint8_t shift[256];
+ size_t tmp, shift1;
+ size_t m1 = ne_len - 1;
+ size_t offset = 0;
- /* Sanity check, otherwise the loop might search through the whole
- memory. */
- if (haystack_len < needle_len)
- return NULL;
+ /* Initialize bad character shift hash table. */
+ memset (shift, 0, sizeof (shift));
I don't know if we have a preference for 'sizeof shift' rather than
'sizeof (shift)'.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org