This is the mail archive of the
newlib@sourceware.org
mailing list for the newlib project.
Re: [PATCH] Improve performance of memmem
On 12/18/18 11:01 AM, Wilco Dijkstra wrote:
This patch significantly improves performance of memmem using a novel
modified Horspool algorithm. Needles up to size 256 use a bad-character
table indexed by hashed pairs of characters to quickly skip past mismatches.
Long needles use a self-adapting filtering step to avoid comparing the whole
needle repeatedly.
By limiting the needle length to 256, the shift table only requires 8 bits
per entry, lowering preprocessing overhead and minimizing cache effects.
This limit also implies worst-case performance is linear.
Small needles up to size 2 use a dedicated linear search. Very long needles
use the Two-Way algorithm (to avoid increasing stack size, inlining is now disabled).
The performance gain is 6.6 times on English text on AArch64 using random
needles with average size 8 (this is even faster than the recently improved strstr
algorithm, so I'll update that in the near future).
The size-optimized memmem has also been rewritten from scratch to get a
2.7x performance gain.
Tested against GLIBC testsuite and randomized tests.
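(For readers following along, the dedicated small-needle search mentioned above might look something like the sketch below. This is illustrative code under my own assumptions, not the patch itself; `memmem2` is an invented name. The idea is that a 2-byte needle needs no shift table: find the first byte with memchr, then check the byte after it.)

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of a dedicated linear search for a 2-byte needle:
   memchr locates candidate first bytes, a single compare checks the
   second.  Returns NULL if hs_len < 2, so hs[1] is always in bounds.  */
static void *
memmem2 (const void *haystack, size_t hs_len, const unsigned char ne[2])
{
  const unsigned char *hs = haystack;
  const unsigned char *end;

  if (hs_len < 2)
    return NULL;
  end = hs + hs_len - 1;   /* Last byte; valid starts are hs..end-1.  */

  while ((hs = memchr (hs, ne[0], end - hs)) != NULL)
    {
      if (hs[1] == ne[1])
	return (void *) hs;
      hs++;
    }
  return NULL;
}
```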
--
+
+#if defined(PREFER_SIZE_OVER_SPEED) || defined(__OPTIMIZE_SIZE__)
+
+/* Small and efficient memmem implementation. */
I'd strike the word 'efficient' in this comment, as...
+void *
+memmem (const void *haystack, size_t hs_len, const void *needle, size_t ne_len)
+{
+ const char *hs = haystack;
+ const char *ne = needle;
+
+ if (ne_len == 0)
+ return (void *)hs;
+ int i;
+ int c = ne[0];
+ const char *end = hs + hs_len - ne_len;
+
+ for ( ; hs <= end; hs++)
+ {
+ if (hs[0] != c)
+ continue;
+ for (i = ne_len-1; i != 0; i--)
No spacing around the operator?
+ if (hs[i] != ne[i])
+ break;
+ if (i == 0)
+ return (void *)hs;
...this is the naive quadratic implementation. But it certainly
qualifies as the algorithm to use for someone who wants as small a
compilation as possible and doesn't care about speed (most likely
because their needles are too short to notice the effects of quadratic
slowdown).
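(For intuition on that quadratic worst case: with self-similar inputs such as a run of 'a' bytes and a needle of 'a's ending in 'b', almost every start position re-scans nearly the whole needle. Here is a small instrumented sketch, not part of the patch, that counts the inner-loop character comparisons the naive scan performs. `naive_compare_count` is an invented name.)

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count character comparisons made by a naive forward scan.  With an
   all-'a' haystack of length n and a needle of m-1 'a's plus 'b',
   every start position compares all m needle bytes: (n-m+1)*m total.  */
static size_t
naive_compare_count (const char *hs, size_t hs_len,
		     const char *ne, size_t ne_len)
{
  size_t count = 0;
  for (size_t j = 0; j + ne_len <= hs_len; j++)
    {
      size_t i;
      for (i = 0; i < ne_len; i++)
	{
	  count++;
	  if (hs[j + i] != ne[i])
	    break;
	}
      if (i == ne_len)
	return count;          /* Match found at position j.  */
    }
  return count;
}
```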
+#else
-#if !defined(PREFER_SIZE_OVER_SPEED) && !defined(__OPTIMIZE_SIZE__)
# define RETURN_TYPE void *
# define AVAILABLE(h, h_l, j, n_l) ((j) <= (h_l) - (n_l))
# include "str-two-way.h"
-#endif
+#define hash2(p) ((size_t)((p)[0]-((p)[-1]<<3)) % sizeof (shift))
+
+
+/* Fast memmem algorithm with guaranteed linear-time performance.
+ Small needles up to size 2 use a dedicated linear search. Longer needles
Why is the super-small needle cutoff at 2, rather than 4 or 8 by taking
advantage of vectorized searching?
+ up to size 256 use a novel modified Horspool algorithm. It hashes pairs
+ of characters to quickly skip past mismatches.
+ By limiting the needle length to 256, the shift table can be reduced to 8
+ bits per entry, lowering preprocessing overhead and minimizing cache effects.
+ The limit also implies worst-case performance is linear. On long needles a
+ self-adapting filtering step is used to quickly determine mismatches.
+ Needles larger than 256 characters use the linear-time Two-Way algorithm. */
I can't promise that I tried to completely understand the algorithm, but
the approach makes sense.
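(To make the approach concrete for other readers: below is my own hedged sketch of a hashed-pair Horspool search for needles of length 2..256. It is not the patch's exact code and omits the long-needle filtering step; `HASH2` and `horspool2` are illustrative names, and the pair hash mirrors the quoted hash2 macro. The shift table records, per hash bucket, the highest needle index whose trailing character pair falls in that bucket; a zero entry means no needle pair hashes there, so the window can skip by nearly a whole needle. Hash collisions only shrink shifts or trigger an extra memcmp, so the search stays correct.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hash two adjacent bytes into a 256-entry table (mirrors hash2).  */
#define HASH2(p) (((size_t) (p)[0] - ((size_t) (p)[-1] << 3)) % 256)

/* Sketch of a hashed-pair Horspool search, ne_len in [2, 256].  */
static void *
horspool2 (const unsigned char *hs, size_t hs_len,
	   const unsigned char *ne, size_t ne_len)
{
  uint8_t shift[256];
  size_t m1 = ne_len - 1;

  if (hs_len < ne_len || ne_len < 2 || ne_len > 256)
    return NULL;

  /* shift[h] = highest index i (1..m1) whose pair ne[i-1],ne[i]
     hashes to h; 0 means no needle pair hashes to h.  */
  memset (shift, 0, sizeof (shift));
  for (size_t i = 1; i < m1; i++)
    shift[HASH2 (ne + i)] = i;
  /* Skip to use after the final pair matched but the needle didn't:
     distance back to the previous pair with the same hash.  */
  size_t shift1 = m1 - shift[HASH2 (ne + m1)];
  shift[HASH2 (ne + m1)] = m1;

  for (size_t j = 0; j <= hs_len - ne_len; )
    {
      size_t tmp = shift[HASH2 (hs + j + m1)];
      if (tmp == 0)
	{
	  j += m1;		/* Pair absent from needle: long skip.  */
	  continue;
	}
      if (tmp != m1)
	{
	  j += m1 - tmp;	/* Align candidate pair with window end.  */
	  continue;
	}
      if (memcmp (hs + j, ne, ne_len) == 0)
	return (void *) (hs + j);
      j += shift1;		/* Verified mismatch: safe skip >= 1.  */
    }
  return NULL;
}
```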
+ /* Use Two-Way algorithm for very long needles. */
+ if (__builtin_expect (ne_len > 256, 0))
+ return two_way_long_needle (hs, hs_len, ne, ne_len);
- /* Larger code size, but guaranteed linear performance. */
+ uint8_t shift[256];
+ size_t tmp, shift1;
+ size_t m1 = ne_len - 1;
+ size_t offset = 0;
- /* Sanity check, otherwise the loop might search through the whole
- memory. */
- if (haystack_len < needle_len)
- return NULL;
+ /* Initialize bad character shift hash table. */
+ memset (shift, 0, sizeof (shift));
I don't know if we have a preference for 'sizeof shift' rather than
'sizeof (shift)'.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org