[PATCH] Use memmem algorithm for strstr

Eric Blake eblake@redhat.com
Thu Jan 3 19:18:00 GMT 2019


On 12/27/18 12:30 PM, Wilco Dijkstra wrote:
> This patch further improves performance of strstr by using the new memmem
> algorithm (http://www.cygwin.com/ml/newlib/2018/msg01151.html). 
> The performance gain on English text with randomized needles of average size
> 8 is 25% on small haystacks and 75% for large haystacks.
> 
> Tested against GLIBC testsuite, randomized tests and the GNULIB strstr test
> (https://git.savannah.gnu.org/cgit/gnulib.git/tree/tests/test-strstr.c).
> 
> --

> @@ -150,10 +130,6 @@ strstr (const char *haystack, const char *needle)
>      return (char*)strchr (hs, ne[0]);
>    if (ne[2] == '\0')
>      return strstr2 (hs, ne);
> -  if (ne[3] == '\0')
> -    return strstr3 (hs, ne);
> -  if (ne[4] == '\0')
> -    return strstr4 (hs, ne);
>  
>    size_t ne_len = strlen (ne);
>    size_t hs_len = strnlen (hs, ne_len | 512);

Is the new code faster than strstr[34] on short needles, or should you
be keeping those versions around?

Calling strlen (ne) can be wasteful for someone that calls
strstr("short", "super_long.....") - such callers will always get an
answer of NULL, but the question is how much dead work did we do in the
meantime.  My original Two-Way implementation tried hard to not compute
the bounds of EITHER needle or haystack in isolation, but rather made a
first pass through both simultaneously to detect early exit
opportunities long before an unbounded strlen() on a disproportionately
long counterpart.  Thus, I think this code should be using strnlen (ne,
256) before deciding to switch to Two-Way, rather than an unbounded length.

> @@ -162,39 +138,59 @@ strstr (const char *haystack, const char *needle)
>    if (hs_len < ne_len)
>      return NULL;
>  
> -  /* Use the Quick-Search algorithm for needle lengths less than 255.  */
> -  if (__builtin_expect (ne_len < 255, 1))
> +  /* Use Two-Way algorithm for very long needles.  */
> +  if (__builtin_expect (ne_len > 256, 0))
> +    return two_way_long_needle (hs, hs_len, ne, ne_len);

Is hs_len accurate (or even ne_len, if you take my suggestion to use
strnlen() in its compuatation)?  Or do you first need to (re-)compute
accurate lengths?

> +
> +  const unsigned char *end = hs + hs_len - ne_len;
> +  uint8_t shift[256];
> +  size_t tmp, shift1;
> +  size_t m1 = ne_len - 1;
> +  size_t offset = 0;
> +
> +  /* Initialize bad character shift hash table.  */
> +  memset (shift, 0, sizeof (shift));
> +  for (int i = 1; i < m1; i++)
> +    shift[hash2 (ne + i)] = i;
> +  shift1 = m1 - shift[hash2 (ne + m1)];
> +  shift[hash2 (ne + m1)] = m1;
> +
> +  while (1)
>      {
> -      uint8_t shift[1 << SHIFT_TABLE_BITS];
> -      const unsigned char *end = hs + hs_len - ne_len;
> -
> -      /* Initialize bad character shift hash table.  */
> -      memset (shift, ne_len + 1, sizeof (shift));
> -      for (int i = 0; i < ne_len; i++)
> -	shift[ne[i] % sizeof (shift)] = ne_len - i;
> +      if (__builtin_expect (hs > end, 0))
> +	{
> +	  end += strnlen (end + m1 + 1, 2048);

Why hard-coded 2048 here, vs. ne_len|512 above?

> +	  if (hs > end)
> +	    return NULL;
> +	}
>  
> +      /* Skip past character pairs not in the needle.  */
>        do
>  	{
> -	  hs--;
> -
> -	  /* Search by skipping past bad characters.  */
> -	  size_t tmp = shift[hs[ne_len] % sizeof (shift)];
> -	  for (hs += tmp; hs <= end; hs += tmp)
> -	    {
> -	      tmp = shift[hs[ne_len] % sizeof (shift)];
> -	      if (memcmp (hs, ne, ne_len) == 0)
> -		return (char*) hs;
> -	    }
> -	  if (end[ne_len] == 0)
> -	    return NULL;
> -	  end += strnlen (end + ne_len, 2048);
> +	  hs += m1;
> +	  tmp = shift[hash2 (hs)];
>  	}
> -      while (hs <= end);
> +      while (hs <= end && tmp == 0);
>  
> -      return NULL;
> -    }
> +      /* If the match is not at the end of the needle, shift to the end
> +	 and continue until we match the last 2 characters.  */
> +      hs -= tmp;
> +      if (tmp < m1)
> +	continue;
>  
> -  /* Use Two-Way algorithm for very long needles.  */
> -  return two_way_long_needle (hs, hs_len, ne, ne_len);
> +      /* The last 2 characters match.  If the needle is long, check a
> +	 fixed number of characters first to quickly filter out mismatches.  */
> +      if (m1 <= 15 || memcmp (hs + offset, ne + offset, sizeof (long)) == 0)
> +	{
> +	  if (memcmp (hs, ne, m1) == 0)
> +	    return (void *) hs;
> +
> +	  /* Adjust filter offset when it doesn't find the mismatch.  */
> +	  offset = (offset >= sizeof (long) ? offset : m1) - sizeof (long);
> +	}
> +
> +      /* Skip based on matching the last 2 characters.  */
> +      hs += shift1;
> +    }
>  }
>  #endif /* compilation for speed */
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: OpenPGP digital signature
URL: <http://sourceware.org/pipermail/newlib/attachments/20190103/b11f063e/attachment.sig>


More information about the Newlib mailing list