[PATCH] Speed-up character range regexes by up to 2x
Paolo Bonzini
paolo.bonzini@polimi.it
Mon Jan 12 09:50:00 GMT 2004
On single-byte character sets that have no collation elements (or on all
SBCS outside libc), it is useless to create the COMPLEX_BRACKET node
that tries to accept multi-byte chars for a character range. Without
that node, transit_state_mb is never called in those locales (including
the C locale) and some more book-keeping disappears: I measured a speed
improvement of over 40% when matching [A-Z][0-9] against
ABCDEFGHIJKLMNOPQRSTUVWXYZ.
This patch does this together with a few other cleanups that I found
while reading the code.
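For anyone who wants to reproduce a measurement of this kind, here is a minimal timing sketch. It is not the harness used for the number above: the helper name time_regexec and the iteration count are assumptions, and only the standard POSIX regcomp/regexec/regfree interface is used. Because the subject contains no digit, the pattern never matches and the matcher has to try every start position.

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: compile PATTERN, run it ITERATIONS times against
   SUBJECT, and store the CPU seconds spent matching in *SECONDS.
   Returns the result of the last regexec call, or -1 if regcomp fails.  */
static int
time_regexec (const char *pattern, const char *subject,
              long iterations, double *seconds)
{
  regex_t re;
  int rc = REG_NOMATCH;
  clock_t begin;
  long i;

  *seconds = 0.0;
  if (regcomp (&re, pattern, REG_NOSUB) != 0)
    return -1;

  begin = clock ();
  for (i = 0; i < iterations; i++)
    rc = regexec (&re, subject, 0, NULL, 0);
  *seconds = (double) (clock () - begin) / CLOCKS_PER_SEC;

  regfree (&re);
  return rc;
}
```

Call setlocale (LC_ALL, "C") first to stay in a single-byte locale, then time_regexec ("[A-Z][0-9]", "ABCDEFGHIJKLMNOPQRSTUVWXYZ", n, &secs) with a large n, once against each build of the library.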
Please also apply the patch
http://sources.redhat.com/ml/libc-alpha/2004-01/msg00099.html which is
needed to compile regex on many non-gcc hosts.
What follows is my review of the "gawk guy"'s regex patch:
> +#ifdef RE_ENABLE_I18N
> int icase = (dfa->mb_cur_max == 1 && (bufp->syntax & RE_ICASE));
> +#else
> + int icase = (bufp->syntax & RE_ICASE);
> +#endif
This is unneeded.
> @@ -2558,8 +2564,8 @@
> ? __btowc (start_ch) : start_elem->opr.wch);
> end_wc = ((end_elem->type == SB_CHAR || end_elem->type == COLL_SYM)
> ? __btowc (end_ch) : end_elem->opr.wch);
> - cmp_buf[0] = start_wc;
> - cmp_buf[4] = end_wc;
> + cmp_buf[0] = start_wc != WEOF ? start_wc : start_ch;
> + cmp_buf[4] = end_wc != WEOF ? end_wc : end_ch;
> if (wcscoll (cmp_buf, cmp_buf + 4) > 0)
> return REG_ERANGE;
I am not sure this is the right fix; maybe it is better not to include
the character set if start_wc == WEOF || end_wc == WEOF, or to return
REG_ERANGE?
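To make the alternative concrete, here is a standalone sketch of the "reject instead of falling back to the raw byte" option. It is not the libc code: check_sb_range and the range_status values are hypothetical names, and the real function would return REG_ERANGE or skip the range rather than use these codes. Only btowc and wcscoll are taken from the quoted hunk.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical status codes, used only in this sketch.  */
enum range_status { RANGE_OK = 0, RANGE_INVALID_CHAR = 1, RANGE_REVERSED = 2 };

/* Check a single-byte range START_CH-END_CH in the current locale.
   A byte that btowc cannot convert is reported as an error instead of
   being compared as a raw byte value.  */
static enum range_status
check_sb_range (int start_ch, int end_ch)
{
  wchar_t start_buf[2] = { L'\0', L'\0' };
  wchar_t end_buf[2] = { L'\0', L'\0' };
  wint_t start_wc = btowc (start_ch);
  wint_t end_wc = btowc (end_ch);

  if (start_wc == WEOF || end_wc == WEOF)
    return RANGE_INVALID_CHAR;

  /* wcscoll needs NUL-terminated wide strings, hence the buffers.  */
  start_buf[0] = (wchar_t) start_wc;
  end_buf[0] = (wchar_t) end_wc;
  if (wcscoll (start_buf, end_buf) > 0)
    return RANGE_REVERSED;

  return RANGE_OK;
}
```

Which bytes btowc rejects depends on the locale and on the C library, so the choice between the two behaviours only shows up for bytes that are not valid characters on their own.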
> +#ifdef HAVE_CONFIG_H
> +#include "config.h"
> +#endif
The alloca patch does this at the very beginning of the file.
> +
> +#if defined (_MSC_VER)
> +#include <stdio.h> /* for size_t */
> +#endif
> +
> +#include <limits.h>
This is needed.
> +# elif defined __APPLE_CC__
> +# define __restrict
This too.
> +#if 0
> +/* Don't include this here. On some systems it sets RE_DUP_MAX to a
> + * lower value than GNU regex allows. Instead, include it in
> + * regex.c, before include of <regex.h>, which correctly
> + * #undefs RE_DUP_MAX and sets it to the right value.
> + */
> #include <limits.h>
> +#endif
Can this be removed completely?
> /* This is for other GNU distributions with internationalized messages. */
> -#if HAVE_LIBINTL_H || defined _LIBC
> +#if (HAVE_LIBINTL_H && ENABLE_NLS) || defined _LIBC
Also needed.
> +#if _LIBC || __GNUC__ >= 3
> +# define BE(expr, val) __builtin_expect (expr, val)
> +#else
> +# define BE(expr, val) (expr)
> +# define inline
> +#endif
> +
Isn't this already there? Also, shouldn't "#define inline" be taken
care of in the configure script?
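As a sketch of what I mean, the header could keep only the branch-prediction macro and leave "inline" alone, assuming the configure script handles it (Autoconf's AC_C_INLINE defines inline in config.h when the compiler needs it). The function first_nonzero is a made-up example and is not part of regex.

```c
/* Use __builtin_expect only where the compiler is known to provide it;
   elsewhere BE simply evaluates its first argument.  */
#if defined __GNUC__ && __GNUC__ >= 3
# define BE(expr, val) __builtin_expect ((expr), (val))
#else
# define BE(expr, val) (expr)
#endif

/* Hypothetical example: return the index of the first non-zero element
   of V, or -1 if all N elements are zero.  The hint says that a
   non-zero element is the unlikely case.  */
static int
first_nonzero (const int *v, int n)
{
  int i;

  for (i = 0; i < n; i++)
    if (BE (v[i] != 0, 0))
      return i;
  return -1;
}
```

The hint only affects code layout; the result is the same with either definition of BE.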
Paolo
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: regex-speedup-char-ranges.patch
URL: <http://sourceware.org/pipermail/libc-alpha/attachments/20040112/9fad83b8/attachment.ksh>