[PATCH] Speed-up character range regexes by up to 2x

Paolo Bonzini paolo.bonzini@polimi.it
Mon Jan 12 09:50:00 GMT 2004


On single-byte character sets that have no collation elements (or on all 
SBCS outside libc), it is useless to create the COMPLEX_BRACKET node 
that tries to accept multi-byte chars for a character range.  With this 
change, transit_state_mb is never called in those locales (including 
the C locale) and some more book-keeping disappears: I got a speed 
improvement of over 40% matching [A-Z][0-9] against 
ABCDEFGHIJKLMNOPQRSTUVWXYZ.
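
A measurement of this kind can be reproduced with a small timing loop 
around the POSIX regcomp/regexec interface.  The following is only a 
minimal sketch, not the harness used for the figure above: the helper 
name bench_range_match and its iteration-count parameter are 
assumptions made for illustration.  Without a call to setlocale the 
program runs in the C locale, which is one of the locales the 
optimization applies to.

```c
#include <regex.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper, not part of the patch.  Compiles PATTERN once,
   runs it against SUBJECT ITERATIONS times, and returns the result of
   the last regexec call (0 on a match, REG_NOMATCH otherwise), or -1
   if regcomp fails.  If ELAPSED is not NULL, the CPU time spent in the
   loop is stored there in seconds.  */
int
bench_range_match (const char *pattern, const char *subject,
                   long iterations, double *elapsed)
{
  regex_t re;
  int result = REG_NOMATCH;
  long i;
  clock_t begin, end;

  /* REG_NOSUB: only the match/no-match answer is of interest here.  */
  if (regcomp (&re, pattern, REG_NOSUB) != 0)
    return -1;

  begin = clock ();
  for (i = 0; i < iterations; i++)
    result = regexec (&re, subject, 0, NULL, 0);
  end = clock ();

  regfree (&re);
  if (elapsed != NULL)
    *elapsed = (double) (end - begin) / CLOCKS_PER_SEC;
  return result;
}
```

Calling bench_range_match ("[A-Z][0-9]", "ABCDEFGHIJKLMNOPQRSTUVWXYZ", 
n, &t) before and after applying the patch, with the same n, gives two 
times to compare.  The subject contains no digit, so every call has to 
try each starting position before reporting REG_NOMATCH.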

This patch does this together with a few other cleanups that I found 
while reading the code.

Please also apply the patch 
http://sources.redhat.com/ml/libc-alpha/2004-01/msg00099.html which is 
needed to compile regex on many non-gcc hosts.

What follows is a review of the "gawk guy"'s regex patch:

 > +#ifdef RE_ENABLE_I18N
 >    int icase = (dfa->mb_cur_max == 1 && (bufp->syntax & RE_ICASE));
 > +#else
 > +  int icase = (bufp->syntax & RE_ICASE);
 > +#endif

This is unneeded.

 > @@ -2558,8 +2564,8 @@
 >                 ? __btowc (start_ch) : start_elem->opr.wch);
 >      end_wc = ((end_elem->type == SB_CHAR || end_elem->type == COLL_SYM)
 >               ? __btowc (end_ch) : end_elem->opr.wch);
 > -    cmp_buf[0] = start_wc;
 > -    cmp_buf[4] = end_wc;
 > +    cmp_buf[0] = start_wc != WEOF ? start_wc : start_ch;
 > +    cmp_buf[4] = end_wc != WEOF ? end_wc : end_ch;
 >      if (wcscoll (cmp_buf, cmp_buf + 4) > 0)
 >        return REG_ERANGE;
 
I am not sure this is the right fix; maybe it is better not to include 
the character set if start_wc == WEOF || end_wc == WEOF, or to return 
REG_ERANGE?
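
To make the second alternative concrete, here is a minimal stand-alone 
sketch of a range check that rejects the range when either endpoint has 
no wide-character equivalent.  It is not the code in regcomp.c: the 
function name check_sb_range is hypothetical, it handles only 
single-byte endpoints, and it uses a smaller comparison buffer than the 
cmp_buf in the quoted hunk.

```c
#include <regex.h>
#include <wchar.h>

/* Hypothetical stand-alone check, not taken from regcomp.c.  Returns 0
   if START_CH..END_CH is an acceptable range in the current locale and
   REG_ERANGE otherwise.  A byte that btowc cannot convert (WEOF) makes
   the range invalid instead of falling back to the raw byte value.  */
int
check_sb_range (unsigned char start_ch, unsigned char end_ch)
{
  /* Two one-character wide strings, each followed by its terminator.  */
  wchar_t cmp_buf[4] = { L'\0', L'\0', L'\0', L'\0' };
  wint_t start_wc = btowc (start_ch);
  wint_t end_wc = btowc (end_ch);

  if (start_wc == WEOF || end_wc == WEOF)
    return REG_ERANGE;

  cmp_buf[0] = (wchar_t) start_wc;
  cmp_buf[2] = (wchar_t) end_wc;

  /* A range whose start collates after its end is an error.  */
  if (wcscoll (cmp_buf, cmp_buf + 2) > 0)
    return REG_ERANGE;
  return 0;
}
```

Which bytes, if any, btowc maps to WEOF depends on the locale and on 
the C library, so whether this behaves differently from the quoted hunk 
in practice has to be checked on the target system.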

 > +#ifdef HAVE_CONFIG_H
 > +#include "config.h"
 > +#endif

The alloca patch does this at the very beginning of the file.

 > +
 > +#if defined (_MSC_VER)
 > +#include <stdio.h> /* for size_t */
 > +#endif
 > +
 > +#include <limits.h>

This is needed.

 > +# elif defined __APPLE_CC__
 > +#  define __restrict

This too.

 > +#if 0
 > +/* Don't include this here. On some systems it sets RE_DUP_MAX to a
 > + * lower value than GNU regex allows.  Instead, include it in
 > + * regex.c, before include of <regex.h>, which correctly
 > + * #undefs RE_DUP_MAX and sets it to the right value.
 > + */
 >  #include <limits.h>
 > +#endif

Can this be removed completely?


 >  /* This is for other GNU distributions with internationalized messages.  */
 > -#if HAVE_LIBINTL_H || defined _LIBC
 > +#if (HAVE_LIBINTL_H && ENABLE_NLS) || defined _LIBC

Also needed.

 > +#if _LIBC || __GNUC__ >= 3
 > +# define BE(expr, val) __builtin_expect (expr, val)
 > +#else
 > +# define BE(expr, val) (expr)
 > +# define inline
 > +#endif
 > +

Isn't this already there?  Also, shouldn't "#define inline" be taken 
care of in the configure script?

Paolo

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: regex-speedup-char-ranges.patch
URL: <http://sourceware.org/pipermail/libc-alpha/attachments/20040112/9fad83b8/attachment.ksh>

