This is the mail archive of the libc-hacker@sources.redhat.com mailing list for the glibc project.
Note that libc-hacker is a closed list. You may look at the archives of this list, but subscription and posting are not open.
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]
[PATCH] Fix regex in tr_TR.UTF-8 locale

From: Jakub Jelinek <jakub at redhat dot com>
To: Ulrich Drepper <drepper at redhat dot com>, Roland McGrath <roland at redhat dot com>
Cc: Glibc hackers <libc-hacker at sources dot redhat dot com>
Date: Thu, 27 Nov 2003 16:52:39 +0100
Subject: [PATCH] Fix regex in tr_TR.UTF-8 locale
Reply-to: Jakub Jelinek <jakub at redhat dot com>
Hi!

Well, other locales, though with tr_TR.UTF-8 it is most visible.
Regex was not able to cope with upper and lower case characters
encoded in different number of bytes.  This patch also fixes regex.translate
handling, so that it is used consistently.  Previously it was sometimes
applied on raw bytes from strings, sometimes on a wide char etc.
make check subdirs=posix passes even under ElectricFence.
Something like this can happen even in e.g. en_US.UTF-8 locale,
when doing REG_ICASE searching on string including '\xc4' '\xb1'
bytes where towupper (0x131) == 'I' (and similarly for U17f and U1fbe).
I've changed some bitfields in re_string_t to unsigned char, since
they are used in performance critical places in some cases and
there is always just one re_string_t object for the lifetime of
regcomp or regexec.  tst-regex.out numbers look basically the same
before and after this patch.

2003-11-27  Jakub Jelinek  <jakub@redhat.com>

	* posix/regex_internal.h (re_string_t): Remove mbs_case field.
	Add offsets, valid_raw_len, raw_len, raw_stop, mbs_allocated and
	offsets_needed fields.  Change icase, is_utf8 and map_notascii
	type from int bitfield to unsigned char.
	(MBS_ALLOCATED, MBS_CASE_ALLOCATED): Remove.
	(build_wcs_upper_buffer): Change prototype to return int.
	(re_string_peek_byte_case, re_string_fetch_byte_case): Remove
	defines, add prototypes.
	* posix/regex_internal.c (re_string_allocate): Don't initialize
	stop here.  Don't initialize mbs_case.  Set valid_raw_len.
	Use mbs_allocated instead of MBS_* macros.
	(re_string_construct): Don't initialize stop and valid_len here.
	Don't initialize mbs_case.  Use mbs_allocated instead of MBS_*
	macros.  Reallocate buffers if build_wcs_upper_buffer converted
	too few bytes.  Set valid_len to bufs_len only for single byte
	no translation and set in that case valid_raw_len as well.
	(re_string_realloc_buffers): Reallocate offsets if not NULL.
	Use mbs_allocated instead of MBS_ALLOCATED.  Don't reallocate
	mbs_case.
	(re_string_construct_common): Initialize raw_len, mbs_allocated,
	stop and raw_stop.
	(build_wcs_buffer): Apply pstr->trans before mbrtowc instead of
	after it.  Set valid_raw_len.  Don't set mbs_case.
	(build_wcs_upper_buffer): Return REG_NOERROR or REG_ESPACE.
	Only use the fast path if !pstr->offsets_needed.  Apply pstr->trans
	before mbrtowc instead of after it.  If upper case character
	uses different number of bytes than lower case, goto to the
	slow path.  Don't call towupper unnecessarily twice.  Set
	valid_raw_len as well.  Handle in the slow path the case if
	lower and upper case use different number of characters.
	Don't set mbs_case.
	(re_string_skip_chars): Use valid_raw_len instead of valid_len.
	(build_upper_buffer): Don't set mbs_case.  Add BE macro.  Set
	valid_raw_len.
	(re_string_translate_buffer): Set mbs instead of mbs_case.  Set
	valid_raw_len.
	(re_string_reconstruct): Use raw_len/raw_stop to initialize
	len/stop.  Clear valid_raw_len and offsets_needed when clearing
	valid_len.  Use mbs_allocated instead of MBS_* macros.
	Check original offset against valid_raw_len instead of valid_len.
	Remove mbs_case handling.  Adjust valid_raw_len together with
	valid_len.  If is_utf8 and looking for tip context, apply
	pstr->trans first.  If buffers start with partial multi-byte
	character, initialize mbs array as well if mbs_allocated.
	Check return value of build_wcs_upper_buffer.
	(re_string_peek_byte_case): New function.
	(re_string_fetch_byte_case): New function.
	(re_string_destruct): Use mbs_allocated instead of MBS_ALLOCATED.
	Don't free mbs_case.  Free offsets.
	* posix/regcomp.c (init_dfa): Only check if charset name is UTF-8
	if mb_cur_max == 6.
	* posix/regexec.c (re_search_internal): Initialize input.raw_stop
	as well.  Use valid_raw_len instead of valid_len when looking
	through fastmap.  Adjust registers through input.offsets.
	(extend_buffers): Allow build_wcs_upper_buffer to fail.
	* posix/bug-regex18.c (tests): Enable #ifdefed out tests.  Add new
	tests.

--- libc/posix/regex_internal.h.jj	2003-11-24 23:49:53.000000000 +0100
+++ libc/posix/regex_internal.h	2003-11-27 15:44:00.000000000 +0100
@@ -302,13 +302,10 @@ struct re_string_t
      REG_ICASE, upper cases of the string are stored, otherwise MBS points
      the same address that RAW_MBS points.  */
   unsigned char *mbs;
-  /* Store the case sensitive multibyte string.  In case of
-     "case insensitive mode", the original string are stored,
-     otherwise MBS_CASE points the same address that MBS points.  */
-  unsigned char *mbs_case;
 #ifdef RE_ENABLE_I18N
   /* Store the wide character string which is corresponding to MBS.  */
   wint_t *wcs;
+  int *offsets;
   mbstate_t cur_state;
 #endif
   /* Index in RAW_MBS.  Each character mbs[i] corresponds to
@@ -316,15 +313,21 @@ struct re_string_t
   int raw_mbs_idx;
   /* The length of the valid characters in the buffers.  */
   int valid_len;
-  /* The length of the buffers MBS, MBS_CASE, and WCS.  */
+  /* The corresponding number of bytes in raw_mbs array.  */
+  int valid_raw_len;
+  /* The length of the buffers MBS and WCS.  */
   int bufs_len;
   /* The index in MBS, which is updated by re_string_fetch_byte.  */
   int cur_idx;
-  /* This is length_of_RAW_MBS - RAW_MBS_IDX.  */
+  /* length of RAW_MBS array.  */
+  int raw_len;
+  /* This is RAW_LEN - RAW_MBS_IDX + VALID_LEN - VALID_RAW_LEN.  */
   int len;
   /* End of the buffer may be shorter than its length in the cases such
      as re_match_2, re_search_2.  Then, we use STOP for end of the buffer
      instead of LEN.  */
+  int raw_stop;
+  /* This is RAW_STOP - RAW_MBS_IDX adjusted through OFFSETS.  */
   int stop;
 
   /* The context of mbs[0].  We store the context independently, since
@@ -334,17 +337,14 @@ struct re_string_t
   /* The translation passed as a part of an argument of re_compile_pattern.  */
   RE_TRANSLATE_TYPE trans;
   /* 1 if REG_ICASE.  */
-  unsigned int icase : 1;
-  unsigned int is_utf8 : 1;
-  unsigned int map_notascii : 1;
+  unsigned char icase;
+  unsigned char is_utf8;
+  unsigned char map_notascii;
+  unsigned char mbs_allocated;
+  unsigned char offsets_needed;
   int mb_cur_max;
 };
 typedef struct re_string_t re_string_t;
-/* In case of REG_ICASE, we allocate the buffer dynamically for mbs.  */
-#define MBS_ALLOCATED(pstr) (pstr->icase)
-/* In case that we need translation, we allocate the buffer dynamically
-   for mbs_case.  Note that mbs == mbs_case if not REG_ICASE.  */
-#define MBS_CASE_ALLOCATED(pstr) (pstr->trans != NULL)
 
 
 struct re_dfa_t;
@@ -363,7 +363,7 @@ static reg_errcode_t re_string_realloc_b
 						int new_buf_len);
 # ifdef RE_ENABLE_I18N
 static void build_wcs_buffer (re_string_t *pstr);
-static void build_wcs_upper_buffer (re_string_t *pstr);
+static int build_wcs_upper_buffer (re_string_t *pstr);
 # endif /* RE_ENABLE_I18N */
 static void build_upper_buffer (re_string_t *pstr);
 static void re_string_translate_buffer (re_string_t *pstr);
@@ -375,15 +375,14 @@ static inline wint_t re_string_wchar_at 
 # endif /* RE_ENABLE_I18N */
 static unsigned int re_string_context_at (const re_string_t *input, int idx,
 					  int eflags, int newline_anchor);
+static unsigned char re_string_peek_byte_case (const re_string_t *pstr,
+					       int idx);
+static unsigned char re_string_fetch_byte_case (re_string_t *pstr);
 #endif
 #define re_string_peek_byte(pstr, offset) \
   ((pstr)->mbs[(pstr)->cur_idx + offset])
-#define re_string_peek_byte_case(pstr, offset) \
-  ((pstr)->mbs_case[(pstr)->cur_idx + offset])
 #define re_string_fetch_byte(pstr) \
   ((pstr)->mbs[(pstr)->cur_idx++])
-#define re_string_fetch_byte_case(pstr) \
-  ((pstr)->mbs_case[(pstr)->cur_idx++])
 #define re_string_first_byte(pstr, idx) \
   ((idx) == (pstr)->valid_len || (pstr)->wcs[idx] != WEOF)
 #define re_string_is_single_byte_char(pstr, idx) \
--- libc/posix/regex_internal.c.jj	2003-11-26 09:25:16.000000000 +0100
+++ libc/posix/regex_internal.c	2003-11-27 18:23:56.000000000 +0100
@@ -62,17 +62,14 @@ re_string_allocate (pstr, str, len, init
     init_len = dfa->mb_cur_max;
   init_buf_len = (len + 1 < init_len) ? len + 1: init_len;
   re_string_construct_common (str, len, pstr, trans, icase, dfa);
-  pstr->stop = pstr->len;
 
   ret = re_string_realloc_buffers (pstr, init_buf_len);
   if (BE (ret != REG_NOERROR, 0))
     return ret;
 
-  pstr->mbs_case = (MBS_CASE_ALLOCATED (pstr) ? pstr->mbs_case
-		    : (unsigned char *) str);
-  pstr->mbs = MBS_ALLOCATED (pstr) ? pstr->mbs : pstr->mbs_case;
-  pstr->valid_len = (MBS_CASE_ALLOCATED (pstr) || MBS_ALLOCATED (pstr)
-		     || dfa->mb_cur_max > 1) ? pstr->valid_len : len;
+  pstr->mbs = pstr->mbs_allocated ? pstr->mbs : (unsigned char *) str;
+  pstr->valid_len = (pstr->mbs_allocated || dfa->mb_cur_max > 1) ? 0 : len;
+  pstr->valid_raw_len = pstr->valid_len;
   return REG_NOERROR;
 }
 
@@ -88,9 +85,6 @@ re_string_construct (pstr, str, len, tra
 {
   reg_errcode_t ret;
   re_string_construct_common (str, len, pstr, trans, icase, dfa);
-  pstr->stop = pstr->len;
-  /* Set 0 so that this function can initialize whole buffers.  */
-  pstr->valid_len = 0;
 
   if (len > 0)
     {
@@ -98,15 +92,27 @@ re_string_construct (pstr, str, len, tra
       if (BE (ret != REG_NOERROR, 0))
 	return ret;
     }
-  pstr->mbs_case = (MBS_CASE_ALLOCATED (pstr) ? pstr->mbs_case
-		    : (unsigned char *) str);
-  pstr->mbs = MBS_ALLOCATED (pstr) ? pstr->mbs : pstr->mbs_case;
+  pstr->mbs = pstr->mbs_allocated ? pstr->mbs : (unsigned char *) str;
 
   if (icase)
     {
 #ifdef RE_ENABLE_I18N
       if (dfa->mb_cur_max > 1)
-	build_wcs_upper_buffer (pstr);
+	{
+	  while (1)
+	    {
+	      ret = build_wcs_upper_buffer (pstr);
+	      if (BE (ret != REG_NOERROR, 0))
+		return ret;
+	      if (pstr->valid_raw_len >= len)
+		break;
+	      if (pstr->bufs_len > pstr->valid_len + dfa->mb_cur_max)
+		break;
+	      ret = re_string_realloc_buffers (pstr, pstr->bufs_len * 2);
+	      if (BE (ret != REG_NOERROR, 0))
+		return ret;
+	    }
+	}
       else
 #endif /* RE_ENABLE_I18N  */
 	build_upper_buffer (pstr);
@@ -122,12 +128,13 @@ re_string_construct (pstr, str, len, tra
 	  if (trans != NULL)
 	    re_string_translate_buffer (pstr);
 	  else
-	    pstr->valid_len = len;
+	    {
+	      pstr->valid_len = pstr->bufs_len;
+	      pstr->valid_raw_len = pstr->bufs_len;
+	    }
 	}
     }
 
-  /* Initialized whole buffers, then valid_len == bufs_len.  */
-  pstr->valid_len = pstr->bufs_len;
   return REG_NOERROR;
 }
 
@@ -145,9 +152,16 @@ re_string_realloc_buffers (pstr, new_buf
       if (BE (new_array == NULL, 0))
 	return REG_ESPACE;
       pstr->wcs = new_array;
+      if (pstr->offsets != NULL)
+	{
+	  int *new_array = re_realloc (pstr->offsets, int, new_buf_len);
+	  if (BE (new_array == NULL, 0))
+	    return REG_ESPACE;
+	  pstr->offsets = new_array;
+	}
     }
 #endif /* RE_ENABLE_I18N  */
-  if (MBS_ALLOCATED (pstr))
+  if (pstr->mbs_allocated)
     {
       unsigned char *new_array = re_realloc (pstr->mbs, unsigned char,
 					     new_buf_len);
@@ -155,16 +169,6 @@ re_string_realloc_buffers (pstr, new_buf
 	return REG_ESPACE;
       pstr->mbs = new_array;
     }
-  if (MBS_CASE_ALLOCATED (pstr))
-    {
-      unsigned char *new_array = re_realloc (pstr->mbs_case, unsigned char,
-					     new_buf_len);
-      if (BE (new_array == NULL, 0))
-	return REG_ESPACE;
-      pstr->mbs_case = new_array;
-      if (!MBS_ALLOCATED (pstr))
-	pstr->mbs = pstr->mbs_case;
-    }
   pstr->bufs_len = new_buf_len;
   return REG_NOERROR;
 }
@@ -182,11 +186,15 @@ re_string_construct_common (str, len, ps
   memset (pstr, '\0', sizeof (re_string_t));
   pstr->raw_mbs = (const unsigned char *) str;
   pstr->len = len;
+  pstr->raw_len = len;
   pstr->trans = trans;
   pstr->icase = icase ? 1 : 0;
+  pstr->mbs_allocated = (trans != NULL || icase);
   pstr->mb_cur_max = dfa->mb_cur_max;
   pstr->is_utf8 = dfa->is_utf8;
   pstr->map_notascii = dfa->map_notascii;
+  pstr->stop = pstr->len;
+  pstr->raw_stop = pstr->stop;
 }
 
 #ifdef RE_ENABLE_I18N
@@ -206,18 +214,39 @@ static void
 build_wcs_buffer (pstr)
      re_string_t *pstr;
 {
+#ifdef _LIBC
+  unsigned char buf[pstr->mb_cur_max];
+#else
+  unsigned char buf[64];
+#endif
   mbstate_t prev_st;
   int byte_idx, end_idx, mbclen, remain_len;
+
   /* Build the buffers from pstr->valid_len to either pstr->len or
      pstr->bufs_len.  */
-  end_idx = (pstr->bufs_len > pstr->len)? pstr->len : pstr->bufs_len;
+  end_idx = (pstr->bufs_len > pstr->len) ? pstr->len : pstr->bufs_len;
   for (byte_idx = pstr->valid_len; byte_idx < end_idx;)
     {
       wchar_t wc;
+      const char *p;
+
       remain_len = end_idx - byte_idx;
       prev_st = pstr->cur_state;
-      mbclen = mbrtowc (&wc, ((const char *) pstr->raw_mbs + pstr->raw_mbs_idx
-			      + byte_idx), remain_len, &pstr->cur_state);
+      /* Apply the translation if we need.  */
+      if (BE (pstr->trans != NULL, 0))
+	{
+	  int i, ch;
+
+	  for (i = 0; i < pstr->mb_cur_max && i < remain_len; ++i)
+	    {
+	      ch = pstr->raw_mbs [pstr->raw_mbs_idx + byte_idx + i];
+	      buf[i] = pstr->trans[ch];
+	    }
+	  p = (const char *) buf;
+	}
+      else
+	p = (const char *) pstr->raw_mbs + pstr->raw_mbs_idx + byte_idx;
+      mbclen = mbrtowc (&wc, p, remain_len, &pstr->cur_state);
       if (BE (mbclen == (size_t) -2, 0))
 	{
 	  /* The buffer doesn't have enough space, finish to build.  */
@@ -229,15 +258,11 @@ build_wcs_buffer (pstr)
 	  /* We treat these cases as a singlebyte character.  */
 	  mbclen = 1;
 	  wc = (wchar_t) pstr->raw_mbs[pstr->raw_mbs_idx + byte_idx];
+	  if (BE (pstr->trans != NULL, 0))
+	    wc = pstr->trans[wc];
 	  pstr->cur_state = prev_st;
 	}
 
-      /* Apply the translation if we need.  */
-      if (pstr->trans != NULL && mbclen == 1)
-	{
-	  int ch = pstr->trans[pstr->raw_mbs[pstr->raw_mbs_idx + byte_idx]];
-	  pstr->mbs_case[byte_idx] = ch;
-	}
       /* Write wide character and padding.  */
       pstr->wcs[byte_idx++] = wc;
       /* Write paddings.  */
@@ -245,61 +270,83 @@ build_wcs_buffer (pstr)
 	pstr->wcs[byte_idx++] = WEOF;
     }
   pstr->valid_len = byte_idx;
+  pstr->valid_raw_len = byte_idx;
 }
 
 /* Build wide character buffer PSTR->WCS like build_wcs_buffer,
    but for REG_ICASE.  */
 
-static void
+static int
 build_wcs_upper_buffer (pstr)
      re_string_t *pstr;
 {
   mbstate_t prev_st;
-  int byte_idx, end_idx, mbclen, remain_len;
-  /* Build the buffers from pstr->valid_len to either pstr->len or
-     pstr->bufs_len.  */
-  end_idx = (pstr->bufs_len > pstr->len)? pstr->len : pstr->bufs_len;
+  int src_idx, byte_idx, end_idx, mbclen, remain_len;
+#ifdef _LIBC
+  unsigned char buf[pstr->mb_cur_max];
+#else
+  unsigned char buf[64];
+#endif
+
+  byte_idx = pstr->valid_len;
+  end_idx = (pstr->bufs_len > pstr->len) ? pstr->len : pstr->bufs_len;
 
 #ifdef _LIBC
   /* The following optimization assumes that the wchar_t encoding is
      always ISO 10646.  */
-  if (! pstr->map_notascii && pstr->trans == NULL)
-    for (byte_idx = pstr->valid_len; byte_idx < end_idx;)
-      if (isascii (pstr->raw_mbs[pstr->raw_mbs_idx + byte_idx])
-	  && mbsinit (&pstr->cur_state))
-	{
-	  /* In case of a singlebyte character.  */
-	  pstr->mbs[byte_idx]
-	    = toupper (pstr->raw_mbs[pstr->raw_mbs_idx + byte_idx]);
-	  /* The next step uses the assumption that wchar_t is encoded
-	     with ISO 10646: all ASCII values can be converted like this.  */
-	  pstr->wcs[byte_idx] = (wchar_t) pstr->mbs[byte_idx];
-	  ++byte_idx;
-	}
-      else
+  if (! pstr->map_notascii && pstr->trans == NULL && !pstr->offsets_needed)
+    {
+      while (byte_idx < end_idx)
 	{
 	  wchar_t wc;
+
+	  if (isascii (pstr->raw_mbs[pstr->raw_mbs_idx + byte_idx])
+	      && mbsinit (&pstr->cur_state))
+	    {
+	      /* In case of a singlebyte character.  */
+	      pstr->mbs[byte_idx]
+		= toupper (pstr->raw_mbs[pstr->raw_mbs_idx + byte_idx]);
+	      /* The next step uses the assumption that wchar_t is encoded
+		 with ISO 10646: all ASCII values can be converted like
+		 this.  */
+	      pstr->wcs[byte_idx] = (wchar_t) pstr->mbs[byte_idx];
+	      ++byte_idx;
+	      continue;
+	    }
+
 	  remain_len = end_idx - byte_idx;
 	  prev_st = pstr->cur_state;
 	  mbclen = mbrtowc (&wc,
 			    ((const char *) pstr->raw_mbs + pstr->raw_mbs_idx
 			     + byte_idx), remain_len, &pstr->cur_state);
-	  if (BE (mbclen > 1, 1))
+	  if (BE (mbclen > 0, 1))
 	    {
+	      wchar_t wcu = wc;
 	      if (iswlower (wc))
-		wcrtomb ((char *) pstr->mbs + byte_idx, towupper (wc),
-			 &prev_st);
+		{
+		  int mbcdlen;
+
+		  wcu = towupper (wc);
+		  mbcdlen = wcrtomb (buf, wcu, &prev_st);
+		  if (BE (mbclen == mbcdlen, 1))
+		    memcpy (pstr->mbs + byte_idx, buf, mbclen);
+		  else
+		    {
+		      src_idx = byte_idx;
+		      goto offsets_needed;
+		    }
+		}
 	      else
 		memcpy (pstr->mbs + byte_idx,
 			pstr->raw_mbs + pstr->raw_mbs_idx + byte_idx, mbclen);
-	      pstr->wcs[byte_idx++] = towupper (wc);
+	      pstr->wcs[byte_idx++] = wcu;
 	      /* Write paddings.  */
 	      for (remain_len = byte_idx + mbclen - 1; byte_idx < remain_len ;)
 		pstr->wcs[byte_idx++] = WEOF;
 	    }
 	  else if (mbclen == (size_t) -1 || mbclen == 0)
 	    {
-	      /* It is an invalid character.  Just use the byte.  */
+	      /* It is an invalid character or '\0'.  Just use the byte.  */
 	      int ch = pstr->raw_mbs[pstr->raw_mbs_idx + byte_idx];
 	      pstr->mbs[byte_idx] = ch;
 	      /* And also cast it to wide char.  */
@@ -314,48 +361,116 @@ build_wcs_upper_buffer (pstr)
 	      break;
 	    }
 	}
+      pstr->valid_len = byte_idx;
+      pstr->valid_raw_len = byte_idx;
+      return REG_NOERROR;
+    }
   else
 #endif
-    for (byte_idx = pstr->valid_len; byte_idx < end_idx;)
+    for (src_idx = pstr->valid_raw_len; byte_idx < end_idx;)
       {
 	wchar_t wc;
+	const char *p;
+offsets_needed:
 	remain_len = end_idx - byte_idx;
 	prev_st = pstr->cur_state;
-	mbclen = mbrtowc (&wc,
-			  ((const char *) pstr->raw_mbs + pstr->raw_mbs_idx
-			   + byte_idx), remain_len, &pstr->cur_state);
-	if (mbclen == 1)
+	if (BE (pstr->trans != NULL, 0))
 	  {
-	    /* In case of a singlebyte character.  */
-	    int ch = pstr->raw_mbs[pstr->raw_mbs_idx + byte_idx];
-	    /* Apply the translation if we need.  */
-	    if (BE (pstr->trans != NULL, 0) && mbclen == 1)
+	    int i, ch;
+
+	    for (i = 0; i < pstr->mb_cur_max && i < remain_len; ++i)
 	      {
-		ch = pstr->trans[ch];
-		pstr->mbs_case[byte_idx] = ch;
+		ch = pstr->raw_mbs [pstr->raw_mbs_idx + src_idx + i];
+		buf[i] = pstr->trans[ch];
 	      }
-	    pstr->wcs[byte_idx] = towupper (wc);
-	    pstr->mbs[byte_idx++] = toupper (ch);
-	    if (BE (mbclen == (size_t) -1, 0))
-	      pstr->cur_state = prev_st;
+	    p = (const char *) buf;
 	  }
-	else if (BE (mbclen != (size_t) -2, 1))
+	else
+	  p = (const char *) pstr->raw_mbs + pstr->raw_mbs_idx + src_idx;
+	mbclen = mbrtowc (&wc, p, remain_len, &pstr->cur_state);
+	if (BE (mbclen > 0, 1))
 	  {
+	    wchar_t wcu = wc;
 	    if (iswlower (wc))
-	      wcrtomb ((char *) pstr->mbs + byte_idx, towupper (wc), &prev_st);
+	      {
+		int mbcdlen;
+
+		wcu = towupper (wc);
+		mbcdlen = wcrtomb (buf, wcu, &prev_st);
+		if (BE (mbclen == mbcdlen, 1))
+		  memcpy (pstr->mbs + byte_idx, buf, mbclen);
+		else
+		  {
+		    int i;
+
+		    if (byte_idx + mbcdlen > pstr->bufs_len)
+		      {
+			pstr->cur_state = prev_st;
+			break;
+		      }
+
+		    if (pstr->offsets == NULL)
+		      {
+			pstr->offsets = re_malloc (int, pstr->bufs_len);
+
+			if (pstr->offsets == NULL)
+			  return REG_ESPACE;
+		      }
+		    if (!pstr->offsets_needed)
+		      {
+			for (i = 0; i < byte_idx; ++i)
+			  pstr->offsets[i] = i;
+			pstr->offsets_needed = 1;
+		      }
+
+		    memcpy (pstr->mbs + byte_idx, buf, mbcdlen);
+		    pstr->wcs[byte_idx] = wcu;
+		    pstr->offsets[byte_idx] = src_idx;
+		    for (i = 1; i < mbcdlen; ++i)
+		      {
+			pstr->offsets[byte_idx + i]
+			  = src_idx + (i < mbclen ? i : mbclen - 1);
+			pstr->wcs[byte_idx + i] = WEOF;
+		      }
+		    pstr->len += mbcdlen - mbclen;
+		    if (pstr->raw_stop > src_idx)
+		      pstr->stop += mbcdlen - mbclen;
+		    end_idx = (pstr->bufs_len > pstr->len)
+			      ? pstr->len : pstr->bufs_len;
+		    byte_idx += mbcdlen;
+		    src_idx += mbclen;
+		    continue;
+		  }
+	      }
 	    else
-	      memcpy (pstr->mbs + byte_idx,
-		      pstr->raw_mbs + pstr->raw_mbs_idx + byte_idx, mbclen);
-	    pstr->wcs[byte_idx++] = towupper (wc);
+	      memcpy (pstr->mbs + byte_idx, p, mbclen);
+
+	    if (BE (pstr->offsets_needed != 0, 0))
+	      {
+		int i;
+		for (i = 0; i < mbclen; ++i)
+		  pstr->offsets[byte_idx + i] = src_idx + i;
+	      }
+	    src_idx += mbclen;
+
+	    pstr->wcs[byte_idx++] = wcu;
 	    /* Write paddings.  */
 	    for (remain_len = byte_idx + mbclen - 1; byte_idx < remain_len ;)
 	      pstr->wcs[byte_idx++] = WEOF;
 	  }
 	else if (mbclen == (size_t) -1 || mbclen == 0)
 	  {
-	    /* It is an invalid character.  Just use the byte.  */
-	    int ch = pstr->raw_mbs[pstr->raw_mbs_idx + byte_idx];
+	    /* It is an invalid character or '\0'.  Just use the byte.  */
+	    int ch = pstr->raw_mbs[pstr->raw_mbs_idx + src_idx];
+
+	    if (BE (pstr->trans != NULL, 0))
+	      ch = pstr->trans [ch];
 	    pstr->mbs[byte_idx] = ch;
+
+	    if (BE (pstr->offsets_needed != 0, 0))
+	      pstr->offsets[byte_idx] = src_idx;
+	    ++src_idx;
+
 	    /* And also cast it to wide char.  */
 	    pstr->wcs[byte_idx++] = (wchar_t) ch;
 	    if (BE (mbclen == (size_t) -1, 0))
@@ -369,6 +484,8 @@ build_wcs_upper_buffer (pstr)
 	  }
       }
   pstr->valid_len = byte_idx;
+  pstr->valid_raw_len = src_idx;
+  return REG_NOERROR;
 }
 
 /* Skip characters until the index becomes greater than NEW_RAW_IDX.
@@ -385,7 +502,7 @@ re_string_skip_chars (pstr, new_raw_idx,
   wchar_t wc = 0;
 
   /* Skip the characters which are not necessary to check.  */
-  for (rawbuf_idx = pstr->raw_mbs_idx + pstr->valid_len;
+  for (rawbuf_idx = pstr->raw_mbs_idx + pstr->valid_raw_len;
        rawbuf_idx < new_raw_idx;)
     {
       int remain_len;
@@ -420,17 +537,15 @@ build_upper_buffer (pstr)
   for (char_idx = pstr->valid_len; char_idx < end_idx; ++char_idx)
     {
       int ch = pstr->raw_mbs[pstr->raw_mbs_idx + char_idx];
-      if (pstr->trans != NULL)
-	{
-	  ch = pstr->trans[ch];
-	  pstr->mbs_case[char_idx] = ch;
-	}
+      if (BE (pstr->trans != NULL, 0))
+	ch = pstr->trans[ch];
       if (islower (ch))
 	pstr->mbs[char_idx] = toupper (ch);
       else
 	pstr->mbs[char_idx] = ch;
     }
   pstr->valid_len = char_idx;
+  pstr->valid_raw_len = char_idx;
 }
 
 /* Apply TRANS to the buffer in PSTR.  */
@@ -445,10 +560,11 @@ re_string_translate_buffer (pstr)
   for (buf_idx = pstr->valid_len; buf_idx < end_idx; ++buf_idx)
     {
       int ch = pstr->raw_mbs[pstr->raw_mbs_idx + buf_idx];
-      pstr->mbs_case[buf_idx] = pstr->trans[ch];
+      pstr->mbs[buf_idx] = pstr->trans[ch];
     }
 
   pstr->valid_len = buf_idx;
+  pstr->valid_raw_len = buf_idx;
 }
 
 /* This function re-construct the buffers.
@@ -468,14 +584,15 @@ re_string_reconstruct (pstr, idx, eflags
       if (pstr->mb_cur_max > 1)
 	memset (&pstr->cur_state, '\0', sizeof (mbstate_t));
 #endif /* RE_ENABLE_I18N */
-      pstr->len += pstr->raw_mbs_idx;
-      pstr->stop += pstr->raw_mbs_idx;
-      pstr->valid_len = pstr->raw_mbs_idx = 0;
+      pstr->len = pstr->raw_len;
+      pstr->stop = pstr->raw_stop;
+      pstr->valid_len = 0;
+      pstr->raw_mbs_idx = 0;
+      pstr->valid_raw_len = 0;
+      pstr->offsets_needed = 0;
       pstr->tip_context = ((eflags & REG_NOTBOL) ? CONTEXT_BEGBUF
 			   : CONTEXT_NEWLINE | CONTEXT_BEGBUF);
-      if (!MBS_CASE_ALLOCATED (pstr))
-	pstr->mbs_case = (unsigned char *) pstr->raw_mbs;
-      if (!MBS_ALLOCATED (pstr) && !MBS_CASE_ALLOCATED (pstr))
+      if (!pstr->mbs_allocated)
 	pstr->mbs = (unsigned char *) pstr->raw_mbs;
       offset = idx;
     }
@@ -483,7 +600,13 @@ re_string_reconstruct (pstr, idx, eflags
   if (offset != 0)
     {
       /* Are the characters which are already checked remain?  */
-      if (offset < pstr->valid_len)
+      if (offset < pstr->valid_raw_len
+#ifdef RE_ENABLE_I18N
+	  /* Handling this would enlarge the code too much.
+	     Accept a slowdown in that case.  */
+	  && pstr->offsets_needed == 0
+#endif
+	 )
 	{
 	  /* Yes, move them to the front of the buffer.  */
 	  pstr->tip_context = re_string_context_at (pstr, offset - 1, eflags,
@@ -493,13 +616,11 @@ re_string_reconstruct (pstr, idx, eflags
 	    memmove (pstr->wcs, pstr->wcs + offset,
 		     (pstr->valid_len - offset) * sizeof (wint_t));
 #endif /* RE_ENABLE_I18N */
-	  if (MBS_ALLOCATED (pstr))
+	  if (pstr->mbs_allocated)
 	    memmove (pstr->mbs, pstr->mbs + offset,
 		     pstr->valid_len - offset);
-	  if (MBS_CASE_ALLOCATED (pstr))
-	    memmove (pstr->mbs_case, pstr->mbs_case + offset,
-		     pstr->valid_len - offset);
 	  pstr->valid_len -= offset;
+	  pstr->valid_raw_len -= offset;
 #if DEBUG
 	  assert (pstr->valid_len > 0);
 #endif
@@ -507,16 +628,26 @@ re_string_reconstruct (pstr, idx, eflags
       else
 	{
 	  /* No, skip all characters until IDX.  */
+#ifdef RE_ENABLE_I18N
+	  if (BE (pstr->offsets_needed, 0))
+	    {
+	      pstr->len = pstr->raw_len - idx + offset;
+	      pstr->stop = pstr->raw_stop - idx + offset;
+	      pstr->offsets_needed = 0;
+	    }
+#endif
 	  pstr->valid_len = 0;
+	  pstr->valid_raw_len = 0;
 #ifdef RE_ENABLE_I18N
 	  if (pstr->mb_cur_max > 1)
 	    {
 	      int wcs_idx;
 	      wint_t wc = WEOF;
 
+#ifdef _LIBC
 	      if (pstr->is_utf8)
 		{
-		  const unsigned char *raw, *p, *end;
+		  const unsigned char *raw, *p, *q, *end;
 
 		  /* Special case UTF-8.  Multi-byte chars start with any
 		     byte other than 0x80 - 0xbf.  */
@@ -527,13 +658,22 @@ re_string_reconstruct (pstr, idx, eflags
 		      {
 			mbstate_t cur_state;
 			wchar_t wc2;
-			int mlen;
+			int mlen = raw + pstr->len - p;
+			unsigned char buf[6];
 
+			q = p;
+			if (BE (pstr->trans != NULL, 0))
+			  {
+			    int i = mlen < 6 ? mlen : 6;
+			    while (--i >= 0)
+			      buf[i] = pstr->trans[p[i]];
+			    q = buf;
+			  }
 			/* XXX Don't use mbrtowc, we know which conversion
 			   to use (UTF-8 -> UCS4).  */
 			memset (&cur_state, 0, sizeof (cur_state));
-			mlen = mbrtowc (&wc2, p, raw + pstr->len - p,
-					&cur_state) - (raw + offset - p);
+			mlen = mbrtowc (&wc2, p, mlen, &cur_state)
+			       - (raw + offset - p);
 			if (mlen >= 0)
 			  {
 			    memset (&pstr->cur_state, '\0',
@@ -544,12 +684,17 @@ re_string_reconstruct (pstr, idx, eflags
 			break;
 		      }
 		}
+#endif
 	      if (wc == WEOF)
 		pstr->valid_len = re_string_skip_chars (pstr, idx, &wc) - idx;
-	      for (wcs_idx = 0; wcs_idx < pstr->valid_len; ++wcs_idx)
-		pstr->wcs[wcs_idx] = WEOF;
-	      if (pstr->trans && wc <= 0xff)
-		wc = pstr->trans[wc];
+	      if (BE (pstr->valid_len, 0))
+		{
+		  for (wcs_idx = 0; wcs_idx < pstr->valid_len; ++wcs_idx)
+		    pstr->wcs[wcs_idx] = WEOF;
+		  if (pstr->mbs_allocated)
+		    memset (pstr->mbs, 255, pstr->valid_len);
+		}
+	      pstr->valid_raw_len = pstr->valid_len;
 	      pstr->tip_context = (IS_WIDE_WORD_CHAR (wc) ? CONTEXT_WORD
 				   : ((newline && IS_WIDE_NEWLINE (wc))
 				      ? CONTEXT_NEWLINE : 0));
@@ -565,13 +710,8 @@ re_string_reconstruct (pstr, idx, eflags
 				      ? CONTEXT_NEWLINE : 0));
 	    }
 	}
-      if (!MBS_CASE_ALLOCATED (pstr))
-	{
-	  pstr->mbs_case += offset;
-	  /* In case of !MBS_ALLOCATED && !MBS_CASE_ALLOCATED.  */
-	  if (!MBS_ALLOCATED (pstr))
-	    pstr->mbs += offset;
-	}
+      if (!pstr->mbs_allocated)
+	pstr->mbs += offset;
     }
   pstr->raw_mbs_idx = idx;
   pstr->len -= offset;
@@ -582,7 +722,11 @@ re_string_reconstruct (pstr, idx, eflags
   if (pstr->mb_cur_max > 1)
     {
       if (pstr->icase)
-	build_wcs_upper_buffer (pstr);
+	{
+	  int ret = build_wcs_upper_buffer (pstr);
+	  if (BE (ret != REG_NOERROR, 0))
+	    return ret;
+	}
       else
 	build_wcs_buffer (pstr);
     }
@@ -601,17 +745,95 @@ re_string_reconstruct (pstr, idx, eflags
   return REG_NOERROR;
 }
 
+static unsigned char
+re_string_peek_byte_case (const re_string_t *pstr,
+			  int idx)
+{
+  int ch, off;
+
+  /* Handle the common (easiest) cases first.  */
+  if (BE (!pstr->icase, 1))
+    return re_string_peek_byte (pstr, idx);
+
+#ifdef RE_ENABLE_I18N
+  if (pstr->mb_cur_max > 1
+      && ! re_string_is_single_byte_char (pstr, pstr->cur_idx + idx))
+    return re_string_peek_byte (pstr, idx);
+#endif
+
+  off = pstr->cur_idx + idx;
+#ifdef RE_ENABLE_I18N
+  if (pstr->offsets_needed)
+    off = pstr->offsets[off];
+#endif
+
+  ch = pstr->raw_mbs[pstr->raw_mbs_idx + off];
+  if (pstr->trans)
+    ch = pstr->trans[ch];
+
+#ifdef RE_ENABLE_I18N
+  /* Ensure that e.g. for tr_TR.UTF-8 BACKSLASH DOTLESS SMALL LETTER I
+     this function returns CAPITAL LETTER I instead of first byte of
+     DOTLESS SMALL LETTER I.  The latter would confuse the parser,
+     since peek_byte_case doesn't advance cur_idx in any way.  */
+  if (pstr->offsets_needed && !isascii (ch))
+    return re_string_peek_byte (pstr, idx);
+#endif
+
+  return ch;
+}
+
+static unsigned char
+re_string_fetch_byte_case (re_string_t *pstr)
+{
+  int ch, off;
+
+  if (BE (!pstr->icase, 1))
+    return re_string_fetch_byte (pstr);
+
+#ifdef RE_ENABLE_I18N
+  if (pstr->offsets_needed)
+    {
+      /* For tr_TR.UTF-8 [[:islower:]] there is
+	 [[: CAPITAL LETTER I WITH DOT lower:]] in mbs.  Skip
+	 in that case the whole multi-byte character and return
+	 the original letter.  On the other side, with
+	 [[: DOTLESS SMALL LETTER I return [[:I, as doing
+	 anything else would complicate things too much.  */
+
+      if (!re_string_first_byte (pstr, pstr->cur_idx))
+	return re_string_fetch_byte (pstr);
+
+      off = pstr->offsets[pstr->cur_idx];
+      ch = pstr->raw_mbs[pstr->raw_mbs_idx + off];
+      if (pstr->trans)
+	ch = pstr->trans[ch];
+
+      if (! isascii (ch))
+	return re_string_fetch_byte (pstr);
+
+      re_string_skip_bytes (pstr,
+			    re_string_char_size_at (pstr, pstr->cur_idx));
+      return ch;
+    }
+#endif
+
+  ch = pstr->raw_mbs[pstr->raw_mbs_idx + pstr->cur_idx++];
+  if (pstr->trans)
+    ch = pstr->trans[ch];
+  return ch;
+}
+
 static void
 re_string_destruct (pstr)
      re_string_t *pstr;
 {
 #ifdef RE_ENABLE_I18N
   re_free (pstr->wcs);
+  re_free (pstr->offsets);
 #endif /* RE_ENABLE_I18N  */
-  if (MBS_ALLOCATED (pstr))
+  if (pstr->mbs_allocated)
     re_free (pstr->mbs);
-  if (MBS_CASE_ALLOCATED (pstr))
-    re_free (pstr->mbs_case);
 }
 
 /* Return the context at IDX in INPUT.  */
--- libc/posix/regcomp.c.jj	2003-11-26 09:25:16.000000000 +0100
+++ libc/posix/regcomp.c	2003-11-27 14:22:56.000000000 +0100
@@ -838,7 +838,7 @@ init_dfa (dfa, pat_len)
 
   dfa->mb_cur_max = MB_CUR_MAX;
 #ifdef _LIBC
-  if (dfa->mb_cur_max > 1
+  if (dfa->mb_cur_max == 6
       && strcmp (_NL_CURRENT (LC_CTYPE, _NL_CTYPE_CODESET_NAME), "UTF-8") == 0)
     dfa->is_utf8 = 1;
   dfa->map_notascii = (_NL_CURRENT_WORD (LC_CTYPE, _NL_CTYPE_MAP_TO_NONASCII)
--- libc/posix/regexec.c.jj	2003-11-26 15:40:13.000000000 +0100
+++ libc/posix/regexec.c	2003-11-27 17:36:46.000000000 +0100
@@ -609,6 +609,7 @@ re_search_internal (preg, string, length
   if (BE (err != REG_NOERROR, 0))
     goto free_return;
   input.stop = stop;
+  input.raw_stop = stop;
 
   err = match_ctx_init (&mctx, eflags, &input, dfa->nbackref * 2);
   if (BE (err != REG_NOERROR, 0))
@@ -703,7 +704,7 @@ re_search_internal (preg, string, length
 		     instead.  */
 		  /* If MATCH_FIRST is out of the valid range, reconstruct the
 		     buffers.  */
-		  if (input.raw_mbs_idx + input.valid_len <= match_first
+		  if (input.raw_mbs_idx + input.valid_raw_len <= match_first
 		      || match_first < input.raw_mbs_idx)
 		    {
 		      err = re_string_reconstruct (&input, match_first, eflags,
@@ -807,6 +808,17 @@ re_search_internal (preg, string, length
       for (reg_idx = 0; reg_idx < nmatch; ++reg_idx)
 	if (pmatch[reg_idx].rm_so != -1)
 	  {
+	    if (BE (input.offsets_needed != 0, 0))
+	      {
+		if (pmatch[reg_idx].rm_so == input.valid_len)
+		  pmatch[reg_idx].rm_so += input.valid_raw_len - input.valid_len;
+		else
+		  pmatch[reg_idx].rm_so = input.offsets[pmatch[reg_idx].rm_so];
+		if (pmatch[reg_idx].rm_eo == input.valid_len)
+		  pmatch[reg_idx].rm_eo += input.valid_raw_len - input.valid_len;
+		else
+		  pmatch[reg_idx].rm_eo = input.offsets[pmatch[reg_idx].rm_eo];
+	      }
 	    pmatch[reg_idx].rm_so += match_first;
 	    pmatch[reg_idx].rm_eo += match_first;
 	  }
@@ -3858,7 +3870,11 @@ extend_buffers (mctx)
     {
 #ifdef RE_ENABLE_I18N
       if (pstr->mb_cur_max > 1)
-	build_wcs_upper_buffer (pstr);
+	{
+	  ret = build_wcs_upper_buffer (pstr);
+	  if (BE (ret != REG_NOERROR, 0))
+	    return ret;
+	}
       else
 #endif /* RE_ENABLE_I18N  */
 	build_upper_buffer (pstr);
--- libc/posix/bug-regex18.c.jj	2003-11-12 07:06:30.000000000 +0100
+++ libc/posix/bug-regex18.c	2003-11-27 17:48:25.000000000 +0100
@@ -33,17 +33,23 @@ struct
   int flags, nmatch;
   regmatch_t rm[5];
 } tests[] = {
-  /* \xc4\xb0	LATIN CAPITAL LETTER I WITH DOT ABOVE
-     \xc4\xb1	LATIN SMALL LETTER DOTLESS I  */
-#if 0
-  /* XXX Not used since they fail so far.  */
+  /* \xc4\xb0		LATIN CAPITAL LETTER I WITH DOT ABOVE
+     \xc4\xb1		LATIN SMALL LETTER DOTLESS I
+     \xe2\x80\x94	EM DASH  */
   { "\xc4\xb0I*\xc4\xb1$", "aBi\xc4\xb1\xc4\xb1I", REG_ICASE, 2,
     { { 2, 8 }, { -1, -1 } } },
   { "[\xc4\xb0x]I*\xc4\xb1$", "aBi\xc4\xb1\xc4\xb1I", REG_ICASE, 2,
     { { 2, 8 }, { -1, -1 } } },
   { "[^x]I*\xc4\xb1$", "aBi\xc4\xb1\xc4\xb1I", REG_ICASE, 2,
-    { { 2, 8 }, { -1, -1 } } }
-#endif
+    { { 2, 8 }, { -1, -1 } } },
+  { "([[:alpha:]]i[[:xdigit:]])(\xc4\xb1*)(\xc4\xb0{2})",
+    "\xe2\x80\x94\xc4\xb1\xc4\xb0""fIi\xc4\xb0ii", REG_ICASE | REG_EXTENDED,
+    4, { { 3, 12 }, { 3, 8 }, { 8, 9 }, { 9, 12 } } },
+  { "\xc4\xb1i(i)*()(\\s\xc4\xb0|\\SI)", "SIi\xc4\xb0\xc4\xb0 is",
+    REG_ICASE | REG_EXTENDED, 4, { { 1, 9 }, { 5, 7 }, { 7, 7 }, { 7, 9 } } },
+  { "\xc4\xb1i(i)*()(\\s\xc4\xb0|\\SI)", "\xc4\xb1\xc4\xb0\xc4\xb0iJ\xc4\xb1",
+    REG_ICASE | REG_EXTENDED, 4,
+    { { 0, 10 }, { 6, 7 }, { 7, 7 }, { 7, 10 } } },
 };
 
 int

	Jakub
Follow-Ups:
- Re: [PATCH] Fix regex in tr_TR.UTF-8 locale
  - From: Ulrich Drepper
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]