This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
[PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale

From: Leonhard Holz <leonhard dot holz at web dot de>
To: GNU C Library <libc-alpha at sourceware dot org>
Cc: Carlos O'Donell <carlos at redhat dot com>
Date: Mon, 29 Feb 2016 08:53:20 +0100
Subject: [PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale
Authentication-results: sourceware.org; auth=none
In BZ #18441, sorting a Thai text with the en_US.UTF-8 locale causes a performance
regression. The problem has two causes:

a) en_US.UTF-8 has no information for Thai chars and so always reports a zero
sort weight, which causes the comparison to check the whole string instead of
stopping early, and

b) the sequence-to-weight list is partitioned by the first byte of the first
character (TABLEMB); this generates long lists for multibyte UTF-8 characters, as
they tend to share the same starting byte (e.g. all Thai chars start with E0).

The approach of the patch is to interpret TABLEMB as a hash table and find a
better hash key. My first try was to somehow "fold" a multibyte character into one
byte, but that worsened the overall performance a lot. Enlarging the table to 2-byte
keys (256 * 256 instead of 256 entries) works much better while needing a
reasonable amount of extra memory.

The patch vastly improves the performance of languages with multibyte chars (see
zh_CN, hi_IN and ja_JP below). A side effect is that some languages with one-byte chars
get a bit slower because of the extra check for the first byte while finding the right
sequence in the sequence list. It cannot be avoided since the hash key is no
longer equal to the first byte of the sequence. Tests are ok.

The columns below are: input file#locale, relative change, result before the
patch, result after the patch (lower is better).

filelist#C			  1.75%		23,396,200	23,805,700
filelist#en_US.UTF-8		  1.42%		77,186,200	78,285,200
lorem_ipsum#vi_VN.UTF-8		 -1.70%		1,680,740	1,652,110
lorem_ipsum#ar_SA.UTF-8		 -7.71%		2,134,780	1,970,170
lorem_ipsum#en_US.UTF-8	 	  2.61%		1,685,120	1,729,160
lorem_ipsum#zh_CN.UTF-8		-88.66%		806,176		91,423
lorem_ipsum#cs_CZ.UTF-8		 -4.89%		2,150,120	2,045,030
lorem_ipsum#en_GB.UTF-8		 -1.47%		2,061,960	2,031,620
lorem_ipsum#da_DK.UTF-8		  3.15%		1,703,710	1,757,390
lorem_ipsum#pl_PL.UTF-8		  0.86%		1,634,890	1,648,870
lorem_ipsum#fr_FR.UTF-8		 -2.06%		2,232,030	2,186,030
lorem_ipsum#pt_PT.UTF-8		 -2.60%		2,238,410	2,180,210
lorem_ipsum#el_GR.UTF-8		-34.52%		3,413,330	2,235,010
lorem_ipsum#ru_RU.UTF-8		 -9.88%		2,403,370	2,165,950
lorem_ipsum#iw_IL.UTF-8		 -9.56%		2,209,740	1,998,500
lorem_ipsum#es_ES.UTF-8	 	  4.92%		1,983,470	2,081,050
lorem_ipsum#hi_IN.UTF-8		-98.88%		220,453,000	2,458,620
lorem_ipsum#sv_SE.UTF-8		  1.79%		1,645,370	1,674,760
lorem_ipsum#hu_HU.UTF-8		  4.86%		3,179,620	3,334,290
lorem_ipsum#tr_TR.UTF-8		-23.59%		2,473,330	1,889,870
lorem_ipsum#is_IS.UTF-8		  2.49%		1,620,370	1,660,680
lorem_ipsum#it_IT.UTF-8		 -2.67%		2,186,160	2,127,710
lorem_ipsum#sr_RS.UTF-8		  2.70%		1,930,520	1,982,720
lorem_ipsum#ja_JP.UTF-8		-97.43%		958,411		24,664
wikipedia-th#en_US.UTF-8	-99.61%		10,511,700,000	40,577,100

The performance numbers and the size of the patch changed due to the removal of the strdiff optimization (#18589) and
the included Thai test. Performance degradation for locales in the ASCII plane is still minor. The patch considerably
speeds up strcoll for all languages that mostly use multibyte UTF-8 encoding. Note that it should affect the regex
performance of these languages too, though there is no benchmark for that.

Regarding Carlos' comments:

>> +  struct element_t *mbheads[256 * 256];
>
> Use #define MBHEADS_SZ or something similar.

Ok.

>> +  bool is_utf8 = strcmp (charmap->code_set_name, "UTF-8") == 0;
>
> OK.
>
> Will this always work? I'm just wondering about a user generated charmap that they
> call 'utf8', which is the other common alias for instance where the dash is not valid
> syntax. Probably not since the official name is UTF-8, and that's what you should use.

Well, if it does not work it's just a speed penalty. But there is no problem in adding a check for "utf8".

>> +	  /* Special handling of UTF-8: Generate a 2-byte index to mbheads.
>> +	     Also check the UTF-8 encoding.  Keep locale/weight.h in sync.  */
>
> Not OK. Can we refactor to avoid keeping the two in sync?

Ok, there is a new function utf8index in locale/weight.h that does the job.

>> @@ -2239,7 +2281,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>>
>>  		/* Compute how much space we will need.  */
>>  		added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1
>> -					  + 2 * (runp->nmbs - 1));
>> +					  + 2 * runp->nmbs);
>
> Doesn't the change to zero indexing make the conditional in the code above this wrong?
>
> e.g.
> 2230             if (runp->mbnext != NULL
> 2231                 && runp->nmbs == runp->mbnext->nmbs
> 2232                 && memcmp (runp->mbs, runp->mbnext->mbs, runp->nmbs - 1) == 0
> 2233                 && (runp->mbs[runp->nmbs - 1]
> 2234                     == runp->mbnext->mbs[runp->nmbs - 1] + 1))

No. runp traverses through the input / locale definition file and this is not affected by the change. What happens here
is a check whether the next Unicode literal in the list has the same byte sequence as the current one except for the
last byte, which must differ by exactly one -> beginning of a sequence.


	* benchtests/bench-strcoll.c: Add thai text with en_US.UTF-8 locale.
	* benchtests/strcoll-inputs/wikipedia-th#en_US.UTF-8: New file.
	* locale/categories.def: Define _NL_COLLATE_ENCODING_TYPE.
	* locale/langinfo.h: Add _NL_COLLATE_ENCODING_TYPE to attribute list.
	* locale/localeinfo.h: Add enum collation_encoding_type.
	* locale/C-collate.c: Set _NL_COLLATE_ENCODING_TYPE to 8bit.
	* locale/programs/ld-collate.c (struct locale_collate_t):
	Expand mbheads array from 256 to 65536 entries.
	(collate_finish): Generate 2-byte key for mbheads if UTF-8 locale.
	(collate_output): Output larger table and sequences including first byte.
	(collate_output): Add encoding type info.
	* locale/weight.h (utf8index): New function to calculate 2 byte index.
	(findidx): Use 2-byte index for table if UTF-8 locale.
	* locale/weightwc.h (findidx): Accept encoding parameter, not used.
	* posix/fnmatch_loop.c (FCT): Call findidx with encoding parameter.
	* posix/regcomp.c (build_equiv_class): Likewise.
	* posix/regex_internal.h (re_string_elem_size_at): Likewise.
	* posix/regexec.c (check_node_accept_bytes): Likewise.
	* string/strcoll_l.c (get_next_seq): Likewise.
	(STRCOLL): Call get_next_seq with encoding parameter.
	* string/strxfrm_l.c (find_idx): Call findidx with encoding parameter.
	(STRXFRM): Call find_idx with encoding parameter.


diff --git a/benchtests/bench-strcoll.c b/benchtests/bench-strcoll.c
index 22ae87c..6ce5b2a 100644
--- a/benchtests/bench-strcoll.c
+++ b/benchtests/bench-strcoll.c
@@ -53,7 +53,8 @@ static const char *const input_files[] = {
   "lorem_ipsum#is_IS.UTF-8",
   "lorem_ipsum#it_IT.UTF-8",
   "lorem_ipsum#sr_RS.UTF-8",
-  "lorem_ipsum#ja_JP.UTF-8"
+  "lorem_ipsum#ja_JP.UTF-8",
+  "wikipedia-th#en_US.UTF-8"
 };

 #define TEXTFILE_DELIMITER " \n\r\t.,?!"
diff --git a/locale/C-collate.c b/locale/C-collate.c
index 8214ff5..5a9ed6a 100644
--- a/locale/C-collate.c
+++ b/locale/C-collate.c
@@ -144,6 +144,8 @@ const struct __locale_data _nl_C_LC_COLLATE attribute_hidden =
     /* _NL_COLLATE_COLLSEQWC */
     { .string = (const char *) collseqwc },
     /* _NL_COLLATE_CODESET */
-    { .string = _nl_C_codeset }
+    { .string = _nl_C_codeset },
+    /* _NL_COLLATE_ENCODING_TYPE */
+    { .word = __cet_8bit }
   }
 };
diff --git a/locale/categories.def b/locale/categories.def
index d8a3ab8..cb57eae 100644
--- a/locale/categories.def
+++ b/locale/categories.def
@@ -58,6 +58,7 @@ DEFINE_CATEGORY
   DEFINE_ELEMENT (_NL_COLLATE_COLLSEQMB,        "collate-collseqmb",        std, wstring)
   DEFINE_ELEMENT (_NL_COLLATE_COLLSEQWC,        "collate-collseqwc",        std, wstring)
   DEFINE_ELEMENT (_NL_COLLATE_CODESET,		"collate-codeset",	    std, string)
+  DEFINE_ELEMENT (_NL_COLLATE_ENCODING_TYPE,	"collate-encoding-type",    std, word)
   ), NO_POSTLOAD)


diff --git a/locale/langinfo.h b/locale/langinfo.h
index 481e226..0906a6a 100644
--- a/locale/langinfo.h
+++ b/locale/langinfo.h
@@ -255,6 +255,7 @@ enum
   _NL_COLLATE_COLLSEQMB,
   _NL_COLLATE_COLLSEQWC,
   _NL_COLLATE_CODESET,
+  _NL_COLLATE_ENCODING_TYPE,
   _NL_NUM_LC_COLLATE,

   /* LC_CTYPE category: character classification.
diff --git a/locale/localeinfo.h b/locale/localeinfo.h
index 5c4e6ef..bd284df 100644
--- a/locale/localeinfo.h
+++ b/locale/localeinfo.h
@@ -110,6 +110,14 @@ enum coll_sort_rule
   sort_mask
 };

+/* Collation encoding type.  */
+enum collation_encoding_type
+{
+  __cet_other,
+  __cet_8bit,
+  __cet_utf8
+};
+
 /* We can map the types of the entries into a few categories.  */
 enum value_type
 {
diff --git a/locale/programs/ld-collate.c b/locale/programs/ld-collate.c
index 1e125f6..efaacf6 100644
--- a/locale/programs/ld-collate.c
+++ b/locale/programs/ld-collate.c
@@ -32,6 +32,8 @@
 #include "linereader.h"
 #include "locfile.h"
 #include "elem-hash.h"
+#include "../localeinfo.h"
+#include "../locale/weight.h"

 /* Uncomment the following line in the production version.  */
 /* #define NDEBUG 1 */
@@ -243,9 +245,10 @@ struct locale_collate_t
      Therefore we keep all relevant input in a list.  */
   struct locale_collate_t *next;

-  /* Arrays with heads of the list for each of the leading bytes in
+  /* Arrays with heads of the list for the leading bytes in
      the multibyte sequences.  */
-  struct element_t *mbheads[256];
+  #define MBHEADS_SZ (256 * 256)
+  struct element_t *mbheads[MBHEADS_SZ];

   /* Arrays with heads of the list for each of the leading bytes in
      the multibyte sequences.  */
@@ -1557,6 +1560,7 @@ collate_finish (struct localedef_t *locale, const struct charmap_t *charmap)
   struct section_list *sect;
   int ruleidx;
   int nr_wide_elems = 0;
+  bool is_utf8 = strcmp (charmap->code_set_name, "UTF-8") == 0;

   if (collate == NULL)
     {
@@ -1663,7 +1667,22 @@ collate_finish (struct localedef_t *locale, const struct charmap_t *charmap)
 	  struct element_t *lastp = NULL;

 	  /* Find the point where to insert in the list.  */
-	  eptr = &collate->mbheads[((unsigned char *) runp->mbs)[0]];
+	  uint16_t index = ((unsigned char *) runp->mbs)[0];
+
+	  /* Special handling of UTF-8: Generate a 2-byte index to mbheads.  */
+	  if (is_utf8 && index > 0)
+	    {
+	      index = utf8index((unsigned char *) runp->mbs, runp->nmbs);
+	      if (index == 0)
+		{
+		  WITH_CUR_LOCALE (error_at_line (0, 0, runp->file, runp->line,
+						  _("\
+malformed UTF-8 character in `%s'"), runp->name););
+		  goto dont_insert;
+		}
+	    }
+
+	  eptr = &collate->mbheads[index];
 	  while (*eptr != NULL)
 	    {
 	      if ((*eptr)->nmbs < runp->nmbs)
@@ -1734,7 +1753,7 @@ symbol `%s' has the same encoding as"), (*eptr)->name);

   /* Find out whether any of the `mbheads' entries is unset.  In this
      case we use the UNDEFINED entry.  */
-  for (i = 1; i < 256; ++i)
+  for (i = 1; i < MBHEADS_SZ; ++i)
     if (collate->mbheads[i] == NULL)
       {
 	need_undefined = 1;
@@ -2107,7 +2126,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
   const size_t nelems = _NL_ITEM_INDEX (_NL_NUM_LC_COLLATE);
   struct locale_file file;
   size_t ch;
-  int32_t tablemb[256];
+  int32_t tablemb[MBHEADS_SZ];
   struct obstack weightpool;
   struct obstack extrapool;
   struct obstack indirectpool;
@@ -2130,6 +2149,8 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
 	  /* The words have to be handled specially.  */
 	  if (idx == _NL_ITEM_INDEX (_NL_COLLATE_SYMB_HASH_SIZEMB))
 	    add_locale_uint32 (&file, 0);
+	  else if (idx == _NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE))
+	    add_locale_uint32 (&file, __cet_other);
 	  else
 	    add_locale_empty (&file);
 	}
@@ -2183,7 +2204,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
   if (collate->undefined.used_in_level != 0)
     output_weight (&weightpool, collate, &collate->undefined);

-  for (ch = 1; ch < 256; ++ch)
+  for (ch = 1; ch < MBHEADS_SZ; ++ch)
     if (collate->mbheads[ch]->mbnext == NULL
 	&& collate->mbheads[ch]->nmbs <= 1)
       {
@@ -2208,7 +2229,6 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
 	   and add only one index into the weight table.  We can find the
 	   consecutive entries since they are also consecutive in the list.  */
 	struct element_t *runp = collate->mbheads[ch];
-	struct element_t *lastp;

 	assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));

@@ -2236,7 +2256,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,

 		/* Compute how much space we will need.  */
 		added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1
-					  + 2 * (runp->nmbs - 1));
+					  + 2 * runp->nmbs);
 		assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));
 		obstack_make_room (&extrapool, added);

@@ -2259,9 +2279,9 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
 		/* Now walk backward from here to the beginning.  */
 		curp = runp;

-		assert (runp->nmbs <= 256);
-		obstack_1grow_fast (&extrapool, curp->nmbs - 1);
-		for (i = 1; i < curp->nmbs; ++i)
+		assert (runp->nmbs <= 255);
+		obstack_1grow_fast (&extrapool, curp->nmbs);
+		for (i = 0; i < curp->nmbs; ++i)
 		  obstack_1grow_fast (&extrapool, curp->mbs[i]);

 		/* Now find the end of the consecutive sequence and
@@ -2281,7 +2301,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,

 		/* And add the end byte sequence.  Without length this
 		   time.  */
-		for (i = 1; i < curp->nmbs; ++i)
+		for (i = 0; i < curp->nmbs; ++i)
 		  obstack_1grow_fast (&extrapool, curp->mbs[i]);
 	      }
 	    else
@@ -2295,15 +2315,15 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
 		weightidx = output_weight (&weightpool, collate, runp);

 		added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1
-					  + runp->nmbs - 1);
+					  + runp->nmbs);
 		assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));
 		obstack_make_room (&extrapool, added);

 		obstack_int32_grow_fast (&extrapool, weightidx);
-		assert (runp->nmbs <= 256);
-		obstack_1grow_fast (&extrapool, runp->nmbs - 1);
+		assert (runp->nmbs <= 255);
+		obstack_1grow_fast (&extrapool, runp->nmbs);

-		for (i = 1; i < runp->nmbs; ++i)
+		for (i = 0; i < runp->nmbs; ++i)
 		  obstack_1grow_fast (&extrapool, runp->mbs[i]);
 	      }

@@ -2312,30 +2332,25 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
 	      obstack_1grow_fast (&extrapool, '\0');

 	    /* Next entry.  */
-	    lastp = runp;
 	    runp = runp->mbnext;
 	  }
 	while (runp != NULL);

 	assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));

-	/* If the final entry in the list is not a single character we
-	   add an UNDEFINED entry here.  */
-	if (lastp->nmbs != 1)
-	  {
-	    int added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 + 1);
-	    obstack_make_room (&extrapool, added);
+	/* Add an UNDEFINED entry at the end of the list.  */
+	int added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 + 1);
+	obstack_make_room (&extrapool, added);

-	    obstack_int32_grow_fast (&extrapool, 0);
-	    /* XXX What rule? We just pick the first.  */
-	    obstack_1grow_fast (&extrapool, 0);
-	    /* Length is zero.  */
-	    obstack_1grow_fast (&extrapool, 0);
+	obstack_int32_grow_fast (&extrapool, 0);
+	/* XXX What rule? We just pick the first.  */
+	obstack_1grow_fast (&extrapool, 0);
+	/* Length is zero.  */
+	obstack_1grow_fast (&extrapool, 0);

-	    /* Add alignment bytes if necessary.  */
-	    while (!LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)))
-	      obstack_1grow_fast (&extrapool, '\0');
-	  }
+	/* Add alignment bytes if necessary.  */
+	while (!LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)))
+	  obstack_1grow_fast (&extrapool, '\0');
       }

   /* Add padding to the tables if necessary.  */
@@ -2343,7 +2358,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
     obstack_1grow (&weightpool, 0);

   /* Now add the four tables.  */
-  add_locale_uint32_array (&file, (const uint32_t *) tablemb, 256);
+  add_locale_uint32_array (&file, (const uint32_t *) tablemb, MBHEADS_SZ);
   add_locale_raw_obstack (&file, &weightpool);
   add_locale_raw_obstack (&file, &extrapool);
   add_locale_raw_obstack (&file, &indirectpool);
@@ -2493,6 +2508,12 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
   add_locale_raw_data (&file, collate->mbseqorder, 256);
   add_locale_collseq_table (&file, &collate->wcseqorder);
   add_locale_string (&file, charmap->code_set_name);
+  if (strcmp (charmap->code_set_name, "UTF-8") == 0)
+    add_locale_uint32 (&file, __cet_utf8);
+  else if (charmap->mb_cur_max == 1)
+    add_locale_uint32 (&file, __cet_8bit);
+  else
+    add_locale_uint32 (&file, __cet_other);
   write_locale_data (output_path, LC_COLLATE, "LC_COLLATE", &file);

   obstack_free (&weightpool, NULL);
diff --git a/locale/weight.h b/locale/weight.h
index c99730c..5b4103b 100644
--- a/locale/weight.h
+++ b/locale/weight.h
@@ -19,26 +19,81 @@
 #ifndef _WEIGHT_H_
 #define _WEIGHT_H_	1

+/* Generate 2 byte code for the next UTF-8 encoded char.
+   Returns zero on UTF-8 encoding errors.  */
+static __always_inline uint16_t
+utf8index (const unsigned char *cp, size_t len)
+{
+  uint16_t index = cp[0];
+
+  if (index >= 0x80)
+    {
+      if (index < 0xE0)
+	{
+	  if (len < 2)
+	    return 0;
+	  uint16_t byte2 = cp[1];
+	  index = (index << 6) + byte2 - 0x3080;
+	}
+      else if (index < 0xF0)
+	{
+	  if (len < 3)
+	    return 0;
+	  uint16_t byte2 = cp[1];
+	  uint16_t byte3 = cp[2];
+	  index = (index << 12) + (byte2 << 6) + byte3 - 0xE2080;
+	}
+      else if (index < 0xF8)
+	{
+	  if (len < 4)
+	    return 0;
+	  uint16_t byte2 = cp[1];
+	  uint16_t byte3 = cp[2];
+	  uint16_t byte4 = cp[3];
+	  index = (byte2 << 12) + (byte3 << 6) + byte4 - 0x82080;
+	}
+      else
+	return 0;
+    }
+
+  return index;
+}
+
 /* Find index of weight.  */
 static inline int32_t __attribute__ ((always_inline))
-findidx (const int32_t *table,
+findidx (uint_fast32_t locale_encoding,
+	 const int32_t *table,
 	 const int32_t *indirect,
 	 const unsigned char *extra,
 	 const unsigned char **cpp, size_t len)
 {
-  int_fast32_t i = table[*(*cpp)++];
   const unsigned char *cp;
   const unsigned char *usrc;
+  uint16_t index = (*cpp)[0];
+
+  /* Special handling of UTF-8: Generate a 2-byte index for table.  */
+  if (index >= 0x80 && locale_encoding == __cet_utf8)
+    {
+      index = utf8index(*cpp, len);
+      if (index == 0)
+	{
+	  *cpp += 1;
+	  return 0;
+	}
+    }

+  int_fast32_t i = table[index];
   if (i >= 0)
-    /* This is an index into the weight table.  Cool.  */
-    return i;
+    {
+      /* This is an index into the weight table.  Cool.  */
+      *cpp += 1;
+      return i;
+    }

   /* Oh well, more than one sequence starting with this byte.
      Search for the correct one.  */
   cp = &extra[-i];
   usrc = *cpp;
-  --len;
   while (1)
     {
       size_t nhere;
@@ -57,8 +112,7 @@ findidx (const int32_t *table,
 	  /* It is a single character.  If it matches we found our
 	     index.  Note that at the end of each list there is an
 	     entry of length zero which represents the single byte
-	     sequence.  The first (and here only) byte was tested
-	     already.  */
+	     sequence.  */
 	  size_t cnt;

 	  for (cnt = 0; cnt < nhere && cnt < len; ++cnt)
@@ -68,7 +122,7 @@ findidx (const int32_t *table,
 	  if (cnt == nhere)
 	    {
 	      /* Found it.  */
-	      *cpp += nhere;
+	      *cpp += nhere > 0 ? nhere : 1;
 	      return i;
 	    }

@@ -127,7 +181,7 @@ findidx (const int32_t *table,
 	      while (++cnt < nhere);
 	    }

-	  *cpp += nhere;
+	  *cpp += nhere > 0 ? nhere : 1;
 	  return indirect[-i + offset];
 	}
     }
diff --git a/locale/weightwc.h b/locale/weightwc.h
index ab26482..4101dc8 100644
--- a/locale/weightwc.h
+++ b/locale/weightwc.h
@@ -21,7 +21,8 @@

 /* Find index of weight.  */
 static inline int32_t __attribute__ ((always_inline))
-findidx (const int32_t *table,
+findidx (uint_fast32_t encoding,
+	 const int32_t *table,
 	 const int32_t *indirect,
 	 const wint_t *extra,
 	 const wint_t **cpp, size_t len)
diff --git a/posix/fnmatch_loop.c b/posix/fnmatch_loop.c
index 229904e..07b60fb 100644
--- a/posix/fnmatch_loop.c
+++ b/posix/fnmatch_loop.c
@@ -383,6 +383,8 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
 			const int32_t *indirect;
 			int32_t idx;
 			const UCHAR *cp = (const UCHAR *) &str;
+			uint_fast32_t encoding =
+			  _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);

 # if WIDE_CHAR_VERSION
 			table = (const int32_t *)
@@ -404,7 +406,7 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
 			  _NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB);
 # endif

-			idx = FINDIDX (table, indirect, extra, &cp, 1);
+			idx = FINDIDX (encoding, table, indirect, extra, &cp, 1);
 			if (idx != 0)
 			  {
 			    /* We found a table entry.  Now see whether the
@@ -414,7 +416,7 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
 			    int32_t idx2;
 			    const UCHAR *np = (const UCHAR *) n;

-			    idx2 = FINDIDX (table, indirect, extra,
+			    idx2 = FINDIDX (encoding, table, indirect, extra,
 					    &np, string_end - n);
 			    if (idx2 != 0
 				&& (idx >> 24) == (idx2 >> 24)
diff --git a/posix/regcomp.c b/posix/regcomp.c
index b6126b7..011ef92 100644
--- a/posix/regcomp.c
+++ b/posix/regcomp.c
@@ -3414,6 +3414,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name)
   uint32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
   if (nrules != 0)
     {
+      uint_fast32_t encoding;
       const int32_t *table, *indirect;
       const unsigned char *weights, *extra, *cp;
       unsigned char char_buf[2];
@@ -3422,6 +3423,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name)
       size_t len;
       /* Calculate the index for equivalence class.  */
       cp = name;
+      encoding = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);
       table = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB);
       weights = (const unsigned char *) _NL_CURRENT (LC_COLLATE,
 					       _NL_COLLATE_WEIGHTMB);
@@ -3429,7 +3431,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name)
 						   _NL_COLLATE_EXTRAMB);
       indirect = (const int32_t *) _NL_CURRENT (LC_COLLATE,
 						_NL_COLLATE_INDIRECTMB);
-      idx1 = findidx (table, indirect, extra, &cp, -1);
+      idx1 = findidx (encoding, table, indirect, extra, &cp, -1);
       if (BE (idx1 == 0 || *cp != '\0', 0))
 	/* This isn't a valid character.  */
 	return REG_ECOLLATE;
@@ -3440,7 +3442,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name)
 	{
 	  char_buf[0] = ch;
 	  cp = char_buf;
-	  idx2 = findidx (table, indirect, extra, &cp, 1);
+	  idx2 = findidx (encoding, table, indirect, extra, &cp, 1);
 /*
 	  idx2 = table[ch];
 */
diff --git a/posix/regex_internal.h b/posix/regex_internal.h
index 02e040b..993c7c3 100644
--- a/posix/regex_internal.h
+++ b/posix/regex_internal.h
@@ -743,17 +743,19 @@ re_string_elem_size_at (const re_string_t *pstr, int idx)
 #  ifdef _LIBC
   const unsigned char *p, *extra;
   const int32_t *table, *indirect;
+  uint_fast32_t encoding;
   uint_fast32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);

   if (nrules != 0)
     {
+      encoding = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);
       table = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB);
       extra = (const unsigned char *)
 	_NL_CURRENT (LC_COLLATE, _NL_COLLATE_EXTRAMB);
       indirect = (const int32_t *) _NL_CURRENT (LC_COLLATE,
 						_NL_COLLATE_INDIRECTMB);
       p = pstr->mbs + idx;
-      findidx (table, indirect, extra, &p, pstr->len - idx);
+      findidx (encoding, table, indirect, extra, &p, pstr->len - idx);
       return p - pstr->mbs - idx;
     }
   else
diff --git a/posix/regexec.c b/posix/regexec.c
index ec46c3a..3d3ad9a 100644
--- a/posix/regexec.c
+++ b/posix/regexec.c
@@ -3843,6 +3843,7 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx,
       if (nrules != 0)
 	{
 	  unsigned int in_collseq = 0;
+	  uint_fast32_t encoding;
 	  const int32_t *table, *indirect;
 	  const unsigned char *weights, *extra;
 	  const char *collseqwc;
@@ -3893,6 +3894,8 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx,
 	  if (cset->nequiv_classes)
 	    {
 	      const unsigned char *cp = pin;
+	      encoding =
+		_NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);
 	      table = (const int32_t *)
 		_NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB);
 	      weights = (const unsigned char *)
@@ -3901,7 +3904,8 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx,
 		_NL_CURRENT (LC_COLLATE, _NL_COLLATE_EXTRAMB);
 	      indirect = (const int32_t *)
 		_NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB);
-	      int32_t idx = findidx (table, indirect, extra, &cp, elem_len);
+	      int32_t idx = findidx (encoding, table, indirect, extra, &cp,
+				     elem_len);
 	      if (idx > 0)
 		for (i = 0; i < cset->nequiv_classes; ++i)
 		  {
diff --git a/string/strcoll_l.c b/string/strcoll_l.c
index 4d1e3ab..2c2cab0 100644
--- a/string/strcoll_l.c
+++ b/string/strcoll_l.c
@@ -63,9 +63,9 @@ typedef struct
 /* Get next sequence.  Traverse the string as required.  */
 static __always_inline void
 get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets,
-	      const USTRING_TYPE *weights, const int32_t *table,
-	      const USTRING_TYPE *extra, const int32_t *indirect,
-	      int pass)
+	      const USTRING_TYPE *weights, uint_fast32_t encoding,
+	      const int32_t *table, const USTRING_TYPE *extra,
+	      const int32_t *indirect, int pass)
 {
   size_t val = seq->val = 0;
   int len = seq->len;
@@ -109,7 +109,7 @@ get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets,
 	      us = seq->back_us;
 	      while (i < backw)
 		{
-		  int32_t tmp = findidx (table, indirect, extra, &us, -1);
+		  int32_t tmp = findidx (encoding, table, indirect, extra, &us, -1);
 		  idx = tmp & 0xffffff;
 		  i++;
 		}
@@ -124,7 +124,7 @@ get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets,

 	  while (*us != L('\0'))
 	    {
-	      int32_t tmp = findidx (table, indirect, extra, &us, -1);
+	      int32_t tmp = findidx (encoding, table, indirect, extra, &us, -1);
 	      unsigned char rule = tmp >> 24;
 	      prev_idx = idx;
 	      idx = tmp & 0xffffff;
@@ -253,6 +253,7 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l)
   const USTRING_TYPE *weights;
   const USTRING_TYPE *extra;
   const int32_t *indirect;
+  uint_fast32_t encoding;

   if (nrules == 0)
     return STRCMP (s1, s2);
@@ -271,6 +272,8 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l)
     current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_EXTRA,SUFFIX))].string;
   indirect = (const int32_t *)
     current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_INDIRECT,SUFFIX))].string;
+  encoding = current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
+

   assert (((uintptr_t) table) % __alignof__ (table[0]) == 0);
   assert (((uintptr_t) weights) % __alignof__ (weights[0]) == 0);
@@ -310,9 +313,9 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l)

       while (1)
 	{
-	  get_next_seq (&seq1, nrules, rulesets, weights, table,
+	  get_next_seq (&seq1, nrules, rulesets, weights, encoding, table,
 				    extra, indirect, pass);
-	  get_next_seq (&seq2, nrules, rulesets, weights, table,
+	  get_next_seq (&seq2, nrules, rulesets, weights, encoding, table,
 				    extra, indirect, pass);
 	  /* See whether any or both strings are empty.  */
 	  if (seq1.len == 0 || seq2.len == 0)
diff --git a/string/strxfrm_l.c b/string/strxfrm_l.c
index 22e24d3..5c89b15 100644
--- a/string/strxfrm_l.c
+++ b/string/strxfrm_l.c
@@ -53,6 +53,7 @@ typedef struct
   uint_fast32_t nrules;
   unsigned char *rulesets;
   USTRING_TYPE *weights;
+  uint_fast32_t encoding;
   int32_t *table;
   USTRING_TYPE *extra;
   int32_t *indirect;
@@ -100,8 +101,8 @@ static __always_inline size_t
 find_idx (const USTRING_TYPE **us, int32_t *weight_idx,
 	  unsigned char *rule_idx, const locale_data_t *l_data, const int pass)
 {
-  int32_t tmp = findidx (l_data->table, l_data->indirect, l_data->extra, us,
-			 -1);
+  int32_t tmp = findidx (l_data->encoding, l_data->table, l_data->indirect,
+			 l_data->extra, us, -1);
   *rule_idx = tmp >> 24;
   int32_t idx = tmp & 0xffffff;
   size_t len = l_data->weights[idx++];
@@ -693,6 +694,8 @@ STRXFRM (STRING_TYPE *dest, const STRING_TYPE *src, size_t n, __locale_t l)
   /* Get the locale data.  */
   l_data.rulesets = (unsigned char *)
     current->values[_NL_ITEM_INDEX (_NL_COLLATE_RULESETS)].string;
+  l_data.encoding =
+    current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
   l_data.table = (int32_t *)
     current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_TABLE,SUFFIX))].string;
   l_data.weights = (USTRING_TYPE *)
@@ -721,8 +724,8 @@ STRXFRM (STRING_TYPE *dest, const STRING_TYPE *src, size_t n, __locale_t l)

   do
     {
-      int32_t tmp = findidx (l_data.table, l_data.indirect, l_data.extra, &cur,
-			     -1);
+      int32_t tmp = findidx (l_data.encoding, l_data.table, l_data.indirect,
+			     l_data.extra, &cur, -1);
       rulearr[idxmax] = tmp >> 24;
       idxarr[idxmax] = tmp & 0xffffff;
Attachment: wikipedia-th#en_US.UTF-8
Description: Text document