This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale
- From: Carlos O'Donell <carlos at redhat dot com>
- To: Leonhard Holz <leonhard dot holz at web dot de>, GNU C Library <libc-alpha at sourceware dot org>
- Date: Tue, 29 Mar 2016 12:12:01 -0400
- Subject: Re: [PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale
- Authentication-results: sourceware.org; auth=none
- References: <56D3F8F0 dot 8070401 at web dot de>
On 02/29/2016 02:53 AM, Leonhard Holz wrote:
> In BZ #18441 sorting a thai text with the en_US.UTF-8 locale causes a performance
> regression. The cause of the problem is that
I think we need a V5 with a few more changes as noted below:
- More comments.
- Change signature of utf8index.
- Probably one more define for magic 255 constant.
> a) en_US.UTF-8 has no informations for thai chars and so always reports a zero
> sort weight which causes the comparison to check the whole string instead of
> breaking up early and
OK.
> b) the sequence-to-weight list is partitioned by the first byte of the first
> character (TABLEMB); this generates long lists for multibyte UTF-8 characters as
> they tend to have an equal starting byte (e.g. all thai chars start with E0).
Agreed.
> The approach of the patch is to interprete TABLEMB as a hashtable and find a
> better hash key. My first try was to somehow "fold" a multibyte character into one
> byte but that worsened the overall performance a lot. Enhancing the table to 2
> byte keys works much better while needing a reasonable amount of extra memory.
How much memory on average?
Worst case is all possible 2-byte hash values?
e.g. 256*256*(sizeof(struct element_t)) == ~10MB?
Best case is the cost of the NULL pointer, so 256*256*8 = 0.5MB?
> The patch vastly improves the performance of languages with multibyte chars (see
> zh_CN, hi_IN and ja_JP below). A side effect is that some languages with one-byte chars
> get a bit slower because of the extra check for the first byte while finding the right
> sequence in the sequence list . It cannot be avoided since the hash key is not
> longer equal to the first byte of the sequence. Tests are ok.
Can we use UTF-8-specific knowledge to accelerate the lookup?
For example, you know that E0 is always the start of a 3-byte UTF-8 sequence.
Could you do two checks?
(a) Are we in a 1, 2, 3, 4, 5, or 6 bytes sequence?
(b) If in a 1 byte sequence use a one-byte table.
(c) If in a 2-6 byte sequence use the hash-table?
> filelist#C 1.75% 23,396,200 23,805,700
> filelist#en_US.UTF-8 1.42% 77,186,200 78,285,200
> lorem_ipsum#vi_VN.UTF-8 -1.70% 1,680,740 1,652,110
> lorem_ipsum#ar_SA.UTF-8 -7.71% 2,134,780 1,970,170
> lorem_ipsum#en_US.UTF-8 2.61% 1,685,120 1,729,160
> lorem_ipsum#zh_CN.UTF-8 -88.66% 806,176 91,423
> lorem_ipsum#cs_CZ.UTF-8 -4.89% 2,150,120 2,045,030
> lorem_ipsum#en_GB.UTF-8 -1.47% 2,061,960 2,031,620
> lorem_ipsum#da_DK.UTF-8 3.15% 1,703,710 1,757,390
> lorem_ipsum#pl_PL.UTF-8 0.86% 1,634,890 1,648,870
> lorem_ipsum#fr_FR.UTF-8 -2.06% 2,232,030 2,186,030
> lorem_ipsum#pt_PT.UTF-8 -2.60% 2,238,410 2,180,210
> lorem_ipsum#el_GR.UTF-8 -34.52% 3,413,330 2,235,010
> lorem_ipsum#ru_RU.UTF-8 -9.88% 2,403,370 2,165,950
> lorem_ipsum#iw_IL.UTF-8 -9.56% 2,209,740 1,998,500
> lorem_ipsum#es_ES.UTF-8 4.92% 1,983,470 2,081,050
> lorem_ipsum#hi_IN.UTF-8 -98.88% 220,453,000 2,458,620
> lorem_ipsum#sv_SE.UTF-8 1.79% 1,645,370 1,674,760
> lorem_ipsum#hu_HU.UTF-8 4.86% 3,179,620 3,334,290
> lorem_ipsum#tr_TR.UTF-8 -23.59% 2,473,330 1,889,870
> lorem_ipsum#is_IS.UTF-8 2.49% 1,620,370 1,660,680
> lorem_ipsum#it_IT.UTF-8 -2.67% 2,186,160 2,127,710
> lorem_ipsum#sr_RS.UTF-8 2.70% 1,930,520 1,982,720
> lorem_ipsum#ja_JP.UTF-8 -97.43% 958,411 24,664
> wikipedia-th#en_US.UTF-8 -99.61% 10,511,700,000 40,577,100
>
> The performance numbers and the size of the patch changed due to the removal of the strdiff optimization (#18589) and
> the included thai test. Performance degration for locales in the ASCII plane is still minor. It does increase the speed
> of strcoll for all languages that mostly use multiple byte UTF-8 encoding a lot. Note that it should affect the regex
> performance of these languages too, though there is no benchmark for that.
>> Will this always work? I'm just wondering about a user generated charmap that they
>> call 'utf8', which is the other common alias for instance where the dash is not valid
>> syntax. Probably not since the official name is UTF-8, and that's what you should use.
>
> Well, if it does not work it's just a speed penalty. But there is no problem in adding a check for "utf8".
Could you check to see what value 'code_set_name' uses internally?
Is it always 'UTF-8'? If it is, then the check you have is just fine.
Otherwise we should check for utf8 and UTF-8.
We don't know until you verify the values code_set_name can have.
>
> * benchtests/bench-strcoll.c: Add thai text with en_US.UTF-8 locale.
> * benchtests/strcoll-inputs/wikipedia-th#en_US.UTF-8: New file.
> * locale/categories.def: Define _NL_COLLATE_ENCODING_TYPE.
> * locale/langinfo.h: Add _NL_COLLATE_ENCODING_TYPE to attribute list.
> * locale/localeinfo.h: Add enum collation_encoding_type.
> * locale/C-collate.c: Set _NL_COLLATE_ENCODING_TYPE to 8bit.
> * locale/programs/ld-collate.c (struct locale_collate_t):
> Expand mbheads array from 256 to 16384 entries.
> (collate_finish): Generate 2-byte key for mbheads if UTF-8 locale.
> (collate_output): Output larger table and sequences including first byte.
> (collate_output): Add encoding type info.
> * locale/weight.h (utf8index): New function to calculate 2 byte index.
> (findidx): Use 2-byte index for table if UTF-8 locale.
> * locale/weightwc.h (findidx): Accept encoding parameter, not used.
> * posix/fnmatch_loop.c (FCT): Call findidx with encoding parameter.
> * posix/regcomp.c (build_equiv_class): Likewise.
> * posix/regex_internal.h (re_string_elem_size_at): Likewise.
> * posix/regexec.c (check_node_accept_bytes): Likewise.
> * string/strcoll_l.c (get_next_seq): Likewise.
> (STRCOLL): Call get_next_seq with encoding parameter.
> * string/strxfrm_l.c (find_idx): Call findidx with encoding parameter.
> (STRXFRM): Call find_idx with encoding parameter.
>
>
> diff --git a/benchtests/bench-strcoll.c b/benchtests/bench-strcoll.c
> index 22ae87c..6ce5b2a 100644
> --- a/benchtests/bench-strcoll.c
> +++ b/benchtests/bench-strcoll.c
> @@ -53,7 +53,8 @@ static const char *const input_files[] = {
> "lorem_ipsum#is_IS.UTF-8",
> "lorem_ipsum#it_IT.UTF-8",
> "lorem_ipsum#sr_RS.UTF-8",
> - "lorem_ipsum#ja_JP.UTF-8"
> + "lorem_ipsum#ja_JP.UTF-8",
> + "wikipedia-th#en_US.UTF-8"
OK.
> };
>
> #define TEXTFILE_DELIMITER " \n\r\t.,?!"
> diff --git a/locale/C-collate.c b/locale/C-collate.c
> index 8214ff5..5a9ed6a 100644
> --- a/locale/C-collate.c
> +++ b/locale/C-collate.c
> @@ -144,6 +144,8 @@ const struct __locale_data _nl_C_LC_COLLATE attribute_hidden =
> /* _NL_COLLATE_COLLSEQWC */
> { .string = (const char *) collseqwc },
> /* _NL_COLLATE_CODESET */
> - { .string = _nl_C_codeset }
> + { .string = _nl_C_codeset },
> + /* _NL_COLLATE_ENCODING_TYPE */
> + { .word = __cet_8bit }
This makes locale-archive incompatible again right?
Users have to regenerate the locale-archive after the upgrade?
We need a release note for that under "Packaging Changes"
https://sourceware.org/glibc/wiki/Release/2.24#Packaging_Changes
The release note should mention the binary locale-archive format
has changed and that locale-archive must be removed before upgrading
and then recompiled after upgrade.
> }
> };
> diff --git a/locale/categories.def b/locale/categories.def
> index d8a3ab8..cb57eae 100644
> --- a/locale/categories.def
> +++ b/locale/categories.def
> @@ -58,6 +58,7 @@ DEFINE_CATEGORY
> DEFINE_ELEMENT (_NL_COLLATE_COLLSEQMB, "collate-collseqmb", std, wstring)
> DEFINE_ELEMENT (_NL_COLLATE_COLLSEQWC, "collate-collseqwc", std, wstring)
> DEFINE_ELEMENT (_NL_COLLATE_CODESET, "collate-codeset", std, string)
> + DEFINE_ELEMENT (_NL_COLLATE_ENCODING_TYPE, "collate-encoding-type", std, word)
OK.
> ), NO_POSTLOAD)
>
>
> diff --git a/locale/langinfo.h b/locale/langinfo.h
> index 481e226..0906a6a 100644
> --- a/locale/langinfo.h
> +++ b/locale/langinfo.h
> @@ -255,6 +255,7 @@ enum
> _NL_COLLATE_COLLSEQMB,
> _NL_COLLATE_COLLSEQWC,
> _NL_COLLATE_CODESET,
> + _NL_COLLATE_ENCODING_TYPE,
OK.
> _NL_NUM_LC_COLLATE,
>
> /* LC_CTYPE category: character classification.
> diff --git a/locale/localeinfo.h b/locale/localeinfo.h
> index 5c4e6ef..bd284df 100644
> --- a/locale/localeinfo.h
> +++ b/locale/localeinfo.h
> @@ -110,6 +110,14 @@ enum coll_sort_rule
> sort_mask
> };
>
> +/* Collation encoding type. */
> +enum collation_encoding_type
> +{
> + __cet_other,
> + __cet_8bit,
> + __cet_utf8
> +};
OK.
> +
> /* We can map the types of the entries into a few categories. */
> enum value_type
> {
> diff --git a/locale/programs/ld-collate.c b/locale/programs/ld-collate.c
> index 1e125f6..efaacf6 100644
> --- a/locale/programs/ld-collate.c
> +++ b/locale/programs/ld-collate.c
> @@ -32,6 +32,8 @@
> #include "linereader.h"
> #include "locfile.h"
> #include "elem-hash.h"
> +#include "../localeinfo.h"
> +#include "../locale/weight.h"
OK.
>
> /* Uncomment the following line in the production version. */
> /* #define NDEBUG 1 */
> @@ -243,9 +245,10 @@ struct locale_collate_t
> Therefore we keep all relevant input in a list. */
> struct locale_collate_t *next;
>
> - /* Arrays with heads of the list for each of the leading bytes in
> + /* Arrays with heads of the list for the leading bytes in
> the multibyte sequences. */
> - struct element_t *mbheads[256];
Needs a comment explaining why this is '256*256'.
> + #define MBHEADS_SZ (256 * 256)
> + struct element_t *mbheads[MBHEADS_SZ];
OK.
>
> /* Arrays with heads of the list for each of the leading bytes in
> the multibyte sequences. */
> @@ -1557,6 +1560,7 @@ collate_finish (struct localedef_t *locale, const struct charmap_t *charmap)
> struct section_list *sect;
> int ruleidx;
> int nr_wide_elems = 0;
> + bool is_utf8 = strcmp (charmap->code_set_name, "UTF-8") == 0;
As discussed above you need to verify that code_set_name is always "UTF-8" if it's
a UTF-8 charmap. If it is always "UTF-8" then this check is just fine. I will note
that `locale -a -v` always prints 'UTF-8' for codeset and never 'utf8', but the
printed name is still 'C.utf8' or 'es_EC.utf8' etc because glibc normalizes the name
internally (downcase and strip dashes e.g. pt_PT.iso88591).
I suggest just putting a breakpoint there and verify that it's UTF-8 for a few
locale names.
You could then avoid strcmp:
bool is_utf8 == strlen (charmap->code_set_name) == 5
? (charmap->code_set_name[0] == 'U'
&& charmap->code_set_name[1] == 'T'
&& charmap->code_set_name[2] == 'F'
&& charmap->code_set_name[3] == '-'
&& charmap->code_set_name[4] == '8')
: 0;
Since more than 5 chars means it's not 'UTF-8' e.g. BIG5-HKSCS, or ISO-8859-1.
The only locales with 5 or less are GBK and PT154. So you could skip the strcmp
in all cases, but still need a comparison to disambiguate PT154 vs. UTF-8.
>
> if (collate == NULL)
> {
> @@ -1663,7 +1667,22 @@ collate_finish (struct localedef_t *locale, const struct charmap_t *charmap)
> struct element_t *lastp = NULL;
>
> /* Find the point where to insert in the list. */
> - eptr = &collate->mbheads[((unsigned char *) runp->mbs)[0]];
> + uint16_t index = ((unsigned char *) runp->mbs)[0];
> +
> + /* Special handling of UTF-8: Generate a 2-byte index to mbheads. */
> + if (is_utf8 && index > 0)
> + {
> + index = utf8index((unsigned char *) runp->mbs, runp->nmbs);
> + if (index == 0)
> + {
> + WITH_CUR_LOCALE (error_at_line (0, 0, runp->file, runp->line,
> + _("\
> +malformed UTF-8 character in `%s'"), runp->name););
> + goto dont_insert;
> + }
> + }
> +
> + eptr = &collate->mbheads[index];
OK.
> while (*eptr != NULL)
> {
> if ((*eptr)->nmbs < runp->nmbs)
> @@ -1734,7 +1753,7 @@ symbol `%s' has the same encoding as"), (*eptr)->name);
>
> /* Find out whether any of the `mbheads' entries is unset. In this
> case we use the UNDEFINED entry. */
> - for (i = 1; i < 256; ++i)
> + for (i = 1; i < MBHEADS_SZ; ++i)
OK.
> if (collate->mbheads[i] == NULL)
> {
> need_undefined = 1;
> @@ -2107,7 +2126,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
> const size_t nelems = _NL_ITEM_INDEX (_NL_NUM_LC_COLLATE);
> struct locale_file file;
> size_t ch;
> - int32_t tablemb[256];
> + int32_t tablemb[MBHEADS_SZ];
OK.
> struct obstack weightpool;
> struct obstack extrapool;
> struct obstack indirectpool;
> @@ -2130,6 +2149,8 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
> /* The words have to be handled specially. */
> if (idx == _NL_ITEM_INDEX (_NL_COLLATE_SYMB_HASH_SIZEMB))
> add_locale_uint32 (&file, 0);
> + else if (idx == _NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE))
> + add_locale_uint32 (&file, __cet_other);
OK.
> else
> add_locale_empty (&file);
> }
> @@ -2183,7 +2204,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
> if (collate->undefined.used_in_level != 0)
> output_weight (&weightpool, collate, &collate->undefined);
>
> - for (ch = 1; ch < 256; ++ch)
> + for (ch = 1; ch < MBHEADS_SZ; ++ch)
OK.
> if (collate->mbheads[ch]->mbnext == NULL
> && collate->mbheads[ch]->nmbs <= 1)
> {
> @@ -2208,7 +2229,6 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
> and add only one index into the weight table. We can find the
> consecutive entries since they are also consecutive in the list. */
> struct element_t *runp = collate->mbheads[ch];
> - struct element_t *lastp;
>
> assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));
>
> @@ -2236,7 +2256,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>
> /* Compute how much space we will need. */
> added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1
> - + 2 * (runp->nmbs - 1));
> + + 2 * runp->nmbs);
> assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));
> obstack_make_room (&extrapool, added);
>
> @@ -2259,9 +2279,9 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
> /* Now walk backward from here to the beginning. */
> curp = runp;
>
> - assert (runp->nmbs <= 256);
> - obstack_1grow_fast (&extrapool, curp->nmbs - 1);
> - for (i = 1; i < curp->nmbs; ++i)
> + assert (runp->nmbs <= 255);
Another magic constant. Where does 256 or 255 come from?
These are struct element_t structures with entries chained from there.
Why do we limit them to 255? When might they be larger?
We should use a define here and document why the limit exists.
> + obstack_1grow_fast (&extrapool, curp->nmbs);
> + for (i = 0; i < curp->nmbs; ++i)
OK.
> obstack_1grow_fast (&extrapool, curp->mbs[i]);
>
> /* Now find the end of the consecutive sequence and
> @@ -2281,7 +2301,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>
> /* And add the end byte sequence. Without length this
> time. */
> - for (i = 1; i < curp->nmbs; ++i)
> + for (i = 0; i < curp->nmbs; ++i)
> obstack_1grow_fast (&extrapool, curp->mbs[i]);
> }
> else
> @@ -2295,15 +2315,15 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
> weightidx = output_weight (&weightpool, collate, runp);
>
> added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1
> - + runp->nmbs - 1);
> + + runp->nmbs);
> assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));
> obstack_make_room (&extrapool, added);
>
> obstack_int32_grow_fast (&extrapool, weightidx);
> - assert (runp->nmbs <= 256);
> - obstack_1grow_fast (&extrapool, runp->nmbs - 1);
> + assert (runp->nmbs <= 255);
> + obstack_1grow_fast (&extrapool, runp->nmbs);
>
> - for (i = 1; i < runp->nmbs; ++i)
> + for (i = 0; i < runp->nmbs; ++i)
> obstack_1grow_fast (&extrapool, runp->mbs[i]);
> }
>
> @@ -2312,30 +2332,25 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
> obstack_1grow_fast (&extrapool, '\0');
>
> /* Next entry. */
> - lastp = runp;
> runp = runp->mbnext;
> }
> while (runp != NULL);
>
> assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));
>
> - /* If the final entry in the list is not a single character we
> - add an UNDEFINED entry here. */
> - if (lastp->nmbs != 1)
> - {
> - int added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 + 1);
> - obstack_make_room (&extrapool, added);
> + /* Add an UNDEFINED entry at the end of the list. */
> + int added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 + 1);
> + obstack_make_room (&extrapool, added);
>
> - obstack_int32_grow_fast (&extrapool, 0);
> - /* XXX What rule? We just pick the first. */
> - obstack_1grow_fast (&extrapool, 0);
> - /* Length is zero. */
> - obstack_1grow_fast (&extrapool, 0);
> + obstack_int32_grow_fast (&extrapool, 0);
> + /* XXX What rule? We just pick the first. */
> + obstack_1grow_fast (&extrapool, 0);
> + /* Length is zero. */
> + obstack_1grow_fast (&extrapool, 0);
>
> - /* Add alignment bytes if necessary. */
> - while (!LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)))
> - obstack_1grow_fast (&extrapool, '\0');
> - }
> + /* Add alignment bytes if necessary. */
> + while (!LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)))
> + obstack_1grow_fast (&extrapool, '\0');
> }
>
> /* Add padding to the tables if necessary. */
> @@ -2343,7 +2358,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
> obstack_1grow (&weightpool, 0);
>
> /* Now add the four tables. */
> - add_locale_uint32_array (&file, (const uint32_t *) tablemb, 256);
> + add_locale_uint32_array (&file, (const uint32_t *) tablemb, MBHEADS_SZ);
> add_locale_raw_obstack (&file, &weightpool);
> add_locale_raw_obstack (&file, &extrapool);
> add_locale_raw_obstack (&file, &indirectpool);
> @@ -2493,6 +2508,12 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
> add_locale_raw_data (&file, collate->mbseqorder, 256);
> add_locale_collseq_table (&file, &collate->wcseqorder);
> add_locale_string (&file, charmap->code_set_name);
> + if (strcmp (charmap->code_set_name, "UTF-8") == 0)
Similar discussion as above regarding UTF-8 and optimziation.
> + add_locale_uint32 (&file, __cet_utf8);
> + else if (charmap->mb_cur_max == 1)
> + add_locale_uint32 (&file, __cet_8bit);
> + else
> + add_locale_uint32 (&file, __cet_other);
> write_locale_data (output_path, LC_COLLATE, "LC_COLLATE", &file);
>
> obstack_free (&weightpool, NULL);
> diff --git a/locale/weight.h b/locale/weight.h
> index c99730c..5b4103b 100644
> --- a/locale/weight.h
> +++ b/locale/weight.h
> @@ -19,26 +19,81 @@
> #ifndef _WEIGHT_H_
> #define _WEIGHT_H_ 1
>
> +/* Generate 2 byte code for the next UTF-8 encoded char.
> + Returns zero on UTF-8 encoding errors. */
Plase change the prototype to be:
int utf8index (const unsigned char *cp, size_t len, uint16_t *index);
Return 0 on success.
Return !0 on error.
Store index into *index.
This way we avoid conflating index value 0 and error.
> +static __always_inline uint16_t
> +utf8index (const unsigned char *cp, size_t len)
> +{
> + uint16_t index = cp[0];
> +
> + if (index >= 0x80)
> + {
> + if (index < 0xE0)
> + {
> + if (len < 2)
> + return 0;
> + uint16_t byte2 = cp[1];
> + index = (index << 6) + byte2 - 0x3080;
> + }
> + else if (index < 0xF0)
> + {
> + if (len < 3)
> + return 0;
> + uint16_t byte2 = cp[1];
> + uint16_t byte3 = cp[2];
> + index = (index << 12) + (byte2 << 6) + byte3 - 0xE2080;
> + }
> + else if (index < 0xF8)
> + {
> + if (len < 4)
> + return 0;
> + uint16_t byte2 = cp[1];
> + uint16_t byte3 = cp[2];
> + uint16_t byte4 = cp[3];
> + index = (byte2 << 12) + (byte3 << 6) + byte4 - 0x82080;
I believe this is technically inaccurate since it allows all 4-byte
sequences, when in reality the limit is at U+10FFFF?
You need not fix it, but we should add a comment saying that for the
sake of simpler code we're allowing those 4-byte sequences which are
not normally accepted.
> + }
> + else
> + return 0;
OK.
> + }
> +
> + return index;
> +}
> +
> /* Find index of weight. */
> static inline int32_t __attribute__ ((always_inline))
> -findidx (const int32_t *table,
> +findidx (uint_fast32_t locale_encoding,
> + const int32_t *table,
> const int32_t *indirect,
> const unsigned char *extra,
> const unsigned char **cpp, size_t len)
> {
> - int_fast32_t i = table[*(*cpp)++];
> const unsigned char *cp;
> const unsigned char *usrc;
> + uint16_t index = (*cpp)[0];
> +
> + /* Special handling of UTF-8: Generate a 2-byte index for table. */
> + if (index >= 0x80 && locale_encoding == __cet_utf8)
> + {
> + index = utf8index(*cpp, len);
> + if (index == 0)
> + {
> + *cpp += 1;
> + return 0;
> + }
> + }
OK.
>
> + int_fast32_t i = table[index];
> if (i >= 0)
> - /* This is an index into the weight table. Cool. */
> - return i;
> + {
> + /* This is an index into the weight table. Cool. */
> + *cpp += 1;
> + return i;
> + }
>
> /* Oh well, more than one sequence starting with this byte.
> Search for the correct one. */
> cp = &extra[-i];
> usrc = *cpp;
> - --len;
> while (1)
> {
> size_t nhere;
> @@ -57,8 +112,7 @@ findidx (const int32_t *table,
> /* It is a single character. If it matches we found our
> index. Note that at the end of each list there is an
> entry of length zero which represents the single byte
> - sequence. The first (and here only) byte was tested
> - already. */
> + sequence. */
> size_t cnt;
>
> for (cnt = 0; cnt < nhere && cnt < len; ++cnt)
> @@ -68,7 +122,7 @@ findidx (const int32_t *table,
> if (cnt == nhere)
> {
> /* Found it. */
> - *cpp += nhere;
> + *cpp += nhere > 0 ? nhere : 1;
> return i;
> }
>
> @@ -127,7 +181,7 @@ findidx (const int32_t *table,
> while (++cnt < nhere);
> }
>
> - *cpp += nhere;
> + *cpp += nhere > 0 ? nhere : 1;
> return indirect[-i + offset];
> }
> }
OK.
> diff --git a/locale/weightwc.h b/locale/weightwc.h
> index ab26482..4101dc8 100644
> --- a/locale/weightwc.h
> +++ b/locale/weightwc.h
> @@ -21,7 +21,8 @@
>
> /* Find index of weight. */
> static inline int32_t __attribute__ ((always_inline))
> -findidx (const int32_t *table,
> +findidx (uint_fast32_t encoding,
> + const int32_t *table,
> const int32_t *indirect,
> const wint_t *extra,
> const wint_t **cpp, size_t len)
> diff --git a/posix/fnmatch_loop.c b/posix/fnmatch_loop.c
> index 229904e..07b60fb 100644
> --- a/posix/fnmatch_loop.c
> +++ b/posix/fnmatch_loop.c
> @@ -383,6 +383,8 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
> const int32_t *indirect;
> int32_t idx;
> const UCHAR *cp = (const UCHAR *) &str;
> + uint_fast32_t encoding =
> + _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);
>
> # if WIDE_CHAR_VERSION
> table = (const int32_t *)
> @@ -404,7 +406,7 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
> _NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB);
> # endif
>
> - idx = FINDIDX (table, indirect, extra, &cp, 1);
> + idx = FINDIDX (encoding, table, indirect, extra, &cp, 1);
> if (idx != 0)
> {
> /* We found a table entry. Now see whether the
> @@ -414,7 +416,7 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
> int32_t idx2;
> const UCHAR *np = (const UCHAR *) n;
>
> - idx2 = FINDIDX (table, indirect, extra,
> + idx2 = FINDIDX (encoding, table, indirect, extra,
> &np, string_end - n);
> if (idx2 != 0
> && (idx >> 24) == (idx2 >> 24)
OK.
> diff --git a/posix/regcomp.c b/posix/regcomp.c
> index b6126b7..011ef92 100644
> --- a/posix/regcomp.c
> +++ b/posix/regcomp.c
> @@ -3414,6 +3414,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name)
> uint32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
> if (nrules != 0)
> {
> + uint_fast32_t encoding;
> const int32_t *table, *indirect;
> const unsigned char *weights, *extra, *cp;
> unsigned char char_buf[2];
> @@ -3422,6 +3423,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name)
> size_t len;
> /* Calculate the index for equivalence class. */
> cp = name;
> + encoding = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);
> table = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB);
> weights = (const unsigned char *) _NL_CURRENT (LC_COLLATE,
> _NL_COLLATE_WEIGHTMB);
> @@ -3429,7 +3431,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name)
> _NL_COLLATE_EXTRAMB);
> indirect = (const int32_t *) _NL_CURRENT (LC_COLLATE,
> _NL_COLLATE_INDIRECTMB);
> - idx1 = findidx (table, indirect, extra, &cp, -1);
> + idx1 = findidx (encoding, table, indirect, extra, &cp, -1);
> if (BE (idx1 == 0 || *cp != '\0', 0))
> /* This isn't a valid character. */
> return REG_ECOLLATE;
> @@ -3440,7 +3442,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name)
> {
> char_buf[0] = ch;
> cp = char_buf;
> - idx2 = findidx (table, indirect, extra, &cp, 1);
> + idx2 = findidx (encoding, table, indirect, extra, &cp, 1);
> /*
> idx2 = table[ch];
> */
OK
> diff --git a/posix/regex_internal.h b/posix/regex_internal.h
> index 02e040b..993c7c3 100644
> --- a/posix/regex_internal.h
> +++ b/posix/regex_internal.h
> @@ -743,17 +743,19 @@ re_string_elem_size_at (const re_string_t *pstr, int idx)
> # ifdef _LIBC
> const unsigned char *p, *extra;
> const int32_t *table, *indirect;
> + uint_fast32_t encoding;
> uint_fast32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
>
> if (nrules != 0)
> {
> + encoding = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);
> table = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB);
> extra = (const unsigned char *)
> _NL_CURRENT (LC_COLLATE, _NL_COLLATE_EXTRAMB);
> indirect = (const int32_t *) _NL_CURRENT (LC_COLLATE,
> _NL_COLLATE_INDIRECTMB);
> p = pstr->mbs + idx;
> - findidx (table, indirect, extra, &p, pstr->len - idx);
> + findidx (encoding, table, indirect, extra, &p, pstr->len - idx);
> return p - pstr->mbs - idx;
> }
> else
OK.
> diff --git a/posix/regexec.c b/posix/regexec.c
> index ec46c3a..3d3ad9a 100644
> --- a/posix/regexec.c
> +++ b/posix/regexec.c
> @@ -3843,6 +3843,7 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx,
> if (nrules != 0)
> {
> unsigned int in_collseq = 0;
> + uint_fast32_t encoding;
> const int32_t *table, *indirect;
> const unsigned char *weights, *extra;
> const char *collseqwc;
> @@ -3893,6 +3894,8 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx,
> if (cset->nequiv_classes)
> {
> const unsigned char *cp = pin;
> + encoding =
> + _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);
> table = (const int32_t *)
> _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB);
> weights = (const unsigned char *)
> @@ -3901,7 +3904,8 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx,
> _NL_CURRENT (LC_COLLATE, _NL_COLLATE_EXTRAMB);
> indirect = (const int32_t *)
> _NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB);
> - int32_t idx = findidx (table, indirect, extra, &cp, elem_len);
> + int32_t idx = findidx (encoding, table, indirect, extra, &cp,
> + elem_len);
> if (idx > 0)
> for (i = 0; i < cset->nequiv_classes; ++i)
> {
OK.
> diff --git a/string/strcoll_l.c b/string/strcoll_l.c
> index 4d1e3ab..2c2cab0 100644
> --- a/string/strcoll_l.c
> +++ b/string/strcoll_l.c
> @@ -63,9 +63,9 @@ typedef struct
> /* Get next sequence. Traverse the string as required. */
> static __always_inline void
> get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets,
> - const USTRING_TYPE *weights, const int32_t *table,
> - const USTRING_TYPE *extra, const int32_t *indirect,
> - int pass)
> + const USTRING_TYPE *weights, uint_fast32_t encoding,
> + const int32_t *table, const USTRING_TYPE *extra,
> + const int32_t *indirect, int pass)
> {
> size_t val = seq->val = 0;
> int len = seq->len;
> @@ -109,7 +109,7 @@ get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets,
> us = seq->back_us;
> while (i < backw)
> {
> - int32_t tmp = findidx (table, indirect, extra, &us, -1);
> + int32_t tmp = findidx (encoding, table, indirect, extra, &us, -1);
> idx = tmp & 0xffffff;
> i++;
> }
> @@ -124,7 +124,7 @@ get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets,
>
> while (*us != L('\0'))
> {
> - int32_t tmp = findidx (table, indirect, extra, &us, -1);
> + int32_t tmp = findidx (encoding, table, indirect, extra, &us, -1);
> unsigned char rule = tmp >> 24;
> prev_idx = idx;
> idx = tmp & 0xffffff;
> @@ -253,6 +253,7 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l)
> const USTRING_TYPE *weights;
> const USTRING_TYPE *extra;
> const int32_t *indirect;
> + uint_fast32_t encoding;
>
> if (nrules == 0)
> return STRCMP (s1, s2);
> @@ -271,6 +272,8 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l)
> current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_EXTRA,SUFFIX))].string;
> indirect = (const int32_t *)
> current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_INDIRECT,SUFFIX))].string;
> + encoding = current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
> +
>
> assert (((uintptr_t) table) % __alignof__ (table[0]) == 0);
> assert (((uintptr_t) weights) % __alignof__ (weights[0]) == 0);
> @@ -310,9 +313,9 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l)
>
> while (1)
> {
> - get_next_seq (&seq1, nrules, rulesets, weights, table,
> + get_next_seq (&seq1, nrules, rulesets, weights, encoding, table,
> extra, indirect, pass);
> - get_next_seq (&seq2, nrules, rulesets, weights, table,
> + get_next_seq (&seq2, nrules, rulesets, weights, encoding, table,
> extra, indirect, pass);
> /* See whether any or both strings are empty. */
> if (seq1.len == 0 || seq2.len == 0)
OK.
> diff --git a/string/strxfrm_l.c b/string/strxfrm_l.c
> index 22e24d3..5c89b15 100644
> --- a/string/strxfrm_l.c
> +++ b/string/strxfrm_l.c
> @@ -53,6 +53,7 @@ typedef struct
> uint_fast32_t nrules;
> unsigned char *rulesets;
> USTRING_TYPE *weights;
> + uint_fast32_t encoding;
> int32_t *table;
> USTRING_TYPE *extra;
> int32_t *indirect;
> @@ -100,8 +101,8 @@ static __always_inline size_t
> find_idx (const USTRING_TYPE **us, int32_t *weight_idx,
> unsigned char *rule_idx, const locale_data_t *l_data, const int pass)
> {
> - int32_t tmp = findidx (l_data->table, l_data->indirect, l_data->extra, us,
> - -1);
> + int32_t tmp = findidx (l_data->encoding, l_data->table, l_data->indirect,
> + l_data->extra, us, -1);
> *rule_idx = tmp >> 24;
> int32_t idx = tmp & 0xffffff;
> size_t len = l_data->weights[idx++];
> @@ -693,6 +694,8 @@ STRXFRM (STRING_TYPE *dest, const STRING_TYPE *src, size_t n, __locale_t l)
> /* Get the locale data. */
> l_data.rulesets = (unsigned char *)
> current->values[_NL_ITEM_INDEX (_NL_COLLATE_RULESETS)].string;
> + l_data.encoding =
> + current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
> l_data.table = (int32_t *)
> current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_TABLE,SUFFIX))].string;
> l_data.weights = (USTRING_TYPE *)
> @@ -721,8 +724,8 @@ STRXFRM (STRING_TYPE *dest, const STRING_TYPE *src, size_t n, __locale_t l)
>
> do
> {
> - int32_t tmp = findidx (l_data.table, l_data.indirect, l_data.extra, &cur,
> - -1);
> + int32_t tmp = findidx (l_data.encoding, l_data.table, l_data.indirect,
> + l_data.extra, &cur, -1);
> rulearr[idxmax] = tmp >> 24;
> idxarr[idxmax] = tmp & 0xffffff;
>
OK.
--
Cheers,
Carlos.