This is the mail archive of the mailing list for the glibc project.


Re: [PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale

Hi Carlos,

>> The approach of the patch is to interpret TABLEMB as a hash table and find a
>> better hash key. My first try was to somehow "fold" a multibyte character into one
>> byte, but that worsened the overall performance a lot. Enhancing the table to 2-byte
>> keys works much better while needing a reasonable amount of extra memory.
> How much memory on average?
> Worst case is all possible 2-byte hash values?
> e.g. 256*256*(sizeof(struct element_t)) == ~10MB?
> Best case is the cost of the NULL pointer, so 256*256*8 = 0.5MB?

TABLEMB is an int32_t array, so it's 256*256*4 = 256 KiB.
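To make the sizing concrete, here is a minimal sketch of what a 2-byte hash key and the resulting table size look like. The names `hash_index` and `tablemb_bytes` are illustrative only, not the patch's actual code:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch: combine the first two bytes of a multibyte sequence into a
   16-bit index, so each possible 2-byte prefix gets its own slot.  */
static uint32_t
hash_index (const unsigned char *s)
{
  return ((uint32_t) s[0] << 8) | s[1];
}

/* One int32_t slot per possible 2-byte prefix: 256 * 256 entries.  */
static size_t
tablemb_bytes (void)
{
  return (size_t) 256 * 256 * sizeof (int32_t);
}
```

With 65536 slots of 4 bytes each, the table costs 256 KiB, matching the figure above.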

>> The patch vastly improves the performance of languages with multibyte chars (see
>> zh_CN, hi_IN and ja_JP below). A side effect is that some languages with one-byte chars
>> get a bit slower because of the extra check for the first byte while finding the right
>> sequence in the sequence list. It cannot be avoided since the hash key is no
>> longer equal to the first byte of the sequence. Tests are ok.
> Can we use UTF-8-specific knowledge to accelerate the lookup?
> For example, you know that E0 is always the start of a 3-byte UTF-8 sequence.
> Could you do two checks?
> (a) Are we in a 1, 2, 3, 4, 5, or 6 bytes sequence?
> (b) If in a 1 byte sequence use a one-byte table.
> (c) If in a 2-6 byte sequence use the hash-table?

I tried it, but the switches necessary to make that work in locale/weight.h findidx() slow down the lookup (e.g. 5% for
an en_US.UTF-8 file listing). Because of the padding of the collation file it does not save space either, so I would not
take that path.
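For reference, the check (a) that Carlos describes can be done from the first byte alone, since UTF-8 encodes the sequence length in the leading bits (e.g. 0xE0 always starts a 3-byte sequence). A minimal sketch, not the findidx() code being discussed:

```c
/* Sketch: length of a UTF-8 sequence from its first byte.  Covers the
   old-style 1-6 byte forms mentioned in the thread; returns 0 for a
   continuation byte or an invalid start byte.  */
static int
utf8_sequence_length (unsigned char b)
{
  if (b < 0x80) return 1;  /* 0xxxxxxx: ASCII */
  if (b < 0xC0) return 0;  /* 10xxxxxx: continuation, not a start */
  if (b < 0xE0) return 2;  /* 110xxxxx */
  if (b < 0xF0) return 3;  /* 1110xxxx, e.g. 0xE0 */
  if (b < 0xF8) return 4;  /* 11110xxx */
  if (b < 0xFC) return 5;  /* 111110xx: legacy 5-byte form */
  if (b < 0xFE) return 6;  /* 1111110x: legacy 6-byte form */
  return 0;                /* 0xFE/0xFF never occur in UTF-8 */
}
```

The point above is that wiring such a branch into the hot lookup path costs more than it saves here.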

>>> Will this always work? I'm just wondering about a user-generated charmap that they
>>> call 'utf8', which is the other common alias, for instance where the dash is not valid
>>> syntax. Probably not, since the official name is UTF-8, and that's what you should use.
>> Well, if it does not work it's just a speed penalty. But there is no problem in adding a check for "utf8".
> Could you check to see what value 'code_set_name' uses internally?
> Is it always 'UTF-8'? If it is, then the check you have is just fine.
> Otherwise we should check for utf8 and UTF-8.
> We don't know until you verify the values code_set_name can have.

The source of code_set_name is the charmap file (localedata/charmaps/UTF-8) and there it is defined as UTF-8.
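If the defensive check for both spellings were wanted anyway, it would amount to something like the following sketch (the helper name is hypothetical, not from the patch):

```c
#include <string.h>

/* Sketch: accept both the official charmap name and the common alias
   Carlos mentions.  */
static int
is_utf8_codeset (const char *code_set_name)
{
  return strcmp (code_set_name, "UTF-8") == 0
         || strcmp (code_set_name, "utf8") == 0;
}
```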

>> diff --git a/locale/C-collate.c b/locale/C-collate.c
>> index 8214ff5..5a9ed6a 100644
>> --- a/locale/C-collate.c
>> +++ b/locale/C-collate.c
>> @@ -144,6 +144,8 @@ const struct __locale_data _nl_C_LC_COLLATE attribute_hidden =
>>      { .string = (const char *) collseqwc },
>>      /* _NL_COLLATE_CODESET */
>> -    { .string = _nl_C_codeset }
>> +    { .string = _nl_C_codeset },
>> +    { .word = __cet_8bit }
> This makes locale-archive incompatible again right?
> Users have to regenerate the locale-archive after the upgrade?
> We need a release note for that under "Packaging Changes"
> The release note should mention the binary locale-archive format
> has changed and that locale-archive must be removed before upgrading
> and then recompiled after upgrade.

Yes. Who should do it?

