[PING][PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale
- From: Leonhard Holz <leonhard dot holz at web dot de>
- To: GNU C Library <libc-alpha at sourceware dot org>
- Cc: Carlos O'Donell <carlos at redhat dot com>
- Date: Tue, 12 Jul 2016 08:50:14 +0200
- Subject: [PING][PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale
- References: <5741D5A9.1060606@web.de>
On 22.05.2016 at 17:52, Leonhard Holz wrote:
> Hi Carlos,
>
>>> The approach of the patch is to interpret TABLEMB as a hash table and find a
>>> better hash key. My first try was to somehow "fold" a multibyte character into one
>>> byte, but that worsened the overall performance a lot. Enhancing the table to 2-byte
>>> keys works much better while needing a reasonable amount of extra memory.
>>
>> How much memory on average?
>>
>> Worst case is all possible 2-byte hash values?
>>
>> e.g. 256*256*(sizeof(struct element_t)) == ~10MB?
>>
>> Best case is the cost of the NULL pointer, so 256*256*8 = 0.5MB?
>
> TABLEMB is an int32_t array, so it's 256*256*4 = 256k.
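>
> To make that concrete, here is a minimal sketch of the two-byte bucket index
> (the names are illustrative, not the ones used in the actual patch):
>
>   #include <stdint.h>
>
>   /* Hypothetical bucket index built from the first two bytes of a
>      sequence; a single-byte sequence simply uses 0 as the second byte.  */
>   static inline uint32_t
>   collate_bucket (unsigned char b0, unsigned char b1)
>   {
>     return ((uint32_t) b0 << 8) | b1;   /* 0 .. 256*256 - 1 */
>   }
>
>   /* TABLEMB holds one int32_t per bucket:
>      256 * 256 * sizeof (int32_t) = 262144 bytes = 256 KiB.  */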
>
>>> The patch vastly improves the performance of languages with multibyte chars (see
>>> zh_CN, hi_IN and ja_JP below). A side effect is that some languages with one-byte chars
>>> get a bit slower because of the extra check for the first byte while finding the right
>>> sequence in the sequence list. It cannot be avoided since the hash key is no
>>> longer equal to the first byte of the sequence. Tests are ok.
>>
>> Can we use UTF-8-specific knowledge to accelerate the lookup?
>>
>> For example, you know that E0 is always the start of a 3-byte UTF-8 sequence.
>>
>> Could you do two checks?
>>
>> (a) Are we in a 1-, 2-, 3-, 4-, 5-, or 6-byte sequence?
>> (b) If in a 1-byte sequence, use a one-byte table.
>> (c) If in a 2-6 byte sequence, use the hash table?
>
> I tried it, but the switches necessary to make that work in locale/weight.h findidx() slow down the lookup (e.g. 5% for
> an en_US.UTF-8 file listing). Because of the padding of the collation file it does not save space either, so I would not
> take that path.
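>
> For the record, the kind of leading-byte classification such a split needs
> looks roughly like this (illustrative only, not the actual locale/weight.h
> code); it is this sort of extra branching that caused the slowdown:
>
>   /* Sequence length implied by a UTF-8 leading byte (including the
>      historical 5- and 6-byte forms); 0 means "not a leading byte".  */
>   static inline int
>   utf8_sequence_length (unsigned char b)
>   {
>     if (b < 0x80) return 1;   /* 0xxxxxxx */
>     if (b < 0xc0) return 0;   /* 10xxxxxx: continuation byte */
>     if (b < 0xe0) return 2;   /* 110xxxxx */
>     if (b < 0xf0) return 3;   /* 1110xxxx */
>     if (b < 0xf8) return 4;   /* 11110xxx */
>     if (b < 0xfc) return 5;   /* 111110xx */
>     return 6;                 /* 1111110x */
>   }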
>
>>>> Will this always work? I'm just wondering about a user-generated charmap that they
>>>> call 'utf8', which is the other common alias, used for instance where the dash is not
>>>> valid syntax. Probably not, since the official name is UTF-8, and that's what you should use.
>>>
>>> Well, if it does not work it's just a speed penalty. But there is no problem in adding a check for "utf8".
>>
>> Could you check to see what value 'code_set_name' uses internally?
>>
>> Is it always 'UTF-8'? If it is, then the check you have is just fine.
>>
>> Otherwise we should check for utf8 and UTF-8.
>>
>> We don't know until you verify the values code_set_name can have.
>
> The source of code_set_name is the charmap file (localedata/charmaps/UTF-8) and there it is defined as UTF-8.
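>
> So a plain comparison against the canonical spelling is enough; roughly like
> this (illustrative, not the patch itself), keeping in mind that a missed
> match only costs the fast path:
>
>   #include <string.h>
>
>   /* The charmap fixes the codeset name as "UTF-8"; a user-defined
>      "utf8" charmap would merely skip the optimization.  */
>   static int
>   codeset_is_utf8 (const char *code_set_name)
>   {
>     return strcmp (code_set_name, "UTF-8") == 0;
>   }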
>
>>> diff --git a/locale/C-collate.c b/locale/C-collate.c
>>> index 8214ff5..5a9ed6a 100644
>>> --- a/locale/C-collate.c
>>> +++ b/locale/C-collate.c
>>> @@ -144,6 +144,8 @@ const struct __locale_data _nl_C_LC_COLLATE attribute_hidden =
>>> /* _NL_COLLATE_COLLSEQWC */
>>> { .string = (const char *) collseqwc },
>>> /* _NL_COLLATE_CODESET */
>>> - { .string = _nl_C_codeset }
>>> + { .string = _nl_C_codeset },
>>> + /* _NL_COLLATE_ENCODING_TYPE */
>>> + { .word = __cet_8bit }
>>
>> This makes locale-archive incompatible again, right?
>>
>> Users have to regenerate the locale-archive after the upgrade?
>>
>> We need a release note for that under "Packaging Changes"
>> https://sourceware.org/glibc/wiki/Release/2.24#Packaging_Changes
>>
>> The release note should mention the binary locale-archive format
>> has changed and that locale-archive must be removed before upgrading
>> and then recompiled after upgrade.
>
> Yes. Who should do it?
>
> Best,
> Leonhard
>