This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[PING][PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale


Am 22.05.2016 um 17:52 schrieb Leonhard Holz:
> Hi Carlos,
> 
>>> The approach of the patch is to interprete TABLEMB as a hashtable and find a
>>> better hash key. My first try was to somehow "fold" a multibyte character into one
>>> byte but that worsened the overall performance a lot. Enhancing the table to 2
>>> byte keys works much better while needing a reasonable amount of extra memory.
>>
>> How much memory on average?
>>
>> Worst case is all possible 2-byte hash values?
>>
>> e.g. 256*256*(sizeof(struct element_t)) == ~10MB?
>>
>> Best case is the cost of the NULL pointer, so 256*256*8 = 0.5MB?
> 
> TABLEMB is as an int32_t array, so it's 256*256*4 = 256k.
> 
>>> The patch vastly improves the performance of languages with multibyte chars (see
>>> zh_CN, hi_IN and ja_JP below). A side effect is that some languages with one-byte chars
>>> get a bit slower because of the extra check for the first byte while finding the right
>>> sequence in the sequence list . It cannot be avoided since the hash key is not
>>> longer equal to the first byte of the sequence. Tests are ok.
>>
>> Can we use UTF-8-specific knowledge to accelerate the lookup?
>>
>> For example, you know that E0 is always the start of a 3-byte UTF-8 sequence.
>>
>> Could you do two checks?
>>
>> (a) Are we in a 1, 2, 3, 4, 5, or 6 bytes sequence?
>> (b) If in a 1 byte sequence use a one-byte table.
>> (c) If in a 2-6 byte sequence use the hash-table?
> 
> I tried it, but the switches necessary to make that work in locale/weight.h findidx() slow down the lookup (e.g. 5% for
> en_US.UTF-8 file listing). Because of the paddings of the collation file it does not save space either. So I would not
> take that path.
> 
>>>> Will this always work? I'm just wondering about a user generated charmap that they
>>>> call 'utf8', which is the other common alias for instance where the dash is not valid
>>>> syntax. Probably not since the official name is UTF-8, and that's what you should use.
>>>
>>> Well, if it does not work it's just a speed penalty. But there is no problem in adding a check for "utf8".
>>
>> Could you check to see what value 'code_set_name' uses internally?
>>
>> Is it always 'UTF-8'? If it is, then the check you have is just fine.
>>
>> Otherwise we should check for utf8 and UTF-8.
>>
>> We don't know until you verify the values code_set_name can have.
> 
> The source of code_set_name is the charmap file (localedata/charmaps/UTF-8) and there it is defined as UTF-8.
> 
>>> diff --git a/locale/C-collate.c b/locale/C-collate.c
>>> index 8214ff5..5a9ed6a 100644
>>> --- a/locale/C-collate.c
>>> +++ b/locale/C-collate.c
>>> @@ -144,6 +144,8 @@ const struct __locale_data _nl_C_LC_COLLATE attribute_hidden =
>>>      /* _NL_COLLATE_COLLSEQWC */
>>>      { .string = (const char *) collseqwc },
>>>      /* _NL_COLLATE_CODESET */
>>> -    { .string = _nl_C_codeset }
>>> +    { .string = _nl_C_codeset },
>>> +    /* _NL_COLLATE_ENCODING_TYPE */
>>> +    { .word = __cet_8bit }
>>
>> This makes locale-archive incompatible again right?
>>
>> Users have to regenerate the locale-archive after the upgrade?
>>
>> We need a release note for that under "Packaging Changes"
>> https://sourceware.org/glibc/wiki/Release/2.24#Packaging_Changes
>>
>> The release note should mention the binary locale-archive format
>> has changed and that locale-archive must be removed before upgrading
>> and then recompiled after upgrade.
> 
> Yes. Who should do it?
> 
> Best,
> Leonhard
> 


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]