This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH][BZ #18441] fix sorting multibyte charsets with an improper locale
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Leonhard Holz <leonhard dot holz at web dot de>
- Cc: libc-alpha at sourceware dot org
- Date: Tue, 30 Jun 2015 11:31:42 +0200
- Subject: Re: [PATCH][BZ #18441] fix sorting multibyte charsets with an improper locale
- Authentication-results: sourceware.org; auth=none
- References: <558EA828 dot 3080106 at web dot de> <20150628085753 dot GA4254 at domone> <55922E95 dot 8050801 at web dot de>
On Tue, Jun 30, 2015 at 07:52:21AM +0200, Leonhard Holz wrote:
> > Just use a trie. You would avoid slow lists by precomputing it so you
> > could do the following:
> >
> > struct trie
> > {
> >   int32_t index[256];
> > };
> >
> > int32_t
> > findidx (unsigned char *p, struct trie *start)
> > {
> >   struct trie *t = start;
> >   while (1)
> >     {
> >       unsigned char c = *p++;
> >       if (t->index[c] >= 0)
> >         return t->index[c];
> >       else
> >         t = start - t->index[c];
> >     }
> > }
> >
> >
>
> I agree, but that goes way beyond the scope of this patch. I suggest using the
> given patch to solve the regression problem and refactoring the data structure
> as a follow-up.
>
That's certainly possible, but the problem is that with the new data
structure this patch becomes redundant, since you delete all the relevant
code and replace it with something better. So it would be better to just
apply the next step and skip this intermediate one.
> > Also you don't need table access at all for UTF-8 unless somebody adds a
> > locale with exotic characters. Just use UTF-8 codepoints. Here is how to
> > calculate them for 1-3 byte sequences for the first 65536 indices; for
> > 4-byte sequences use the trie.
> >
> >
> > if (p[0] < 0x80) {
> >   return p[0];
> > } else if (p[0] < 0xE0) {
> >   /* 2-byte sequence */
> >   /* No need to check size due to trailing 0.  */
> >   return (p[0] << 6) + p[1] - 0x3080;
> > } else if (code_unit1 < 0xF0) {
> >   /* 3-byte sequence */
> >   if (size < 3) goto error;
> >   return (p[0] << 12) + (p[1] << 6) + p[2] - 0xE2080;
> > }
> >
> >
>
> Actually the code does convert the UTF-8 sequences to codepoints and uses them
> as a hash key while ignoring the upper bytes in the four-byte UTF-8 case:
>
> + if ((index & 0xE0) == 0xC0)
> + {
> + if (len < 2)
> + goto utf8_error;
> + uint16_t byte2 = (*cpp)[1];
> + index = ((index & 0x1F) << 6) | (byte2 & 0x3F);
> + }
> + else if ((index & 0xF0) == 0xE0)
> + {
> + if (len < 3)
> + goto utf8_error;
> + uint16_t byte2 = (*cpp)[1];
> + uint16_t byte3 = (*cpp)[2];
> + index = ((index & 0xF) << 12) | ((byte2 & 0x3F) << 6) |
> + (byte3 & 0x3F);
> + }
> + else if ((index & 0xF8) == 0xF0)
> + {
> + if (len < 4)
> + goto utf8_error;
> + uint16_t byte3 = (*cpp)[2];
> + uint16_t byte4 = (*cpp)[3];
> + index = ((index & 0xF) << 12) | ((byte3 & 0x3F) << 6) |
> + (byte4 & 0x3F);
> + }
> + else
> + {
> + utf8_error:
> + *cpp += 1;
> + return 0;
> + }
> + }
>
> Maybe my conversion method is a bit "academic" but I think it is a reliable one.
I wrote in the sibling thread that this could be a bottleneck, so you
should optimize it; there are a lot of completely unnecessary ands.
Mine should be faster; it's straight from the Wikipedia UTF-8 page with
the checks removed and getchar() replaced. (There is still the typo
code_unit1 that I didn't replace with p[0].)

Both will produce codepoints for valid inputs and a random number in
0-65535 for invalid ones when you store the result in a uint16_t
variable. For performance one could use int to save a truncation
instruction in the 1-2 byte case.