This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH][BZ #18441] fix sorting multibyte charsets with an improper locale


On Tue, Jun 30, 2015 at 07:52:21AM +0200, Leonhard Holz wrote:
> > Just use a trie. You would avoid slow lists by precomputing it so you
> > could do following:
> > 
> > struct trie
> > {
> >   int32_t index[256];
> > };
> > 
> > int32_t
> > findidx (unsigned char *p, struct trie *start)
> > {
> >   struct trie *t = start;
> >   while (1)
> >     {
> >       unsigned char c = *p++;
> >       if (t->index[c] >= 0)
> >         return t->index[c];
> >       else
> >         t = start - t->index[c];
> >     }
> > }
> > 
> > 
> 
> I agree, but that goes way beyond the scope of the patch. I suggest to use the
> patch given to solve the regression problem and refactor the datastructure as a
> follow-up.
>
That's certainly possible, but the problem is that with the new
data structure this patch becomes redundant, as you delete all the
relevant code and replace it with something better. So it would be
better to just apply the next step and skip this intermediate one.
 
> > Also you don't need table access at all for utf8 unless somebody adds
> > a locale with exotic characters. Just use utf8 codepoints. Here is how
> > to calculate them for 1-3 byte sequences for the first 65536 indices;
> > for 4-byte sequences use a trie.
> > 
> > 
> >   if (p[0] < 0x80) {
> >     return p[0];
> >   } else if (p[0] < 0xE0) {
> >     /* 2-byte sequence */
> >     /* No need to check size due to the trailing 0.  */
> >     return (p[0] << 6) + p[1] - 0x3080;
> >   } else if (code_unit1 < 0xF0) {
> >     /* 3-byte sequence */
> >     if (size < 3) goto error;
> >     return (p[0] << 12) + (p[1] << 6) + p[2] - 0xE2080;
> >   } 
> > 
> > 
> 
> Actually the code does convert the UTF-8 sequences to codepoints and uses them as
> a hash key while ignoring the upper bytes in the four-byte UTF-8 case:
> 
> +      if ((index & 0xE0) == 0xC0)
> +	{
> +	  if (len < 2)
> +	    goto utf8_error;
> +	  uint16_t byte2 = (*cpp)[1];
> +	  index = ((index & 0x1F) << 6) | (byte2 & 0x3F);
> +	}
> +      else if ((index & 0xF0) == 0xE0)
> +	{
> +	  if (len < 3)
> +	    goto utf8_error;
> +	  uint16_t byte2 = (*cpp)[1];
> +	  uint16_t byte3 = (*cpp)[2];
> +	  index = ((index & 0xF) << 12) | ((byte2 & 0x3F) << 6) |
> +		  (byte3 & 0x3F);
> +	}
> +      else if ((index & 0xF8) == 0xF0)
> +	{
> +	  if (len < 4)
> +	    goto utf8_error;
> +	  uint16_t byte3 = (*cpp)[2];
> +	  uint16_t byte4 = (*cpp)[3];
> +	  index = ((index & 0xF) << 12) | ((byte3 & 0x3F) << 6) |
> +		  (byte4 & 0x3F);
> +	}
> +      else
> +	{
> +	utf8_error:
> +	  *cpp += 1;
> +	  return 0;
> +	}
> +    }
> 
> Maybe my conversion method is a bit "academic" but I think it is a reliable one.

I wrote in a sibling thread that this could be a bottleneck, so you
should optimize it; there are a lot of completely unnecessary ands.

Mine should be faster; it's straight from the Wikipedia UTF-8 page with
the checks removed and getchar() replaced. (There is still the typo
code_unit1 that I didn't replace with p[0].)

Both will produce codepoints for valid inputs and a random number in
0-65535 for invalid ones when you store the result in a uint16_t
variable. For performance one could use an int to save the truncation
instruction in the 1-2 byte case.


