This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH][BZ #18441] fix sorting multibyte charsets with an improper locale


On Sat, Jun 27, 2015 at 03:42:00PM +0200, Leonhard Holz wrote:
> In BZ #18441 sorting a thai text with the en_US.UTF-8 locale causes a performance
> regression. The cause of the problem is that
> 
> a) en_US.UTF-8 has no informations for thai chars and so always reports a zero
> sort weight which causes the comparison to check the whole string instead of
> breaking up early and
> 
> b) the sequence-to-weight list is partitioned by the first byte of the first
> character (TABLEMB); this generates long lists for multibyte UTF-8 characters as
> they tend to have an equal starting byte (e.g. all thai chars start with E0).
> 
> The approach of the patch is to interprete TABLEMB as a hashtable and find a
> better hash key. My first try was to somehow "fold" a multibyte character into one
> byte but that worsened the overall performance a lot. Enhancing the table to 2
> byte keys works much better while needing a reasonable amount of extra memory.
> 
> The patch vastly improves the performance of languages with multibyte chars (see
> zh_CN and hi_IN below). A side effect is that languages with one-byte chars get a
> bit slower because of the extra check for the first byte while finding the right
> sequence in the sequence list . It cannot be avoided since the hash key is not
> longer equal to the first byte of the sequence. Tests are ok.
> 
No, you can. With this patch you partially fixed a problem that
datastructures used in findidx are terrible.

Just use a trie. You would avoid slow lists by precomputing it so you
could do following:

struct trie
{
  int32_t index[256];
}

int32_t
findidx (unsigned char *p, struct trie *start)
{
  struct trie *t = start;
  while (1)
    {
      unsigned char c = *p++;
      if (t->index[c] >= 0)
        return t->index[c];
      else
        t = start - t->index[c];
    }
}


Also you don't need table access at all for utf8 unless somebody adds
locale with exotic characters. Just use utf8 codepoints. Here is how
calculate them for 1-3 byte sequences for first 65536 indices and for
4-byte use trie.


  if (p[0] < 0x80) {
    return p[0];
  } else if (p[0] < 0xE0) {
    /* 2-byte sequence */
    /* No need to check size due trailing 0.  */
    return (p[0] << 6) + p[1] - 0x3080;
  } else if (code_unit1 < 0xF0) {
    /* 3-byte sequence */
    if (size < 3) goto error;
    return (p[0] << 12) + (p[1] << 6) + p[2] - 0xE2080;
  } 



Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]