This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.



Re: iswxxxxx/towxxxer and Unicode


Ulrich Drepper writes:
> Bruno Haible <haible@ilog.fr> writes:

> - there is still the problem to be resolved whether the wide character
>   data in LC_CTYPE and LC_COLLATE should contain information only about
>   the wide characters of the locale's charset or whether the information
>   for all of them should be included.  Currently the latter is done and
>   this unnecessarily blows up the data for all charsets != UTF-8

I think the decision to make wchar_t locale independent in GNU was a
good one, and I don't think it is unnecessary.

If, say, in an ISO-8859-2 locale, you consider <U016F> (LATIN SMALL
LETTER U WITH RING ABOVE) a valid character but its constituent,
<U030A> (COMBINING RING ABOVE) or <U02DA> (RING ABOVE), an invalid
character, then there is no point in having wchar_t == Unicode at
all. You could just as well say that a 'wchar_t' code is always the
same as the 'char' code for 8-bit charsets (which is what FreeBSD
does) - it would be faster and just as problematic for Unicode-aware
applications.

So please don't do half-baked Unicode support. Do it fully.

> There are several things involved:
> 
> - the character names (in collate) must be cleaned up.  If a name is
>   in the Uxxxx form there is no need to store the name in the file.  The
>   value can be determined at runtime.
> 
> - for UTF-8 the tables should be almost densely packed.  I.e., no size
>   improvements are possible unless you compress the table data as well.
>   E.g., by collapsing ranges of characters with the same properties.
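
To illustrate the first point above: a name in the Uxxxx form carries its
own value, so it can be decoded when needed instead of being stored. The
following is only a minimal sketch. The helper name ucs_from_name and the
accepted length of 4 to 8 hex digits are assumptions made for this example,
not anything taken from glibc.

```c
#include <stdint.h>

/* Hypothetical helper, not part of glibc: decode a symbolic name of the
   form "Uxxxx" (4 to 8 hexadecimal digits after the 'U') into its code
   point.  Returns -1 if the name does not have that form.  */
static int64_t
ucs_from_name (const char *name)
{
  int64_t value = 0;
  int ndigits = 0;

  if (name[0] != 'U')
    return -1;
  for (const char *p = name + 1; *p != '\0'; p++)
    {
      int digit;

      if (*p >= '0' && *p <= '9')
        digit = *p - '0';
      else if (*p >= 'A' && *p <= 'F')
        digit = *p - 'A' + 10;
      else if (*p >= 'a' && *p <= 'f')
        digit = *p - 'a' + 10;
      else
        return -1;              /* not a hex digit: an ordinary name */
      if (++ndigits > 8)
        return -1;              /* too long to be a Uxxxx name */
      value = value * 16 + digit;
    }
  return ndigits >= 4 ? value : -1;
}
```

Any name for which this returns -1 would still have to be stored in the
file as before.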

Thanks for the explanations. Here are the sizes of the various parts
of the en_US/LC_CTYPE locale (in hex).

                                             offset   size value
  _NL_CTYPE_CLASS = _NL_ITEM (LC_CTYPE, 0),   00114  00300
  _NL_CTYPE_TOUPPER,                          00414  00600
  _NL_CTYPE_TOLOWER,                          00A14  00600
  _NL_CTYPE_CLASS32,                          01014  3B7E0
  _NL_CTYPE_NAMES,                            3C7F4  3B7E0
  _NL_CTYPE_HASH_SIZE,                        77FD4      4  0BE6
  _NL_CTYPE_HASH_LAYERS,                      77FD8      4  0014
  _NL_CTYPE_CLASS_NAMES,                      77FDC  0006C
  _NL_CTYPE_MAP_NAMES,                        78044  0001C
  _NL_CTYPE_WIDTH,                            78060  0EDF8
  _NL_CTYPE_MB_CUR_MAX,                       86E58      4
  _NL_CTYPE_CODESET_NAME,                     86E5C      C
  _NL_CTYPE_TOUPPER32,                        86E68  3B7E0
  _NL_CTYPE_TOLOWER32,                        C2648  3B7E0
  _NL_CTYPE_INDIGITS_MB_LEN,                  FDE28
  _NL_CTYPE_INDIGITS0_MB,
  ...

20 hash layers is quite a lot. This means that a hash table access for
a Chinese character (see cname-lookup.h) would make 10 widely separated
memory accesses, each likely to be a cache miss. I would much prefer a
3-stage table lookup with 3 memory accesses.
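
For illustration, a 3-stage lookup could look roughly like the sketch
below. The 12/6/6 bit split, the 0x110000 upper bound, the fixed block
pool, and all names (struct table3, table3_init, table3_set, table3_get)
are assumptions made for this example and not the layout glibc uses.
Block 0 at each level is a shared all-zero block, so ranges with no data
cost nothing beyond their level-1 entry.

```c
#include <stdint.h>
#include <string.h>

#define L1_SHIFT   12
#define L2_SHIFT   6
#define L2_SIZE    64
#define L3_SIZE    64
#define WC_LIMIT   0x110000u
#define L1_SIZE    (WC_LIMIT >> L1_SHIFT)
#define MAX_BLOCKS 64           /* small pool, enough for a demo */

struct table3
{
  uint16_t level1[L1_SIZE];              /* -> level-2 block number */
  uint16_t level2[MAX_BLOCKS][L2_SIZE];  /* -> level-3 block number */
  uint8_t level3[MAX_BLOCKS][L3_SIZE];   /* per-character value */
  unsigned int n2, n3;                   /* blocks in use per level */
};

static void
table3_init (struct table3 *t)
{
  memset (t, 0, sizeof *t);
  t->n2 = t->n3 = 1;            /* block 0 is the shared empty block */
}

/* Store VALUE for WC, allocating blocks on demand.
   Returns 0 on success, -1 if WC is out of range or the pool is full.  */
static int
table3_set (struct table3 *t, uint32_t wc, uint8_t value)
{
  if (wc >= WC_LIMIT)
    return -1;
  uint16_t *i2 = &t->level1[wc >> L1_SHIFT];
  if (*i2 == 0)
    {
      if (t->n2 == MAX_BLOCKS)
        return -1;
      *i2 = (uint16_t) t->n2++;
    }
  uint16_t *i3 = &t->level2[*i2][(wc >> L2_SHIFT) & (L2_SIZE - 1)];
  if (*i3 == 0)
    {
      if (t->n3 == MAX_BLOCKS)
        return -1;
      *i3 = (uint16_t) t->n3++;
    }
  t->level3[*i3][wc & (L3_SIZE - 1)] = value;
  return 0;
}

/* Exactly three dependent memory accesses, whatever the character.  */
static uint8_t
table3_get (const struct table3 *t, uint32_t wc)
{
  if (wc >= WC_LIMIT)
    return 0;
  unsigned int i2 = t->level1[wc >> L1_SHIFT];
  unsigned int i3 = t->level2[i2][(wc >> L2_SHIFT) & (L2_SIZE - 1)];
  return t->level3[i3][wc & (L3_SIZE - 1)];
}
```

The number of accesses is constant here, whereas a multi-layer hash
lookup depends on how many layers have to be probed.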

I intend to compress the CLASS32, TOUPPER32, TOLOWER32, WIDTH parts
and to get rid of the NAMES part altogether.

> But keep the old implementation around

OK.

Bruno
