This is the mail archive of the
libc-alpha@sources.redhat.com
mailing list for the glibc project.
Re: iswxxxxx/towxxxer and Unicode
Ulrich Drepper writes:
> Bruno Haible <haible@ilog.fr> writes:
> - there is still the problem to be resolved whether the wide character
> data in LC_CTYPE and LC_COLLATE should contain information about all
> wide characters or whether the information for all of them should be
> included. Currently the latter is done and this unnecessarily blows up
> the data for all charsets != UTF-8.
I think the decision to make wchar_t locale independent in GNU was a
good one, and I don't think it is unnecessary.
If, say, in an ISO-8859-2 locale, you consider <U016F> (LATIN SMALL
LETTER U WITH RING ABOVE) a valid character but its constituent,
<U030A> (COMBINING RING ABOVE) or <U02DA> (RING ABOVE), an invalid
character, then there is no point in having wchar_t == Unicode at
all. You could just as well say that a 'wchar_t' code is always the
same as the 'char' code for 8-bit charsets (which is what FreeBSD
does) - it would be faster and just as problematic for Unicode-aware
applications.
So please don't do half-baked Unicode support. Do it fully.
> There are several things involved:
>
> - the character names (in collate) must be cleaned up. If a name is
> in the Uxxxx form there is no need to store the name in the file. The
> value can be determined at runtime.
>
> - for UTF-8 the tables should be almost densely packed. I.e., no size
> improvements are possible unless you compress the table data as well,
> e.g., by collapsing ranges of characters with the same properties.
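To illustrate what "collapsing ranges of characters with the same properties" could look like, here is a minimal sketch in C. It is not the glibc locale file format: the names `prop_range`, `collapse_ranges` and `lookup_props`, and the use of a plain 32-bit property mask, are assumptions made for this example only.

```c
#include <stddef.h>
#include <stdint.h>

/* One run of consecutive code points that share the same property bits.
   Hypothetical layout, for illustration only. */
struct prop_range {
    uint32_t first;  /* first code point of the run */
    uint32_t last;   /* last code point of the run (inclusive) */
    uint32_t props;  /* property bitmask shared by the whole run */
};

/* Collapse a dense table (one entry per code point 0..n-1) into runs.
   Returns the number of runs written, or SIZE_MAX if 'out' is too small. */
size_t collapse_ranges(const uint32_t *dense, size_t n,
                       struct prop_range *out, size_t max_out)
{
    size_t count = 0;
    size_t i = 0;

    while (i < n) {
        size_t j = i;
        /* Extend the run while the next entry has identical properties. */
        while (j + 1 < n && dense[j + 1] == dense[i])
            j++;
        if (count == max_out)
            return SIZE_MAX;
        out[count].first = (uint32_t) i;
        out[count].last = (uint32_t) j;
        out[count].props = dense[i];
        count++;
        i = j + 1;
    }
    return count;
}

/* Binary search over the sorted runs; returns 0 for uncovered code points. */
uint32_t lookup_props(const struct prop_range *runs, size_t count, uint32_t wc)
{
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (wc < runs[mid].first)
            hi = mid;
        else if (wc > runs[mid].last)
            lo = mid + 1;
        else
            return runs[mid].props;
    }
    return 0;
}
```

The trade-off in this sketch is that a lookup becomes a binary search instead of a single indexed load, so the space saving is paid for with extra memory accesses per character.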
Thanks for the explanations. Here are the sizes of the various parts
of the en_US/LC_CTYPE locale (in hex).
                                            offset  size   value
_NL_CTYPE_CLASS = _NL_ITEM (LC_CTYPE, 0),   00114   00300
_NL_CTYPE_TOUPPER,                          00414   00600
_NL_CTYPE_TOLOWER,                          00A14   00600
_NL_CTYPE_CLASS32,                          01014   3B7E0
_NL_CTYPE_NAMES,                            3C7F4   3B7E0
_NL_CTYPE_HASH_SIZE,                        77FD4   4      0BE6
_NL_CTYPE_HASH_LAYERS,                      77FD8   4      0014
_NL_CTYPE_CLASS_NAMES,                      77FDC   0006C
_NL_CTYPE_MAP_NAMES,                        78044   0001C
_NL_CTYPE_WIDTH,                            78060   0EDF8
_NL_CTYPE_MB_CUR_MAX,                       86E58   4
_NL_CTYPE_CODESET_NAME,                     86E5C   C
_NL_CTYPE_TOUPPER32,                        86E68   3B7E0
_NL_CTYPE_TOLOWER32,                        C2648   3B7E0
_NL_CTYPE_INDIGITS_MB_LEN,                  FDE28
_NL_CTYPE_INDIGITS0_MB,
...
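The large sizes in this listing line up with the two hash parameters. The sketch below works through the arithmetic in C. That the four 3B7E0-sized parts use 4-byte entries and WIDTH uses 1-byte entries is an inference from the numbers, not something stated in the listing; the helper names are made up for this example.

```c
#include <stdint.h>

/* Values copied from the listing (hex). */
#define HASH_SIZE   0x0BE6u  /* 3046 slots per layer */
#define HASH_LAYERS 0x0014u  /* 20 layers */

/* Bytes needed for a table with one entry per (slot, layer) pair.
   'entry_size' is an assumption: 4 for the 32-bit tables, 1 for WIDTH. */
uint32_t hashed_table_size(uint32_t hash_size, uint32_t hash_layers,
                           uint32_t entry_size)
{
    return hash_size * hash_layers * entry_size;
}

/* Offset at which the item following (offset, size) would start. */
uint32_t next_offset(uint32_t offset, uint32_t size)
{
    return offset + size;
}

/* 3046 * 20 = 60920 entries.
   60920 * 4 bytes = 243680 = 0x3B7E0, the size listed for CLASS32,
   NAMES, TOUPPER32 and TOLOWER32.
   60920 * 1 byte = 60920 = 0x0EDF8, the size listed for WIDTH. */
```

For instance, `hashed_table_size(HASH_SIZE, HASH_LAYERS, 4)` gives 0x3B7E0, and `next_offset(0x01014, 0x3B7E0)` gives 0x3C7F4, the offset listed for _NL_CTYPE_NAMES.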
20 hash layers is quite a lot. This means that a hash table access of
a Chinese character (see cname-lookup.h) would make 10 far-distant
memory accesses, each likely to be a cache miss. I much prefer a
3-stage table lookup with 3 memory accesses.
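A 3-stage table lookup of the kind meant here could be sketched as follows. This is an illustration only, not the layout glibc uses: the bit split (SHIFT1, SHIFT2, 64-entry blocks), the struct `table3` and the convention that a stored index of 0 means "no block, use the default" are all assumptions chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical split of a code point into three indices. */
#define SHIFT1 12    /* bits 12 and up select the level-1 entry */
#define SHIFT2 6     /* bits 6..11 select the entry in a level-2 block */
#define MASK2  0x3Fu
#define MASK3  0x3Fu /* bits 0..5 select the entry in a level-3 block */
#define BLOCK  64u   /* entries per level-2 and level-3 block */

struct table3 {
    size_t level1_len;       /* number of level-1 entries */
    const uint16_t *level1;  /* level-2 block number + 1, or 0 if absent */
    const uint16_t *level2;  /* level-3 block number + 1, or 0 if absent */
    const uint32_t *level3;  /* the actual values */
};

/* Returns the stored value, or 0 (the default) if any level is absent.
   At most three dependent memory accesses are made, whatever the value
   of wc. */
uint32_t table3_lookup(const struct table3 *t, uint32_t wc)
{
    uint32_t i1 = wc >> SHIFT1;
    uint16_t b2, b3;

    if (i1 >= t->level1_len)
        return 0;
    b2 = t->level1[i1];                                   /* access 1 */
    if (b2 == 0)
        return 0;
    b3 = t->level2[(size_t) (b2 - 1) * BLOCK
                   + ((wc >> SHIFT2) & MASK2)];           /* access 2 */
    if (b3 == 0)
        return 0;
    return t->level3[(size_t) (b3 - 1) * BLOCK
                     + (wc & MASK3)];                     /* access 3 */
}
```

In this sketch, blocks that contain only default values are simply not stored, and identical blocks could share one copy, so it compresses the data as well as bounding the number of memory accesses.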
I intend to compress the CLASS32, TOUPPER32, TOLOWER32, WIDTH parts
and to get rid of the NAMES part altogether.
> But keep the old implementation around
OK.
Bruno