iswxxxxx/towxxxer and Unicode
Bruno Haible
haible@ilog.fr
Mon Jul 17 10:46:00 GMT 2000
Hi,
For fixing bug report libc/1251, I've prepared a patch which changes
the behaviour of the iswalpha etc. and towlower etc. functions. I
created an FDCC-set called "unicode" (automatically generated from
UnicodeData.txt) containing only an LC_CTYPE and LC_IDENTIFICATION
category. In all locales I changed
  LC_CTYPE
  copy "i18n"
  END LC_CTYPE
to
  LC_CTYPE
  copy "unicode"
  END LC_CTYPE
It works well, but there are some issues:
1) iswcntrl(0x0000) now returns 1. Why did you change iswcntrl(0x0000)
to return 0 a few weeks ago? The UnicodeData.txt file classifies
0x0000 as a "<control>" character. The only characters which are
neither control nor printable (i.e. no attributes at all) are those
which have not been assigned by Unicode.
2) iswspace(0x00A0) now returns 0. This is mandated by SUSv2
( http://www.opengroup.org/onlinepubs/007908799/xbd/locale.html ) which
says:
- print: Define characters to be classified as printable
characters, including the space character.
- graph: Define characters to be classified as printable
characters, not including the space character.
From this I infer that the difference between print and graph is only
the space character (0x0020). Thus iswgraph(0x00A0) = 1. Furthermore
it says:
- space: no character specified for the keywords upper, lower,
alpha, digit, graph or xdigit can be specified.
Since U+00A0 is in graph, this forces iswspace(0x00A0) to be 0. That
is not a bad thing, because for line-breaking and parsing purposes,
U+00A0 must be treated differently from U+0020.
3) The compiled LC_CTYPE locale file is now 1.2 MB; before it was
around 130 KB. With more than 60 supported locales, the
/usr/lib/locales/ directory will grow to 70 MB. (And I haven't
added the wcwidth information yet!)
4) localedef takes 4 minutes to create such a large LC_CTYPE
file, on a fast machine. Thus "make check" takes 20 minutes, and
"make localedata/install-locales" takes several hours.
I think 3) and 4) are unacceptable. I propose to change the format of
tables used for these properties to 2-stage tables. This way, you can
get away with 11 KB for each of the tolower/toupper tables and
probably around 2 KB on average for each of the attribute tables. Do
you want me to work on this?
Bruno