This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.
Re: locale differences to Li18nux.org locales
Martin Strassburger wrote:
> Last week I downloaded the package universal locales that was offered at
> www.li18nux.org.
The download location at li18nux doesn't work for me, so I downloaded it
from IBM instead:
http://oss.software.ibm.com/developerworks/opensource/locale?open&l=linuxlst04,t=gr,p=Unicode
> <U00A0> NO-BREAK SPACE as space,
This is wrong. The isspace/iswspace functions are often used for
line-breaking purposes, and the Unicode 3.0 book says on p. 149
"U+0020 and U+00A0 behave differently for line breaking."
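To make the line-breaking concern concrete, here is a minimal sketch of the kind of code that relies on iswspace() to find break opportunities. The helper name count_break_opportunities is hypothetical and not part of glibc; only iswspace() itself is a real library call. If a locale put U+00A0 into class "space", a loop like this would treat every NO-BREAK SPACE as a place where the line may be broken.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper (not a glibc function): count the positions at
   which a naive line wrapper would allow a break, i.e. every wide
   character for which iswspace() is true in the current locale.  */
size_t
count_break_opportunities (const wchar_t *s)
{
  size_t n = 0;

  assert (s != NULL);
  for (; *s != L'\0'; s++)
    if (iswspace ((wint_t) *s))
      n++;
  return n;
}
```

For a string such as L"a b c" this counts two break opportunities. What it returns for a string containing U+00A0 depends entirely on the LC_CTYPE definition in effect, which is the point under discussion.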
> <U064B> ARABIC FATHATAN,
> <U064C> ARABIC DAMMATAN,
> <U064D> ARABIC KASRATAN,
> <U064E> ARABIC FATHA,
> <U064F> ARABIC DAMMA,
> <U0650> ARABIC KASRA,
> <U0651> ARABIC SHADDA,
> <U0652> ARABIC SUKUN as alpha
These are combining characters. What's the purpose of putting them in
category "alpha"? Reasonable programs would apply isalpha() only to
non-combining characters. Otherwise you would need to make all of
U+0300..U+030C alpha as well.
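As an illustration of applying the alpha test only to non-combining characters, here is a small sketch. Both helper names are hypothetical, and the hard-coded ranges (U+0300..U+036F and the Arabic marks U+064B..U+0652 quoted above) are an assumption chosen for this example, not a complete list of combining characters.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: a deliberately incomplete test for combining
   marks, covering only the ranges mentioned in this discussion.  */
int
is_combining_mark (wint_t wc)
{
  return (wc >= 0x0300 && wc <= 0x036F)     /* combining diacritics */
         || (wc >= 0x064B && wc <= 0x0652); /* Arabic FATHATAN..SUKUN */
}

/* Hypothetical helper: count alphabetic base characters, skipping
   combining marks before iswalpha() is ever consulted.  */
size_t
count_base_letters (const wchar_t *s)
{
  size_t n = 0;

  assert (s != NULL);
  for (; *s != L'\0'; s++)
    if (!is_combining_mark ((wint_t) *s) && iswalpha ((wint_t) *s))
      n++;
  return n;
}
```

With this structure the result for a letter followed by a combining mark does not depend on whether the locale classifies the mark as "alpha".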
> <U0390> GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS,
> <U03B0> GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS as lowercase
> alpha
They have no uppercase equivalent. But it might make sense to make
them lowercase nevertheless, as is done with U+00DF. I'll
consider a patch for this.
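The U+00DF situation can be probed with a short sketch like the following. The helper name count_lower_without_upper is hypothetical. It counts characters that the current locale classifies as lowercase but that towupper() maps to themselves. Whether U+00DF, U+0390 or U+03B0 are counted depends on the locale definition in effect, so no particular result is claimed for them here.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count characters that are lowercase according
   to iswlower() but have no distinct uppercase mapping, i.e. for
   which towupper() returns the character unchanged.  */
size_t
count_lower_without_upper (const wchar_t *s)
{
  size_t n = 0;

  assert (s != NULL);
  for (; *s != L'\0'; s++)
    {
      wint_t wc = (wint_t) *s;
      if (iswlower (wc) && towupper (wc) == wc)
        n++;
    }
  return n;
}
```

Plain ASCII letters are never counted, since each lowercase ASCII letter has a distinct uppercase form.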
> <U200E> LEFT-TO-RIGHT MARK,
> <U200F> RIGHT-TO-LEFT MARK as control
Control characters are automatically non-printing according to
POSIX. This means that wcswidth() of any string containing either of
these two characters would return -1, causing lots of problems. Also,
I don't understand why U+200C and U+200D shouldn't then be considered
"control" too - they are in category "Cf" as well.
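The wcswidth() behaviour can be seen with an ordinary control character. This is a minimal sketch: display_width is a hypothetical wrapper, and TAB stands in for any character in class "control". If U+200E and U+200F were classified as control characters, a string containing them would be expected to behave like the second case.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* wcswidth() is an XSI function */
#endif
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Redundant but harmless prototype, in case <wchar.h> was already
   included before the feature-test macro above took effect.  */
int wcswidth (const wchar_t *s, size_t n);

/* Hypothetical wrapper: number of columns needed for the whole
   string, or -1 if it contains any non-printing character.  */
int
display_width (const wchar_t *s)
{
  assert (s != NULL);
  return wcswidth (s, wcslen (s));
}
```

Here display_width (L"abc") yields 3, while display_width (L"a\tb") yields -1 because of the single control character in the middle, so a caller computing column positions gets no usable width for the entire string.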
Bruno