[Bug localedata/14094] Update locale data to Unicode 7.0.0
maiku.fabian at gmail dot com
sourceware-bugzilla@sourceware.org
Thu Nov 6 11:05:00 GMT 2014
https://sourceware.org/bugzilla/show_bug.cgi?id=14094
--- Comment #21 from Mike FABIAN <maiku.fabian at gmail dot com> ---
When gen-unicode-ctype.c is now used with UnicodeData.txt-7.0.0
to generate LC_CTYPE, far fewer characters from the old i18n file
in glibc are missing from the generated file:
alpha: Missing 246 characters of old ctype in new ctype
blank: Missing 1 characters of old ctype in new ctype
cntrl: Missing 0 characters of old ctype in new ctype
combining: Missing 3 characters of old ctype in new ctype
combining_level3: Missing 5 characters of old ctype in new ctype
digit: Missing 0 characters of old ctype in new ctype
graph: Missing 0 characters of old ctype in new ctype
lower: Missing 20 characters of old ctype in new ctype
print: Missing 0 characters of old ctype in new ctype
punct: Missing 16 characters of old ctype in new ctype
space: Missing 1 characters of old ctype in new ctype
tolower: Missing 0 characters of old ctype in new ctype
totitle: Missing 0 characters of old ctype in new ctype
toupper: Missing 0 characters of old ctype in new ctype
upper: Missing 0 characters of old ctype in new ctype
xdigit: Missing 0 characters of old ctype in new ctype
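The counts above amount to a per-class set difference between the old and the new character classes. A minimal sketch of that comparison follows; the function name, the dict-of-sets layout, and the toy data are assumptions for illustration and are not taken from ctype-compatibility.py:

```python
def missing_report(old_ctype, new_ctype):
    """Return one report line per class, counting the characters that are
    in the old class but absent from the new one.

    Both arguments map a class name to a set of code points (assumed layout).
    """
    lines = []
    for char_class in sorted(old_ctype):
        missing = old_ctype[char_class] - new_ctype.get(char_class, set())
        lines.append(
            f"{char_class}: Missing {len(missing)} characters "
            "of old ctype in new ctype"
        )
    return lines


# Hypothetical toy data, not the real glibc classes.
old = {"alpha": {0x41, 0x901, 0x902}, "digit": {0x30}}
new = {"alpha": {0x41}, "digit": {0x30}}

for line in missing_report(old, new):
    print(line)
```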
For example, gen-unicode-ctype.c does not put U+0901 into
the “alpha” class although it should be there
according to DerivedCoreProperties.txt:
error: 0x901 ँ alpha False: These have general category “Mn” i.e. these are
combining characters (both in UnicodeData.txt 5.0.0 and 7.0.0):
“0901;DEVANAGARI SIGN CANDRABINDU;Mn;0;NSM;;;;;N;;;;;”,
“0902;DEVANAGARI SIGN ANUSVARA;Mn;0;NSM;;;;;N;;;;;”,
“0903;DEVANAGARI SIGN VISARGA;Mc;0;L;;;;;N;;;;;”.
According to DerivedCoreProperties.txt (7.0.0) these are
“Alphabetic”.
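The general category is the third semicolon-separated field of each UnicodeData.txt record. As a small sketch (the helper name is hypothetical), the three records quoted above can be split like this:

```python
def parse_unicode_data_line(line):
    """Split one UnicodeData.txt record into (code point, name, category)."""
    fields = line.strip().split(";")
    return int(fields[0], 16), fields[1], fields[2]


# The three records quoted in this message.
samples = [
    "0901;DEVANAGARI SIGN CANDRABINDU;Mn;0;NSM;;;;;N;;;;;",
    "0902;DEVANAGARI SIGN ANUSVARA;Mn;0;NSM;;;;;N;;;;;",
    "0903;DEVANAGARI SIGN VISARGA;Mc;0;L;;;;;N;;;;;",
]

for record in samples:
    code_point, name, category = parse_unicode_data_line(record)
    print(f"U+{code_point:04X} {category} {name}")
```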
Apparently this was edited manually (and correctly) in the old i18n file
of glibc.
So the automatic generation would get this right
if it used DerivedCoreProperties.txt for “alpha”.
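A rough sketch of how the “Alphabetic” code points could be collected from DerivedCoreProperties.txt is shown below. The function name is hypothetical, and the sample lines are written in the file's general layout ("range ; property # comment") for illustration; the exact text of the real 7.0.0 file may differ. This is not the code of gen-unicode-ctype.c.

```python
def parse_derived_property(lines, wanted="Alphabetic"):
    """Collect all code points that carry the property `wanted`."""
    code_points = set()
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue  # blank line or comment-only line
        range_part, prop = (part.strip() for part in data.split(";", 1))
        if prop != wanted:
            continue
        if ".." in range_part:
            first, last = range_part.split("..")
        else:
            first = last = range_part
        code_points.update(range(int(first, 16), int(last, 16) + 1))
    return code_points


# Illustrative lines in the file's format (assumed, not copied verbatim).
sample = [
    "# Derived Property: Alphabetic",
    "0041..005A    ; Alphabetic # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "0900..0902    ; Alphabetic # Mn   [3] DEVANAGARI SIGN INVERTED CANDRABINDU..DEVANAGARI SIGN ANUSVARA",
    "0903          ; Alphabetic # Mc       DEVANAGARI SIGN VISARGA",
    "0061..007A    ; Lowercase # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z",
]

alpha = parse_derived_property(sample)
print(0x901 in alpha, 0x903 in alpha, 0x61 in alpha)
```

With a set built this way, U+0901 would land in “alpha” without any manual editing of the generated file.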
But some of the above seem to be errors in the old i18n file
of glibc, for example:
error: 0x1090 ႐ punct True: MYANMAR SHAN DIGIT ZERO - MYANMAR SHAN DIGIT NINE.
These are digits, but because ISO C 99 does not allow
them in “digit”, they should go into “alpha”.
They are in “punct” in the old i18n file, but gen-unicode-ctype.c
would put them into “alpha”, which seems better for such digits
according to the comments in gen-unicode-ctype.c.
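The rule described here can be sketched as follows. It uses Python's standard unicodedata module rather than UnicodeData.txt-7.0.0 itself, and the function name is hypothetical; it illustrates the rule as stated in this message, not the actual logic of gen-unicode-ctype.c:

```python
import unicodedata


def digit_or_alpha(code_point):
    """Classify decimal digits: only ASCII 0-9 may be in "digit";
    every other character of general category Nd goes into "alpha".
    Returns None for characters that are not decimal digits."""
    ch = chr(code_point)
    if "0" <= ch <= "9":
        return "digit"
    if unicodedata.category(ch) == "Nd":
        return "alpha"
    return None


# MYANMAR SHAN DIGIT ZERO .. MYANMAR SHAN DIGIT NINE
for cp in range(0x1090, 0x109A):
    print(f"U+{cp:04X} {digit_or_alpha(cp)}")
```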
I went through all these “Missing” characters individually
and looked them up in UnicodeData.txt and DerivedCoreProperties.txt,
checked how each should be classified, and added test cases
for them to the ctype-compatibility.py script.
I’ll attach the full report after using gen-unicode-ctype.c with
UnicodeData.txt-7.0.0 to generate LC_CTYPE.