This is the mail archive of the
glibc-bugs@sourceware.org
mailing list for the glibc project.
[Bug localedata/14094] Update locale data to Unicode 7.0.0
- From: "maiku.fabian at gmail dot com" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sourceware dot org
- Date: Thu, 06 Nov 2014 11:45:32 +0000
- Subject: [Bug localedata/14094] Update locale data to Unicode 7.0.0
- Auto-submitted: auto-generated
- References: <bug-14094-131 at http dot sourceware dot org/bugzilla/>
https://sourceware.org/bugzilla/show_bug.cgi?id=14094
--- Comment #23 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Now Pravin’s approach in the patch attached to comment#15
is to comment out the generation of “upper”, “lower”
and “alpha” from gen-unicode-ctype.c and add another
script gen-unicode-ctype-dcp.py which adds these.
But this is a bit problematic.
1) it does not put digits like
alpha: Missing: ٠ 0x660 ARABIC-INDIC DIGIT ZERO
into “alpha”, which gen-unicode-ctype.c would have done.
gen-unicode-ctype.c contains the comment
/* Consider all the non-ASCII digits as alphabetic.
ISO C 99 forbids us to have them in category "digit",
but we want iswalnum to return true on them. */
which sounds reasonable.
2) it does not put characters like
lower: Missing: ǅ 0x1c5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
into lower. This is actually title case, not lower case,
but glibc does have only “lower” and “upper”, not “title”.
Although it has “toupper”, “tolower”, and “totitle”.
gen-unicode-ctype.c puts characters which change when “toupper”
is applied into “lower” and characters which change when “tolower”
is applied into “upper”. Therefore, gen-unicode-ctype.c
puts title case characters like ǅ 0x1c5 into *both*, “upper” *and*
“lower”. Which seems reasonable if glibc has no “title”.
3) it does not put some characters like:
upper: Missing: ᾈ 0x1f88 GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
into “upper”. Surprisingly,
“U+1F88 ᾈ GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI”
is *not* listed as “Uppercase” in
http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt .
Although U+1F88 seems to be Uppercase according to
http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
because it has a tolower mapping to U+1F80:
1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;Ll;0;L;1F00 0345;;;;N;;;1F88;;1F88
1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI;Lt;0;L;1F08 0345;;;;N;;;;1F80;
So this might be a bug in DerivedCoreProperties.txt.
Generating “upper” and “lower” the way gen-unicode-ctype.c does,
i.e. using only UnicodeData.txt and checking whether characters
change when mapped to upper or to lower case, does not produce this
error. I think the approach gen-unicode-ctype.c uses for “upper”
and “lower” is fine; it is not necessary to use DerivedCoreProperties.txt
for this.
4) *many* characters end up being in “alpha” *and* “punct”.
For example:
error: ⷶ 0x2df6 is alpha and punct
gen-unicode-ctype.c has the comment:
/* alpha restriction: "No character specified for the keywords cntrl,
digit, punct or space shall be specified." */
This restriction is violated because the second script
gen-unicode-ctype-dcp.py used in Pravin’s 2-pass approach does not
check whether gen-unicode-ctype.c has already put a character into
“punct” before putting it into “alpha”.
The character “ⷶ U+2df6 COMBINING CYRILLIC LETTER A” is “Alphabetic”
according to DerivedCoreProperties.txt:
2DE0..2DFF ; Alphabetic # Mn [32] COMBINING CYRILLIC LETTER BE..COMBINING CYRILLIC LETTER IOTIFIED BIG YUS
So Pravin’s script rightly puts it into “alpha”.
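
The rule described in points 2) and 3), putting a character into “lower” when toupper changes it and into “upper” when tolower changes it, can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual code of gen-unicode-ctype.c; the names SAMPLE and classify_case are made up here. The 1F80 and 1F88 records are the ones quoted above; the 01C5 record is reproduced from UnicodeData.txt from memory, so treat it as an assumption.

```python
# Simplified sketch: derive "upper" and "lower" from UnicodeData.txt
# case mappings only. Names here are hypothetical, for illustration.

SAMPLE = """\
01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z HACEK;;01C4;01C6;01C5
1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;Ll;0;L;1F00 0345;;;;N;;;1F88;;1F88
1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI;Lt;0;L;1F08 0345;;;;N;;;;1F80;
"""


def classify_case(unicode_data):
    """Return (upper, lower) sets of code points.

    A code point is "lower" if its simple uppercase mapping differs
    from itself, and "upper" if its simple lowercase mapping differs
    from itself. Title case characters can therefore land in both.
    """
    upper, lower = set(), set()
    for line in unicode_data.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        code_point = int(fields[0], 16)
        # Field 12 is the simple uppercase mapping, field 13 the
        # simple lowercase mapping; an empty field means "unchanged".
        to_upper = int(fields[12], 16) if fields[12] else code_point
        to_lower = int(fields[13], 16) if fields[13] else code_point
        if to_upper != code_point:
            lower.add(code_point)
        if to_lower != code_point:
            upper.add(code_point)
    return upper, lower


upper, lower = classify_case(SAMPLE)
print("upper:", sorted(hex(cp) for cp in upper))
print("lower:", sorted(hex(cp) for cp in lower))
```

With these three records, 0x1c5 ends up in both sets and 0x1f88 ends up in “upper”, without consulting DerivedCoreProperties.txt.
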
But looking at this, it does not seem a good idea to have two independent
programs generating the file in 2 independent passes.
Verifications like the ones gen-unicode-ctype.c does:
/* toupper restriction: "Only characters specified for the keywords
lower and upper shall be specified. */
...
/* tolower restriction: "Only characters specified for the keywords
lower and upper shall be specified. */
...
/* alpha restriction: "Characters classified as either upper or lower
shall automatically belong to this class. */
...
/* alpha restriction: "No character specified for the keywords cntrl,
digit, punct or space shall be specified." */
...
/* space restriction: "No character specified for the keywords upper,
lower, alpha, digit, graph or xdigit shall be specified."
upper, lower, alpha already checked above. */
...
/* cntrl restriction: "No character specified for the keywords upper,
lower, alpha, digit, punct, graph, print or xdigit shall be
specified." upper, lower, alpha already checked above. */
...
can be done much more easily when using a single program.
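
As a rough illustration of why a single program helps: once all classes are held in one place, two of the restrictions quoted above reduce to plain set operations. The following is a minimal sketch, assuming the classes are kept as Python sets of code points in a dict; check_classes and the sample data are hypothetical and not taken from either existing script.

```python
# Sketch of a single-pass consistency check over character classes.
# The dict layout and function name are assumptions for illustration.

def check_classes(classes):
    """Return a list of error strings for violated restrictions.

    Checks two of the restrictions quoted from gen-unicode-ctype.c:
    - characters in upper or lower must also be in alpha
    - alpha must not overlap cntrl, digit, punct or space
    """
    errors = []
    alpha = classes.get("alpha", set())
    for name in ("upper", "lower"):
        for cp in sorted(classes.get(name, set()) - alpha):
            errors.append(f"error: {cp:#x} is {name} but not alpha")
    for name in ("cntrl", "digit", "punct", "space"):
        for cp in sorted(alpha & classes.get(name, set())):
            errors.append(f"error: {cp:#x} is alpha and {name}")
    return errors


# Sample data: U+2DF6 was put into "punct" by one pass and into
# "alpha" by another, which is the situation described in point 4).
classes = {
    "alpha": {0x41, 0x2DF6},
    "upper": {0x41},
    "punct": {0x21, 0x2DF6},
}
errors = check_classes(classes)
for message in errors:
    print(message)
```

The remaining restrictions (toupper, tolower, space, cntrl) could be added to the same function in the same style.
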