This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.


[Bug localedata/14094] Update locale data to Unicode 7.0.0

From: "maiku.fabian at gmail dot com" <sourceware-bugzilla at sourceware dot org>
To: glibc-bugs at sourceware dot org
Date: Thu, 06 Nov 2014 11:45:32 +0000
Subject: [Bug localedata/14094] Update locale data to Unicode 7.0.0
Auto-submitted: auto-generated
References: <bug-14094-131 at http dot sourceware dot org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=14094

--- Comment #23 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Now Pravin’s approach in the patch attached to comment#15
is to comment out the generation of “upper”, “lower”
and “alpha” from gen-unicode-ctype.c and add another
script gen-unicode-ctype-dcp.py which adds these.

But this is a bit problematic.

1) it does not put digits like

   alpha: Missing: ٠ 0x660 ARABIC-INDIC DIGIT ZERO

into “alpha”, which gen-unicode-ctype.c would have done.
gen-unicode-ctype.c contains the comment

          /* Consider all the non-ASCII digits as alphabetic.
         ISO C 99 forbids us to have them in category "digit",
         but we want iswalnum to return true on them.  */

which sounds reasonable.
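
The rule in that comment can be sketched in Python as follows. This is
only a simplified illustration built on Python's unicodedata module, not
the logic of gen-unicode-ctype.c itself. The function names are chosen
here for the example, and the assumption that "alpha" is just "letters
plus non-ASCII decimal digits" is narrower than what the real generator
accepts.

```python
import unicodedata


def is_digit(ch: str) -> bool:
    # ISO C 99 allows only the ASCII digits in class "digit".
    return "0" <= ch <= "9"


def is_alpha(ch: str) -> bool:
    # Simplified assumption: letters (general category L*) are alphabetic.
    if unicodedata.category(ch).startswith("L"):
        return True
    # Non-ASCII decimal digits (category Nd) are also treated as alphabetic,
    # so that an iswalnum-style check (alpha or digit) accepts them.
    return unicodedata.category(ch) == "Nd" and not is_digit(ch)


def is_alnum(ch: str) -> bool:
    # Mirrors the idea that alnum is the union of alpha and digit.
    return is_alpha(ch) or is_digit(ch)
```

With these definitions, U+0660 ARABIC-INDIC DIGIT ZERO is not in "digit"
but is in "alpha", so the alnum check still accepts it.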

2) it does not put characters like

    lower: Missing: ǅ 0x1c5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON

into “lower”. This is actually title case, not lower case,
but glibc has only “lower” and “upper”, not “title”,
although it has “toupper”, “tolower”, and “totitle”.

gen-unicode-ctype.c puts characters which change when “toupper”
is applied into “lower” and characters which change when “tolower”
is applied into “upper”. Therefore, gen-unicode-ctype.c
puts title case characters like ǅ 0x1c5 into *both* “upper” *and*
“lower”, which seems reasonable if glibc has no “title”.
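
A minimal sketch of that rule, assuming a small hand-written table of
simple case mappings for the three DZ-with-caron characters. The table
SIMPLE_CASE and the helper names are hypothetical and exist only for
this example; the real program reads its mappings from UnicodeData.txt.

```python
# Hypothetical excerpt of simple case mappings:
# code point -> (toupper, tolower); None means "maps to itself".
SIMPLE_CASE = {
    0x01C4: (None, 0x01C6),    # LATIN CAPITAL LETTER DZ WITH CARON
    0x01C5: (0x01C4, 0x01C6),  # title-case form (capital D, small z)
    0x01C6: (0x01C4, None),    # LATIN SMALL LETTER DZ WITH CARON
}


def to_upper(cp: int) -> int:
    upper, _ = SIMPLE_CASE.get(cp, (None, None))
    return cp if upper is None else upper


def to_lower(cp: int) -> int:
    _, lower = SIMPLE_CASE.get(cp, (None, None))
    return cp if lower is None else lower


def is_upper(cp: int) -> bool:
    # A character counts as "upper" if tolower changes it.
    return to_lower(cp) != cp


def is_lower(cp: int) -> bool:
    # A character counts as "lower" if toupper changes it.
    return to_upper(cp) != cp
```

Because the title-case character 0x1c5 changes under both mappings,
is_upper(0x1C5) and is_lower(0x1C5) are both true, while 0x1c4 is only
"upper" and 0x1c6 is only "lower".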

3) it does not put some characters like:

    upper: Missing: ᾈ 0x1f88 GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI

into “upper”. Surprisingly,

“U+1F88 ᾈ GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI”
is *not* listed as “Uppercase” in
http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt .

Although U+1F88 seems to be uppercase according to
http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
because it has a tolower mapping to U+1F80:

    1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;Ll;0;L;1F00 0345;;;;N;;;1F88;;1F88
    1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI;Lt;0;L;1F08 0345;;;;N;;;;1F80;

So this might be a bug in DerivedCoreProperties.txt.

Generating “upper” and “lower” the way gen-unicode-ctype.c does,
i.e. just using UnicodeData.txt and checking whether characters
change when mapped to upper or to lower, does not produce this
error. I think the approach gen-unicode-ctype.c uses for “upper”
and “lower” is fine; it is not necessary to use DerivedCoreProperties.txt
for this.
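
To make this concrete, here is a rough Python sketch that derives
“upper” and “lower” from the two UnicodeData.txt records quoted above,
using only the simple case mapping fields (field 12 is the uppercase
mapping, field 13 the lowercase mapping, counting from 0). The function
names are made up for this example and this is not the code of either
generator.

```python
# The two UnicodeData.txt records quoted above, one record per line.
RECORDS = """\
1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;Ll;0;L;1F00 0345;;;;N;;;1F88;;1F88
1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI;Lt;0;L;1F08 0345;;;;N;;;;1F80;
"""


def parse_case_mappings(text):
    """Return {code point: (toupper, tolower)} from UnicodeData.txt lines."""
    mappings = {}
    for line in text.splitlines():
        fields = line.split(";")
        cp = int(fields[0], 16)
        # An empty mapping field means the character maps to itself.
        upper = int(fields[12], 16) if fields[12] else cp
        lower = int(fields[13], 16) if fields[13] else cp
        mappings[cp] = (upper, lower)
    return mappings


def classify(mappings):
    """Derive the "upper" and "lower" sets purely from the case mappings."""
    upper = {cp for cp, (_, lo) in mappings.items() if lo != cp}
    lower = {cp for cp, (up, _) in mappings.items() if up != cp}
    return upper, lower


upper, lower = classify(parse_case_mappings(RECORDS))
print(sorted(hex(cp) for cp in upper))  # ['0x1f88']
print(sorted(hex(cp) for cp in lower))  # ['0x1f80']
```

U+1F88 lands in “upper” because its lowercase mapping differs from
itself, regardless of what DerivedCoreProperties.txt lists.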

4) *many* characters end up being in “alpha” *and* “punct”

For example:

    error: ⷶ 0x2df6 is alpha and punct

gen-unicode-ctype.c has the comment:

      /* alpha restriction: "No character specified for the keywords cntrl,
     digit, punct or space shall be specified."  */

This restriction is violated because the second script,
gen-unicode-ctype-dcp.py, used in Pravin’s 2-pass approach does not
check whether gen-unicode-ctype.c has already put a character into
“punct” before putting it into “alpha”.

The character “ⷶ U+2df6 COMBINING CYRILLIC LETTER A” is “Alphabetic”
according to DerivedCoreProperties.txt:

    2DE0..2DFF    ; Alphabetic # Mn  [32] COMBINING CYRILLIC LETTER BE..COMBINING CYRILLIC LETTER IOTIFIED BIG YUS

So Pravin’s script rightly puts it into “alpha”.
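
For illustration, a line of this shape can be expanded into its code
points roughly as follows. This is a sketch under the assumption that
entries look like "XXXX ; Property" or "XXXX..YYYY ; Property"; the
regular expression and the function name are invented for this example
and are not taken from gen-unicode-ctype-dcp.py.

```python
import re

# The DerivedCoreProperties.txt line quoted above, on a single line.
LINE = ("2DE0..2DFF    ; Alphabetic # Mn  [32] COMBINING CYRILLIC LETTER "
        "BE..COMBINING CYRILLIC LETTER IOTIFIED BIG YUS")

# A single code point or a "first..last" range, then ";" and the property.
ENTRY = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)")


def parse_property_line(line):
    """Return (property name, set of code points), or None for other lines."""
    match = ENTRY.match(line)
    if match is None:
        return None
    first = int(match.group(1), 16)
    last = int(match.group(2), 16) if match.group(2) else first
    return match.group(3), set(range(first, last + 1))


name, code_points = parse_property_line(LINE)
```

Here name is "Alphabetic" and code_points holds the 32 code points from
0x2de0 to 0x2dff, including 0x2df6.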

But looking at this, it does not seem a good idea to have two independent
programs generating the file in two independent passes.

Verifications like those gen-unicode-ctype.c does:

      /* toupper restriction: "Only characters specified for the keywords
     lower and upper shall be specified.  */
      ...  
      /* tolower restriction: "Only characters specified for the keywords
     lower and upper shall be specified.  */
      ...
      /* alpha restriction: "Characters classified as either upper or lower
     shall automatically belong to this class.  */
      ...
      /* alpha restriction: "No character specified for the keywords cntrl,
     digit, punct or space shall be specified."  */
      ...
      /* space restriction: "No character specified for the keywords upper,
     lower, alpha, digit, graph or xdigit shall be specified."
     upper, lower, alpha already checked above.  */
      ...
      /* cntrl restriction: "No character specified for the keywords upper,
     lower, alpha, digit, punct, graph, print or xdigit shall be
     specified."  upper, lower, alpha already checked above.  */
      ...

can be done much more easily when using a single program.
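
As a rough sketch of what such single-program verification could look
like, the following Python function checks the restrictions listed
above against one set of class tables. The function name, the data
layout (a dict of class name to set of code points) and the error
message format are assumptions made for this example; this is not code
from gen-unicode-ctype.c.

```python
def verify(classes, toupper, tolower):
    """Return a list of error strings for violated class restrictions.

    `classes` maps a class name to a set of code points; `toupper` and
    `tolower` map code points to their case-converted code points.
    """
    errors = []
    cased = classes["upper"] | classes["lower"]

    # toupper/tolower restriction: only upper or lower characters may appear.
    for name, table in (("toupper", toupper), ("tolower", tolower)):
        for cp in sorted(set(table) - cased):
            errors.append(f"{cp:#x} is in {name} but not upper or lower")

    # alpha restriction: upper and lower automatically belong to alpha.
    for cp in sorted(cased - classes["alpha"]):
        errors.append(f"{cp:#x} is upper or lower but not alpha")

    # Mutual exclusions as (class, forbidden classes). For space and cntrl,
    # upper, lower and alpha are already covered by the alpha checks.
    exclusions = [
        ("alpha", ("cntrl", "digit", "punct", "space")),
        ("space", ("digit", "graph", "xdigit")),
        ("cntrl", ("digit", "punct", "graph", "print", "xdigit")),
    ]
    for name, forbidden in exclusions:
        for other in forbidden:
            for cp in sorted(classes[name] & classes.get(other, set())):
                errors.append(f"{cp:#x} is {name} and {other}")
    return errors


# The conflict from point 4: one character in both "alpha" and "punct".
classes = {
    "upper": set(), "lower": set(),
    "alpha": {0x2DF6}, "punct": {0x2DF6},
    "space": set(), "cntrl": set(),
}
print(verify(classes, {}, {}))  # ['0x2df6 is alpha and punct']
```

Because every table lives in one place, each check is a plain set
operation. With two independent passes, the second pass would first
have to re-read what the first one wrote.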

