14094 – Update locale data to Unicode 7.0.0

Summary: Update locale data to Unicode 7.0.0

Status:	RESOLVED FIXED

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	localedata
Version:	2.21

Importance:	P2 normal
Target Milestone:	---
Assignee:	Pravin S

URL:
Keywords:

Duplicates (2):	14010 16969
Depends on:	17588
Blocks:

Reported:	2012-05-10 20:27 UTC by Joseph Myers
Modified:	2016-03-22 17:30 UTC
CC List:	10 users

See Also:	19852
Host:
Target:
Build:
Last reconfirmed:

Flags:	fweimer: security-

Attachments
Patch to update UTF-8 CHARMAP to unicode 7.0 (190.16 KB, patch) 2014-07-04 09:13 UTC, Pravin S
Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0 (252.66 KB, patch) 2014-07-17 10:41 UTC, Pravin S
Patch to update UTF-8 i18n file (CTYPE) to unicode 7.0 (44.54 KB, patch) 2014-07-22 12:18 UTC, Pravin S
unicode-5.0.0-report-full-output (60.24 KB, text/plain) 2014-11-06 11:02 UTC, Mike FABIAN
unicode-7.0.0-report-full-output (11.49 KB, text/plain) 2014-11-06 11:06 UTC, Mike FABIAN
gen-unicode-ctype.py (4.51 KB, text/plain) 2014-11-14 07:15 UTC, Mike FABIAN
gen-unicode-ctype.py (4.68 KB, text/plain) 2014-11-14 07:21 UTC, Mike FABIAN
report-gen-unicode-ctype.py-DerivedCoreProperties-7.0.0 (16.05 KB, text/plain) 2014-11-14 07:24 UTC, Mike FABIAN
gen-unicode-ctype.py (5.56 KB, text/x-python) 2014-12-01 10:14 UTC, Mike FABIAN
0001-Update-LC_CTYPE-character-class-data-to-Unicode-7.0..patch (52.88 KB, patch) 2014-12-03 12:27 UTC, Mike FABIAN
0002-Fix-test-case-localedata-tst-ctype-de_DE.ISO-8859-1..patch (680 bytes, patch) 2014-12-03 12:27 UTC, Mike FABIAN

Description Joseph Myers 2012-05-10 20:27:32 UTC

The Unicode locale data - character map and LC_CTYPE information - should be updated to Unicode 6.1 (the character map is currently based on 6.0, and LC_CTYPE is currently based on 5.0).  This should be done with proper automation, and wiki documentation should be added describing how to do future updates.  I identified the following tasks at <http://sourceware.org/ml/libc-alpha/2012-05/msg00590.html>:

* Ensure the character type data in localedata/charmaps/i18n can be
  properly reproduced from Unicode 5.0 data using gen-unicode-ctype.c,
  adapting gen-unicode-ctype.c as needed to replicate any changes that
  may have been made not using that program.

* Update the character type data to Unicode 6.1, removing any local
  hacks from gen-unicode-ctype.c that are no longer needed.
  (10646:2012, corresponding to Unicode 6.1, appears to be in
  publication stage so should be out very soon.)

* Ensure the character data in localedata/charmaps/UTF-8 can be
  reproduced in some automated fashion from Unicode 6.0, locating any
  previously used automation for this or creating some new automation
  if any previous automation can't be found.

* Update the character data to Unicode 6.1, removing any local hacks
  in the automation from the previous step.

* Document thoroughly on the wiki how the automation works and how to
  do updates to new Unicode versions.

Comment 1 Rich Felker 2012-05-11 03:25:47 UTC

One of the major "local hacks" can be fixed, fixing many other problems at the same time, by switching to using the Unicode "Alphabetic" property (from DerivedCoreProperties.txt) instead of just categories L* for class alpha. Right now there are many languages whose letters are considered non-alphabetic by glibc because they're in category Mn or Mc or even Cf. There are "local hacks" to fix this for maybe one or two languages, but using the right Unicode property would fix it for all languages.
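
A minimal sketch of reading the "Alphabetic" property, assuming the `first..last ; Property # comment` line layout of DerivedCoreProperties.txt. The function name `parse_property` and the inline `SAMPLE` excerpt are illustrative only and are not taken from any glibc script.

```python
def parse_property(lines, wanted="Alphabetic"):
    """Return the set of code points that carry the property `wanted`."""
    code_points = set()
    for line in lines:
        # Drop trailing comments and skip blank or comment-only lines.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != wanted:
            continue
        # The first field is a single code point or a "first..last" range.
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        code_points.update(range(start, end + 1))
    return code_points


# Illustrative excerpt in the DerivedCoreProperties.txt layout.
SAMPLE = [
    "# comment line",
    "0041..005A    ; Alphabetic # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "2DE0..2DFF    ; Alphabetic # Mn  [32] COMBINING CYRILLIC LETTER BE..COMBINING CYRILLIC LETTER IOTIFIED BIG YUS",
    "0041..005A    ; Uppercase # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
]

alphabetic = parse_property(SAMPLE)
```

With a table like this, class alpha can be decided by set membership, so letters in category Mn (such as U+2DF6 in the excerpt) are covered without per-language special cases.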

Comment 2 Johannes Löthberg 2014-05-20 22:13:48 UTC

*** Bug 16969 has been marked as a duplicate of this bug. ***

Comment 3 Pravin S 2014-05-23 07:53:54 UTC

Rather than Unicode 6.1, it should be Unicode 6.3.

The two files mentioned in the bug are:
1. i18n (LC_CTYPE), which used to be generated by gen-unicode-ctype.c
2. UTF-8 (it looks like a conversion from Unicode to UTF-8); I will find out

Are there any other files involved in upgrading the glibc localedata to Unicode 6.1?

Comment 4 jsm-csl@polyomino.org.uk 2014-05-23 12:02:39 UTC

Once the data is updated (maybe once just the character map is updated), 
__STDC_ISO_10646__ should be updated in include/stdc-predef.h to reflect 
the publication date of the edition or amendment to ISO 10646 
corresponding to the version of Unicode in use.

I advise keeping each of the tasks I listed as a separate patch, as it's 
important to be confident we aren't losing desired local changes in the 
course of the update (which means the existing files need to be reproduced 
exactly by some automation before the update is done).

Bug 16061 relates to transliteration data, some of which came from 
Unicode, and bug 14095 to collation data.  The same principles apply to 
those - reproduce the existing files, understanding any local changes in 
the process, then update to a newer Unicode version - but they are likely 
to involve much more work in understanding the existing state then 
updating while preserving any desired local changes.

Comment 5 Pravin S 2014-05-23 13:20:38 UTC

Yes, backward compatibility is a must.
I will write a small script to check that we are not changing existing maps, so we can be confident before committing.

Comment 6 Pravin S 2014-06-10 09:37:56 UTC

I have written a script for checking the backward compatibility of the new LC_CTYPE with the old LC_CTYPE.

Script is available at https://github.com/pravins/glibc-i18n

The important thing for us at present is the report generated by the script, i.e.

https://raw.githubusercontent.com/pravins/glibc-i18n/master/Report

While doing this, I also found that <U0D70>..<U0D75>; is included twice in the existing i18n file.

% MALAYALAM/
   <U0D66>..<U0D75>;<U0D70>..<U0D75>;/

Let me know if anything is missing.

In the next step, I will compare the characters missing from LC_CTYPE 5.0.0 against LC_CTYPE 6.3.0 and confirm whether these are intentional changes in Unicode or something we are missing.

I will be ready with a patch for updating LC_CTYPE next time.
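
A duplicate like the MALAYALAM one can be found mechanically. The following is a minimal sketch, assuming only plain `<Uxxxx>` items and `<Uxxxx>..<Uyyyy>` ranges appear on the line (any other range syntax is not handled). The helper names `parse_ranges` and `find_overlaps` are hypothetical and are not taken from the glibc-i18n repository.

```python
import re

# One code point, optionally followed by "..<Uyyyy>" to form a range.
RANGE_RE = re.compile(r"<U([0-9A-Fa-f]{4,8})>(?:\.\.<U([0-9A-Fa-f]{4,8})>)?")


def parse_ranges(text):
    """Extract (first, last) code point pairs from i18n-style range syntax."""
    ranges = []
    for match in RANGE_RE.finditer(text):
        first = int(match.group(1), 16)
        last = int(match.group(2), 16) if match.group(2) else first
        ranges.append((first, last))
    return ranges


def find_overlaps(ranges):
    """Return pairs of ranges that share at least one code point."""
    overlaps = []
    ordered = sorted(ranges)
    for i, (first_a, last_a) in enumerate(ordered):
        for first_b, last_b in ordered[i + 1:]:
            if first_b > last_a:
                break  # sorted by start, so later ranges cannot overlap
            overlaps.append(((first_a, last_a), (first_b, last_b)))
    return overlaps


line = "<U0D66>..<U0D75>;<U0D70>..<U0D75>;/"
overlaps = find_overlaps(parse_ranges(line))
```

For the line above, `overlaps` holds the single pair (0x0D66..0x0D75, 0x0D70..0x0D75).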

Comment 7 Carlos O'Donell 2014-06-10 14:38:23 UTC

(In reply to Pravin S from comment #6)
> I have written script for checking backward compatibility of new LC_CTYPE
> with old LC_CTYPE.
> 
> Script is available at https://github.com/pravins/glibc-i18n
> 
> Important thing for us presently is report generated by script. i.e. 
> 
> https://raw.githubusercontent.com/pravins/glibc-i18n/master/Report
> 
> While doing this also found in existing i18n file <U0D70>..<U0D75>; included
> twice.
> 
> % MALAYALAM/
>    <U0D66>..<U0D75>;<U0D70>..<U0D75>;/
> 
> Let me know if anything is missing.
> 
> In next step, i will check missing characters from LC_CTYPE 5.0.0 with
> LC_CTYPE 6.3.0 and confirm are these intentional changes at Unicode or
> something we are missing.
> 
> Will be ready with patch for updating LC_CTYPE next time.

Thanks Pravin! I think the missing step is to get these scripts checked into glibc's script/ directory so that we have them in a central location with some internal comments showing how to run the script. This way we can re-run them at later stages to verify what's missing and stay in sync (say the release manager runs it before a release).

Eventually we want a documented process here:
https://sourceware.org/glibc/wiki/Regeneration

Even if it's just "Run this script. Fix all warnings by hand" it would be a good start.

Comment 8 Pravin S 2014-06-11 03:49:46 UTC

Agree with you, will do it.

Comment 9 Pravin S 2014-06-19 10:28:00 UTC

(In reply to Rich Felker from comment #1)
> One of the major "local hacks" can be fixed, fixing many other problems at
> the same time, by switching to using the Unicode "Alphabetic" property (from
> DerivedCoreProperties.txt) instead of just categories L* for class alpha.
> Right now there are many languages whose letters are considered
> non-alphabetic by glibc because they're in category Mn or Mc or even Cf.
> There are "local hacks" to fix this for maybe one or two languages, but
> using the right Unicode property would fix it for all languages.

I was almost done, but while updating this I found that around 248 characters were added to the ALPHA group after gen-unicode-ctype.c processing in the present i18n CTYPE (Unicode 5.1, https://github.com/pravins/glibc-i18n/blob/master/unicode5-1/Report), and I am facing the same issue while upgrading it to Unicode 6.3 (246 characters, https://github.com/pravins/glibc-i18n/blob/master/Report).

While reading http://www.unicode.org/reports/tr44/#Property_List_Table I found the following statement:
 
"Implementations should simply use the derived properties, and should not try to rederive them from lists of simple properties and collections of rules, because of the chances for error and divergence when doing so."  

I agree with Rich: we should take what is available from DerivedCoreProperties.txt rather than processing the raw UnicodeData.txt. I am writing a script to process the groups from DerivedCoreProperties.txt.

Comment 10 Pravin S 2014-06-21 19:10:44 UTC

I am working with the latest Unicode standard, so I updated the bug summary.

Comment 11 Pravin S 2014-06-25 12:24:39 UTC

(In reply to Joseph Myers from comment #0)
> 
> * Ensure the character data in localedata/charmaps/UTF-8 can be
>   reproduced in some automated fashion from Unicode 6.0, locating any
>   previously used automation for this or creating some new automation
>   if any previous automation can't be found.

  I am not able to find the previous automation for this either.

  I can simply pass all Unicode characters through a Python Unicode-to-UTF-8 conversion and format the result as required by the UTF-8 file.

  Any hint on how to do this?

Comment 12 Carlos O'Donell 2014-06-25 13:47:52 UTC

(In reply to Pravin S from comment #11)
> (In reply to Joseph Myers from comment #0)
> > 
> > * Ensure the character data in localedata/charmaps/UTF-8 can be
> >   reproduced in some automated fashion from Unicode 6.0, locating any
> >   previously used automation for this or creating some new automation
> >   if any previous automation can't be found.
> 
>   Me too not able to find previous automation for same. 
> 
>   I can simply pass all Unicode to python unicode-to-utf8 and format it as
> required by UTF-8 file.
> 
>   Any hint on how to do this?

Not really, this is why this problem requires "work" ;-)

Comment 13 Pravin S 2014-07-04 09:13:23 UTC

Created attachment 7679 [details]
Patch to update UTF-8 CHARMAP to unicode 7.0

I have worked on updating the UTF-8 file to Unicode 7.0. The following are the
important points to note before reviewing this patch.

  1. The present patch is only for CHARMAP; a patch for updating WIDTH will be
     available soon.
  2. utf8-gen.py: New script to generate the UTF-8 file.
  3. The patch was created ignoring whitespace changes (-w).
  4. Where the UnicodeData.txt file gives characters as a range, for example:

       3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
       4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;

     the UTF-8 file expresses the range between the First and Last Unicode
     characters as a series of sub-ranges, each ending 0x3F after its first
     character. For example:

       <U3400>..<U343F>     /xe3/x90/x80         <CJK Ideograph Extension A>
       .
       .
       <U4D80>..<U4DB5>     /xe4/xb6/x80         <CJK Ideograph Extension A>

     Note: No idea why the Hangul syllables (AC00; D7A3;) were not expanded in
     the Unicode 5.0 UTF-8 file. We are staying consistent and expanding
     Hangul as well.
  5. Names from UnicodeData.txt are changed in some cases. Some characters
     have <control> as their name, so the "Unicode 1.0 Name" is used instead.
     Characters U+0080, U+0081, U+0084 and U+0099 have "<control>" as their
     name and also no "Unicode 1.0 Name" (10th field) in UnicodeData.txt.
     We can write code to take their alternate names from NameAliases.txt.
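
The range expansion described in point 4 can be sketched as follows. This is only an illustration under stated assumptions: the function names `utf8_escape` and `charmap_ranges` are hypothetical (they are not claimed to match utf8-gen.py), the column spacing is approximate, and the range start is assumed to be aligned to the 0x40 chunk size.

```python
def utf8_escape(code_point):
    """Format the UTF-8 bytes of a code point as /xNN/xNN..."""
    encoded = chr(code_point).encode("utf-8")
    return "".join("/x{:02x}".format(byte) for byte in encoded)


def charmap_ranges(first, last, name, chunk=0x40):
    """Yield charmap lines covering first..last in chunks of `chunk` code points."""
    for start in range(first, last + 1, chunk):
        # Each sub-range ends 0x3F after its start, except a shorter final one.
        end = min(start + chunk - 1, last)
        yield "<U{:04X}>..<U{:04X}>     {:<20} {}".format(
            start, end, utf8_escape(start), name)


lines = list(charmap_ranges(0x3400, 0x4DB5, "<CJK Ideograph Extension A>"))
```

The first and last generated lines correspond to the `<U3400>..<U343F>` and `<U4D80>..<U4DB5>` entries shown in the example above.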

Comment 14 Pravin S 2014-07-17 10:41:16 UTC

Created attachment 7715 [details]
Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0

Done with all work on the UTF-8 file.
Added two scripts and a report:
1. utf8-gen.py: to generate the UTF-8 file
2. utf8-compatibility.py: to check backward compatibility of the newly generated UTF-8 file
3. The report on the backward compatibility of the new UTF-8 file is available at https://raw.githubusercontent.com/pravins/glibc-i18n/master/report-utf8

Submitting to libc-alpha; please help with a quick review and push to git.

Comment 15 Pravin S 2014-07-22 12:18:16 UTC

Created attachment 7720 [details]
Patch to update UTF-8 i18n file (CTYPE) to unicode 7.0

The patch does the following:
* locales/i18n: Updated to Unicode 7.0.0

* scripts/gen-unicode-ctype.c: Disabled upper, lower, alpha and outdigit classes.

* scripts/ctype-gen.sh: Shell script to generate LC_CTYPE for new Unicode version.

* scripts/gen-unicode-ctype-dcp.py: New script for generating locales/i18n upper, lower and alpha ctype from DerivedCoreProperties.txt

* scripts/ctype-compatibility.py: Script for testing backward compatibility of LC_CTYPE in locales/i18n.

Report for backward compatibility is available at 
https://raw.githubusercontent.com/pravins/glibc-i18n/master/unicode7-0/ctype-compatibility5_1-to-7_0

Comment 16 Carlos O'Donell 2014-09-05 01:07:21 UTC

Pravin,

Is any part of your work ready for 2.21 when it opens?

Comment 17 Pravin S 2014-09-29 07:17:35 UTC

I am still waiting for someone to review these patches.
The best way would be:
1. Build glibc with the patches.
2. Test the WIDTH and CTYPE functions (do they return proper values?); one could do the same with the existing glibc and compare.

Comment 18 Mike FABIAN 2014-10-14 08:07:13 UTC

(In reply to Pravin S from comment #14)
> Created attachment 7715 [details]
> Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0
> 
> Done with all work with UTF-8 file. 
> Added two script:
> 1. utf8-gen.py to generate UTF-8 file
> 2. utf8-compatibility.py : to check backward compatibility of newly
> generated UTF-8 file
> 3. Report of new UTF-8 file backward compatibility is available AT
> https://raw.githubusercontent.com/pravins/glibc-i18n/master/report-utf8
> 
> Submitting to glibc-alpha, please help to quick review and push to git.

I checked the scripts Pravin used and the resulting UTF-8 file.

I found only one minor problem:

In some cases, both UnicodeData.txt and EastAsianWidth.txt have information
about width. For example, EastAsianWidth.txt has:
    
    302A..302D;W     # Mn     [4] IDEOGRAPHIC LEVEL TONE MARK..IDEOGRAPHIC ENTERING TONE MARK
    
which gives us width 2 for these 4 characters (because of “W”) but
UnicodeData.txt has:
    
    302A;IDEOGRAPHIC LEVEL TONE MARK;Mn;218;NSM;;;;;N;;;;;
    302B;IDEOGRAPHIC RISING TONE MARK;Mn;228;NSM;;;;;N;;;;;
    302C;IDEOGRAPHIC DEPARTING TONE MARK;Mn;232;NSM;;;;;N;;;;;
    302D;IDEOGRAPHIC ENTERING TONE MARK;Mn;222;NSM;;;;;N;;;;;
    
which would give width 0 (because of “NSM”).

I changed Pravin’s script a bit to prefer the information from
EastAsianWidth.txt in case of conflicts.

Pravin has already merged my change into his git repository.
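
The precedence rule can be sketched like this. It is a simplified illustration, not the code in Pravin's repository: the names `parse_east_asian_width` and `char_width` are hypothetical, treating "F" like "W" is an assumption, and the fallback of width 1 for everything else ignores other cases a real generator has to handle.

```python
def parse_east_asian_width(lines):
    """Map code point -> East Asian Width value (for example "W")."""
    widths = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        code_range, value = [field.strip() for field in data.split(";")[:2]]
        first, _, last = code_range.partition("..")
        for code_point in range(int(first, 16), int(last or first, 16) + 1):
            widths[code_point] = value
    return widths


def char_width(code_point, east_asian, bidi_class):
    """Prefer EastAsianWidth.txt; otherwise 0 for NSM, else 1."""
    if east_asian.get(code_point) in ("W", "F"):
        return 2
    if bidi_class.get(code_point) == "NSM":
        return 0
    return 1


EAW_LINES = [
    "302A..302D;W     # Mn     [4] IDEOGRAPHIC LEVEL TONE MARK..IDEOGRAPHIC ENTERING TONE MARK",
]
UNICODE_DATA = [
    "302A;IDEOGRAPHIC LEVEL TONE MARK;Mn;218;NSM;;;;;N;;;;;",
    "302B;IDEOGRAPHIC RISING TONE MARK;Mn;228;NSM;;;;;N;;;;;",
    "302C;IDEOGRAPHIC DEPARTING TONE MARK;Mn;232;NSM;;;;;N;;;;;",
    "302D;IDEOGRAPHIC ENTERING TONE MARK;Mn;222;NSM;;;;;N;;;;;",
]

east_asian = parse_east_asian_width(EAW_LINES)
# Field 4 of UnicodeData.txt is the bidirectional class.
bidi_class = {
    int(fields[0], 16): fields[4]
    for fields in (line.split(";") for line in UNICODE_DATA)
}
```

Checking EastAsianWidth.txt first gives width 2 for U+302A..U+302D; without that data the NSM rule would give 0.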

Comment 19 Mike FABIAN 2014-11-06 11:00:02 UTC

I extended Pravin’s ctype-compatibility.py script to produce more
human readable output and added many extra tests.

Joseph Myers> * Ensure the character type data in
Joseph Myers>   localedata/charmaps/i18n can be properly reproduced from
Joseph Myers>   Unicode 5.0 data using gen-unicode-ctype.c, adapting
Joseph Myers>   gen-unicode-ctype.c as needed to replicate any changes
Joseph Myers>   that may have been made not using that program.

When using gen-unicode-ctype.c with UnicodeData.txt-5.0.0
to generate LC_CTYPE, the generated file lacks
many characters which apparently have been manually added
to glibc’s i18n file:

alpha: Missing 1238 characters of old ctype in new ctype 
blank: Missing 0 characters of old ctype in new ctype 
cntrl: Missing 0 characters of old ctype in new ctype 
combining: Missing 124 characters of old ctype in new ctype 
combining_level3: Missing 49 characters of old ctype in new ctype 
digit: Missing 0 characters of old ctype in new ctype 
graph: Missing 1571 characters of old ctype in new ctype 
lower: Missing 115 characters of old ctype in new ctype 
print: Missing 1571 characters of old ctype in new ctype 
punct: Missing 335 characters of old ctype in new ctype 
space: Missing 0 characters of old ctype in new ctype 
tolower: Missing 19 characters of old ctype in new ctype 
totitle: Missing 8 characters of old ctype in new ctype 
toupper: Missing 18 characters of old ctype in new ctype 
upper: Missing 100 characters of old ctype in new ctype 
xdigit: Missing 0 characters of old ctype in new ctype 

I.e. reproducing the localedata/charmaps/i18n character type data
from Unicode 5.0 data using gen-unicode-ctype.c does not work
well because glibc’s i18n file apparently has been edited
manually a lot already to include newer Unicode data.
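
The per-class counts above amount to a set difference between the old and the new class tables. A minimal sketch, assuming each class has already been parsed into a set of code points; `missing_report` is a hypothetical helper and the tiny tables are made up purely to show the shape of the comparison:

```python
def missing_report(old_classes, new_classes):
    """Return report lines saying how many old members each new class lacks."""
    report = []
    for name in sorted(old_classes):
        missing = old_classes[name] - new_classes.get(name, set())
        report.append(
            "{}: Missing {} characters of old ctype in new ctype".format(
                name, len(missing)))
    return report


# Made-up class tables, not real locale data.
old = {"alpha": {0x41, 0x901, 0x902}, "punct": {0x21}}
new = {"alpha": {0x41}, "punct": {0x21, 0x2DF6}}
report = missing_report(old, new)
```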

Apparently quite a few mistakes have been made by manually editing
the i18n file. For example, the report from ctype-compatibility.py
also produces for the old i18n file:

error: 0xa67f ꙿ punct True: 0xa67f CYRILLIC PAYEROK. Not in Unicode 5.0.0. In Unicode
            7.0.0. General category Lm (Letter
            modifier). DerivedCoreProperties.txt says it is
            “Alphabetic”. Apparently added manually to punct by mistake in
            glibc’s old LC_CTYPE.
error: 0xa67f ꙿ alpha False: 0xa67f CYRILLIC PAYEROK. Not in Unicode 5.0.0. In Unicode
            7.0.0. General category Lm (Letter
            modifier). DerivedCoreProperties.txt says it is
            “Alphabetic”. Apparently added manually to punct by mistake in
            glibc’s old LC_CTYPE.

Another example:

error: 0x9f4 ৴ alpha True: 
            “09F4;BENGALI CURRENCY NUMERATOR ONE;No;0;L;;;;1/16;N;;;;;”
            “09F5;BENGALI CURRENCY NUMERATOR TWO;No;0;L;;;;1/8;N;;;;;”
            “09F6;BENGALI CURRENCY NUMERATOR THREE;No;0;L;;;;3/16;N;;;;;”
            “09F7;BENGALI CURRENCY NUMERATOR FOUR;No;0;L;;;;1/4;N;;;;;”
            “09F8;BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR;No;0;L;;;;3/4;N;;;;;”
            “09F9;BENGALI CURRENCY DENOMINATOR SIXTEEN;No;0;L;;;;16;N;;;;;”
            “09FA;BENGALI ISSHAR;So;0;L;;;;;N;;;;;”
            According to DerivedCoreProperties.txt (7.0.0) these are *not*
            “Alphabetic”.

So these have been mistakenly added to “alpha” in the old i18n file
of glibc (but gen-unicode-ctype.c correctly puts them into “punct”,
i.e. this seems to be another mistake made by manual editing).

Some of the errors reported by ctype-compatibility.py

error: 0x250 ɐ lower False: Should be lower in Unicode 7.0.0 (was not lower in
            Unicode 5.0.0).
            
would be fixed by using gen-unicode-ctype.c with Unicode 7.0.0 input.

There are many more problems like this in the old i18n file,
my tests found 133 errors total:

------------------------------------------------------------
Old file = /local/mfabian/src/glibc/localedata/locales/i18n
Number of errors in old file = 133
------------------------------------------------------------

I’ll attach the full report.

Comment 20 Mike FABIAN 2014-11-06 11:02:06 UTC

Created attachment 7907 [details]
unicode-5.0.0-report-full-output

Full report from ctype-compatibility.py when comparing the old i18n
file in glibc with the file generated by gen-unicode-ctype.c using
UnicodeData.txt from Unicode 5.0.0.

Comment 21 Mike FABIAN 2014-11-06 11:03:04 UTC

Now when using gen-unicode-ctype.c with UnicodeData.txt-7.0.0
to generate LC_CTYPE, the generated file lacks far fewer
characters compared to the old i18n file in glibc:

alpha: Missing 246 characters of old ctype in new ctype 
blank: Missing 1 characters of old ctype in new ctype 
cntrl: Missing 0 characters of old ctype in new ctype 
combining: Missing 3 characters of old ctype in new ctype 
combining_level3: Missing 5 characters of old ctype in new ctype 
digit: Missing 0 characters of old ctype in new ctype 
graph: Missing 0 characters of old ctype in new ctype 
lower: Missing 20 characters of old ctype in new ctype 
print: Missing 0 characters of old ctype in new ctype 
punct: Missing 16 characters of old ctype in new ctype 
space: Missing 1 characters of old ctype in new ctype 
tolower: Missing 0 characters of old ctype in new ctype 
totitle: Missing 0 characters of old ctype in new ctype 
toupper: Missing 0 characters of old ctype in new ctype 
upper: Missing 0 characters of old ctype in new ctype 
xdigit: Missing 0 characters of old ctype in new ctype

For example, gen-unicode-ctype.c does not put U+0901 into
the “alpha” class although it should be there
according to DerivedCoreProperties.txt:

error: 0x901 ँ alpha False: These have general category “Mn” i.e. these are combining
            characters (both in UnicodeData.txt 5.0.0 and 7.0.0):
            “0901;DEVANAGARI SIGN CANDRABINDU;Mn;0;NSM;;;;;N;;;;;”,
            “0902;DEVANAGARI SIGN ANUSVARA;Mn;0;NSM;;;;;N;;;;;”,
            “0903;DEVANAGARI SIGN VISARGA;Mc;0;L;;;;;N;;;;;”.
            According to DerivedCoreProperties.txt (7.0.0) these are
            “Alphabetic”.  

Apparently this has been edited manually (correctly) in the old i18n file
of glibc.

So this would be fixed in the automatic generation
when using DerivedCoreProperties.txt for “alpha”.

But some of the above seem to be errors in the old i18n file
of glibc, for example:

error: 0x1090 ႐ punct True: MYANMAR SHAN DIGIT ZERO - MYANMAR SHAN DIGIT NINE.
            These are digits, but because ISO C 99 forbids to
            put them into digit they should go into alpha.

This is in “punct” in the old i18n file but gen-unicode-ctype.c
would put it into “alpha” which seems better for such digits
according to the comments in gen-unicode-ctype.c.

I went through all these “Missing” characters individually,
looked them up in UnicodeData.txt and DerivedCoreProperties.txt,
checked how they should be classified, and added test cases
for them to the ctype-compatibility.py script.

I’ll attach the full report after using gen-unicode-ctype.c with
UnicodeData.txt-7.0.0 to generate LC_CTYPE.

Comment 22 Mike FABIAN 2014-11-06 11:06:28 UTC

Created attachment 7908 [details]
unicode-7.0.0-report-full-output

Full report from ctype-compatibility.py when comparing the old i18n
file in glibc with the file generated by gen-unicode-ctype.c using
UnicodeData.txt from Unicode 7.0.0.

Comment 23 Mike FABIAN 2014-11-06 11:45:32 UTC

Now Pravin’s approach in the patch attached to comment#15
is to comment out the generation of “upper”, “lower”
and “alpha” from gen-unicode-ctype.c and add another
script gen-unicode-ctype-dcp.py which adds these.

But this is a bit problematic.

1) it does not put digits like

alpha: Missing: ٠ 0x660 ARABIC-INDIC DIGIT ZERO

into “alpha”, which gen-unicode-ctype.c would have done.
gen-unicode-ctype.c contains the comment

/* Consider all the non-ASCII digits as alphabetic.
ISO C 99 forbids us to have them in category "digit",
but we want iswalnum to return true on them. */

which sounds reasonable.

2) it does not put characters like

lower: Missing: ǅ 0x1c5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON

into lower. This is actually title case, not lower case,
but glibc does have only “lower” and “upper”, not “title”.
Although it has “toupper”, “tolower”, and “totitle”.

gen-unicode-ctype.c puts characters which change when “toupper”
is applied into “lower” and characters which change when “tolower”
is applied into “upper”. Therefore, gen-unicode-ctype.c
puts title case characters like ǅ 0x1c5 into *both*, “upper” *and*
“lower”. Which seems reasonable if glibc has no “title”.

3) it does not put some characters like:

upper: Missing: ᾈ 0x1f88 GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI

into “upper”. Surprisingly,

“U+1F88 ᾈ GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI”
is *not* listed as “Uppercase” in
http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt .

Although U+1F88 seems to be Uppercase according to
http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
because it has a tolower mapping to U+1F80:

1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;Ll;0;L;1F00 0345;;;;N;;;1F88;;1F88
1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI;Lt;0;L;1F08 0345;;;;N;;;;1F80;

So this might be a bug in DerivedCoreProperties.txt.

Generating “upper” and “lower” the way gen-unicode-ctype.c does,
i.e. just using UnicodeData.txt and check whether characters
change when mapping them to upper or to lower does not produce this
error. I think the approach gen-unicode-ctype.c uses for “upper”
and “lower” is fine, it is not necessary to use DerivedCoreProperties.txt
for this.
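
The mapping-based rule can be sketched as follows, using the two UnicodeData.txt lines quoted above. This is a simplified illustration; the function names are hypothetical, and the real gen-unicode-ctype.c may contain additional special cases that are not reproduced here.

```python
def parse_case_mappings(lines):
    """Return (to_upper, to_lower) dicts from UnicodeData.txt-style lines."""
    to_upper, to_lower = {}, {}
    for line in lines:
        fields = line.strip().split(";")
        code_point = int(fields[0], 16)
        # Fields 12 and 13 hold the simple uppercase and lowercase mappings.
        if fields[12]:
            to_upper[code_point] = int(fields[12], 16)
        if fields[13]:
            to_lower[code_point] = int(fields[13], 16)
    return to_upper, to_lower


def is_upper(code_point, to_lower):
    """A character counts as upper if mapping it to lower changes it."""
    return to_lower.get(code_point, code_point) != code_point


def is_lower(code_point, to_upper):
    """A character counts as lower if mapping it to upper changes it."""
    return to_upper.get(code_point, code_point) != code_point


UNICODE_DATA = [
    "1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;Ll;0;L;1F00 0345;;;;N;;;1F88;;1F88",
    "1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI;Lt;0;L;1F08 0345;;;;N;;;;1F80;",
]
to_upper, to_lower = parse_case_mappings(UNICODE_DATA)
```

Under this rule U+1F88 lands in “upper” because its lowercase mapping (U+1F80) differs from itself, independent of what DerivedCoreProperties.txt lists.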

4) *many* characters end up being in “alpha” *and* “punct”

For example:

error: ⷶ 0x2df6 is alpha and punct

gen-unicode-ctype.c has the comment:

/* alpha restriction: "No character specified for the keywords cntrl,
digit, punct or space shall be specified." */

This restriction is violated because the second script
gen-unicode-ctype-dcp.py used in Pravin’s 2-pass approach does not
check whether gen-unicode-ctype.c has already put a character into
“punct” before putting it into “alpha”.

The character “ⷶ U+2df6 COMBINING CYRILLIC LETTER A” is “Alphabetic”
according to DerivedCoreProperties.txt:

2DE0..2DFF ; Alphabetic # Mn [32] COMBINING CYRILLIC LETTER BE..COMBINING CYRILLIC LETTER IOTIFIED BIG YUS

So Pravin’s script rightly puts it into “alpha”.

But looking at this, it seems not a good idea to have two independent
programs generating the file in 2 independent passes.

Verifications like gen-unicode-ctype.c does:

/* toupper restriction: "Only characters specified for the keywords
lower and upper shall be specified. */
...
/* tolower restriction: "Only characters specified for the keywords
lower and upper shall be specified. */
...
/* alpha restriction: "Characters classified as either upper or lower
shall automatically belong to this class. */
...
/* alpha restriction: "No character specified for the keywords cntrl,
digit, punct or space shall be specified." */
...
/* space restriction: "No character specified for the keywords upper,
lower, alpha, digit, graph or xdigit shall be specified."
upper, lower, alpha already checked above. */
...
/* cntrl restriction: "No character specified for the keywords upper,
lower, alpha, digit, punct, graph, print or xdigit shall be
specified." upper, lower, alpha already checked above. */
...

can be done much more easily when using a single program.
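
When all classes are built in one program, such restrictions reduce to set operations on the finished tables. A minimal sketch covering only two of the restrictions quoted above (alpha versus cntrl/digit/punct/space, and upper/lower implying alpha); `check_restrictions`, the message wording, and the small class tables are illustrative assumptions, not the output format of any existing script.

```python
def check_restrictions(classes):
    """Return a list of error strings for violated LC_CTYPE restrictions."""
    errors = []

    def forbid(name, others):
        # Report every code point that is in `name` and in a forbidden class.
        for other in others:
            for code_point in sorted(classes[name] & classes[other]):
                errors.append("error: 0x{:x} is {} and {}".format(
                    code_point, name, other))

    # alpha: no character from cntrl, digit, punct or space.
    forbid("alpha", ["cntrl", "digit", "punct", "space"])
    # upper and lower characters automatically belong to alpha.
    for name in ("upper", "lower"):
        for code_point in sorted(classes[name] - classes["alpha"]):
            errors.append("error: 0x{:x} is {} but not alpha".format(
                code_point, name))
    return errors


# Made-up tables; U+2DF6 is deliberately in both alpha and punct.
classes = {
    "alpha": {0x41, 0x61, 0x2DF6},
    "upper": {0x41},
    "lower": {0x61},
    "digit": {0x30},
    "punct": {0x21, 0x2DF6},
    "space": {0x20},
    "cntrl": {0x00},
}
errors = check_restrictions(classes)
```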

Comment 24 Mike FABIAN 2014-11-06 11:56:03 UTC

So I think we should do either:

1) improve gen-unicode-ctype.c and make it use
   DerivedCoreProperties.txt for “alpha”

or:

2) rewrite gen-unicode-ctype.c in Python:
   first a rewrite which produces *exactly* the same
   output as gen-unicode-ctype.c, then add code
   to use DerivedCoreProperties.txt for “alpha”

No matter whether we extend the C program or write a Python program,
it should be a single program so that the restrictions mentioned
can be verified easily.

It would of course be nice to make the program read in the old i18n
file, replace the character classes, and write out a new file which
keeps the rest of the original file, so that no manual copy&paste of
the generated character classes is necessary.

Comment 25 Mike FABIAN 2014-11-06 12:00:26 UTC

(In reply to Mike FABIAN from comment #24)

> No matter whether extending the C-Program or writing a Python program,
> it should be a single program to be able to verify the restrictions
> mentioned easily.

And as a 2nd pass, after the single program to generate the character
class data, use ctype-compatibility.py as a "test-suite".

Comment 26 Pravin S 2014-11-12 10:18:51 UTC

(In reply to Mike FABIAN from comment #18)
> (In reply to Pravin S from comment #14)
> > Created attachment 7715 [details]
> > Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0
> > 
> > Done with all work with UTF-8 file. 
> > Added two script:
> > 1. utf8-gen.py to generate UTF-8 file
> > 2. utf8-compatibility.py : to check backward compatibility of newly
> > generated UTF-8 file
> > 3. Report of new UTF-8 file backward compatibility is available AT
> > https://raw.githubusercontent.com/pravins/glibc-i18n/master/report-utf8
> > 
> > Submitting to glibc-alpha, please help to quick review and push to git.
> 
> I checked the scripts Pravin used and the resulting UTF-8 file.
> 
> I found only one minor problem:
> 
> In some cases, both UnicodeData.txt and EastAsianWidth.txt have information
> about width. For example, EastAsianWidth.txt has:
>     
>     302A..302D;W     # Mn     [4] IDEOGRAPHIC LEVEL TONE MARK..IDEOGRAPHIC
> ENTERING TONE MARK
>     
> which gives us width 2 for these 4 characters (because of “W”) but
> UnicodeData.txt has:
>     
>     302A;IDEOGRAPHIC LEVEL TONE MARK;Mn;218;NSM;;;;;N;;;;;
>     302B;IDEOGRAPHIC RISING TONE MARK;Mn;228;NSM;;;;;N;;;;;
>     302C;IDEOGRAPHIC DEPARTING TONE MARK;Mn;232;NSM;;;;;N;;;;;
>     302D;IDEOGRAPHIC ENTERING TONE MARK;Mn;222;NSM;;;;;N;;;;;
>     
> which would give width 0 (because of “NSM”).
> 
> I changed Pravin’s script a bit to prefer the information from
> EastAsianWidth.txt in case of conflicts.
> 
> Pravin has already merged my change into his git repository.

Thanks, Mike, for the review. This bug is presently tracking two changes, one to the i18n file and the other to the UTF-8 file. Both changes are significant, so for better tracking I created a new bug, https://sourceware.org/bugzilla/show_bug.cgi?id=17588, for the UTF-8 file. I will submit the respective patches there.

The i18n ctype work is still pending.

Comment 27 Mike FABIAN 2014-11-14 07:15:46 UTC

Created attachment 7931 [details]
gen-unicode-ctype.py

Python rewrite of Bruno Haible’s gen-unicode-ctype.c.

This version produces *exactly* the same output as the C program:
    
    $ gcc -o gen-unicode-ctype gen-unicode-ctype.c
    $ ./gen-unicode-ctype UnicodeData.txt 7.0.0
    $ ./gen-unicode-ctype.py -u UnicodeData.txt -o unicode-new --unicode_version 7.0.0
    $ diff -u unicode unicode-new
    $

Comment 28 Mike FABIAN 2014-11-14 07:21:13 UTC

Created attachment 7932 [details]
gen-unicode-ctype.py

Improved version of gen-unicode-ctype.py which also parses
DerivedCoreProperties.txt and uses it (partly) for is_alpha(),
is_lower(), and is_upper().

"partly" because of 1):

            # Consider all the non-ASCII digits as alphabetic.
            # ISO C 99 forbids us to have them in category “digit”,
            # but we want iswalnum to return true on them.

These digits are not “Alphabetic” in DerivedCoreProperties.txt
but it seems to make sense to treat them as alpha according
to this comment by Bruno.

and 2):
    title case characters are treated as both upper *and* lower.

Comment 29 Mike FABIAN 2014-11-14 07:24:02 UTC

Created attachment 7933 [details]
report-gen-unicode-ctype.py-DerivedCoreProperties-7.0.0

Comment 30 Mike FABIAN 2014-11-14 07:34:20 UTC

(In reply to Mike FABIAN from comment #29)
> Created attachment 7933 [details]
> report-gen-unicode-ctype.py-DerivedCoreProperties-7.0.0

From this report:

alpha: Missing: ⒜ 0x249c PARENTHESIZED LATIN SMALL LETTER A
...

These are *not* “Alphabetic” in DerivedCoreProperties.txt, therefore
it is correct to remove them.

978 characters have been removed from “punct” which are now in “alpha”
because of DerivedCoreProperties.txt.

Number of errors in new file = 11:

These are only errors like:

error: 0xe2f ฯ alpha True: FIXME: Theppitak Karoonboonyanan <thep@links.nectec.or.th> says
            <U0E2F>, <U0E46> should belong to punct. DerivedCoreProperties.txt
            says it is alpha.
...
error: 0xe4e ๎ alpha False: FIXME: gen-unicode-ctype.c: Theppitak Karoonboonyanan
            <thep@links.nectec.or.th> says <U0E47>..<U0E4E> are
            is_alpha. DerivedCoreProperties does *not*.

I wrote mail to Theppitak Karoonboonyanan <thep@links.nectec.or.th>
and Bruno. The mail to thep@links.nectec.or.th bounced, and I did not
get an answer from Bruno.

I think it is better to trust DerivedCoreProperties.txt here, so I don’t
think these are errors.

So I think my updated gen-unicode-ctype.py produces the character
classes correctly (as far as possible with the limitations caused by
glibc and ISO C 99).
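A report line like the "alpha: Missing:" one quoted above can be produced by
comparing the old and new members of a class. The following is a minimal
sketch under stated assumptions: the function names and the "Added" label are
made up for illustration, and character names come from Python's own
unicodedata module rather than from the UnicodeData.txt passed to the script.

```python
import unicodedata


def format_line(class_name, kind, cp):
    """Format one report line: class, kind, character, code point, name."""
    char_name = unicodedata.name(chr(cp), "<unnamed>")
    return f"{class_name}: {kind}: {chr(cp)} {hex(cp)} {char_name}"


def report_class_changes(class_name, old, new):
    """List code points dropped from ("Missing") or added to a class."""
    lines = [format_line(class_name, "Missing", cp) for cp in sorted(old - new)]
    # "Added" is a hypothetical label; the report above only shows "Missing".
    lines += [format_line(class_name, "Added", cp) for cp in sorted(new - old)]
    return lines


# Hand-made sample: U+249C was in the old "alpha" class but is not in the new one.
report = report_class_changes("alpha", {0x61, 0x249C}, {0x61})
for line in report:
    print(line)
```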

Comment 31 Mike FABIAN 2014-11-14 07:36:03 UTC

I think I should probably do another update to gen-unicode-ctype.py
to read in the original “i18n” file of glibc and write out a new
one replacing the character classes to avoid having to do cut and paste
manually.

Comment 32 Mike FABIAN 2014-11-24 11:19:53 UTC

(In reply to Mike FABIAN from comment #23)

> 3) it does not put some characters like:
> 
>     upper: Missing: ᾈ 0x1f88 GREEK CAPITAL LETTER ALPHA WITH PSILI AND
> PROSGEGRAMMENI
> 
> into “upper”. Surprisingly,
> 
> “U+1F88 ᾈ GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI”
> is *not* listed as “Uppercase” in
> http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt .
> 
> Although U+1F88 seems to be Uppercase according to
> http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
> because it has a tolower mapping to U+1F80:
> 
>     1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;Ll;0;L;1F00
> 0345;;;;N;;;1F88;;1F88
>     1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND
> PROSGEGRAMMENI;Lt;0;L;1F08 0345;;;;N;;;;1F80;
> 
> So this might be a bug in DerivedCoreProperties.txt.

It is not a bug in DerivedCoreProperties.txt; I asked on the Unicode
mailing list:

http://www.unicode.org/mail-arch/unicode-ml/y2014-m11/0010.html

So these are actually title case as well.

That means that, because of the restrictions of ISO C 99, these title
case characters should be in both the “upper” and “lower” character
classes in LC_CTYPE (my gen-unicode-ctype.py from comment#28 does this).
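The two UnicodeData.txt records quoted above show this directly: U+1F88 has
general category Lt and only a lowercase mapping. A small sketch of reading
those fields, assuming the standard semicolon-separated layout (field 2 is the
general category, fields 12 to 14 are the simple upper, lower and title case
mappings); the function name and the returned dictionary are illustrative
only:

```python
# The two records quoted above, joined back onto single lines.
SAMPLE = """\
1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;Ll;0;L;1F00 0345;;;;N;;;1F88;;1F88
1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI;Lt;0;L;1F08 0345;;;;N;;;;1F80;
"""


def parse_unicode_data(text):
    """Map code point -> name, category and simple case mappings."""
    def mapping(field):
        # Empty field means "no mapping".
        return int(field, 16) if field else None

    records = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        records[int(fields[0], 16)] = {
            "name": fields[1],
            "category": fields[2],
            "upper": mapping(fields[12]),
            "lower": mapping(fields[13]),
            "title": mapping(fields[14]),
        }
    return records


records = parse_unicode_data(SAMPLE)
for cp, rec in sorted(records.items()):
    print(hex(cp), rec["category"], rec["upper"], rec["lower"], rec["title"])
```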

Comment 33 Mike FABIAN 2014-12-01 10:14:54 UTC

Created attachment 7979 [details]
gen-unicode-ctype.py

New version of gen-unicode-ctype.py which can read the head and tail
of the original i18n file.  To avoid having to cut and paste the
generated LC_CTYPE character classes into the new glibc i18n file,
read the old file as well. Copy everything from the old file to the
newly generated file except the LC_CTYPE character class data, which
are generated from the UnicodeData.txt and DerivedCoreProperties.txt
given.
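The head/tail approach described above might look roughly like this. It is a
sketch only: the marker strings and the function name are hypothetical, and
the lines that the attached script actually uses to find the start and end of
the character class data may differ.

```python
def splice_classes(old_text, generated, start_marker, end_marker):
    """Keep the head and tail of old_text; replace what lies between markers.

    The line starting with start_marker stays in the head and the line
    starting with end_marker stays in the tail.
    """
    lines = old_text.splitlines(keepends=True)
    start = end = None
    for index, line in enumerate(lines):
        if start is None and line.startswith(start_marker):
            start = index
        elif start is not None and line.startswith(end_marker):
            end = index
            break
    if start is None or end is None:
        raise ValueError("markers not found in old file")
    head = lines[:start + 1]
    tail = lines[end:]
    return "".join(head) + generated + "".join(tail)


# Hypothetical markers and a toy "old file", not the real i18n file.
old = (
    "LC_CTYPE\n"
    "% BEGIN generated classes\n"
    "upper <U0041>\n"
    "% END generated classes\n"
    "END LC_CTYPE\n"
)
new = splice_classes(old, "upper <U0041>;<U0042>\n",
                     "% BEGIN generated classes", "% END generated classes")
print(new, end="")
```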

Comment 34 Mike FABIAN 2014-12-03 09:59:12 UTC

When I generate a new glibc/localedata/locales/i18n file
using gen-unicode-ctype.py from comment#33 and build
glibc with that and then run the tests with “make check”, I get
one failure:

    FAIL: localedata/tst-ctype

Looking into why it fails, I find this in ./localedata/tst-ctype.out:

    Locale-specific tests for `lower'
      islower('ª' = '\xaa') is true
      islower('º' = '\xba') is true
    Locale-specific tests for `lower'
    ...
    2 errors for `de_DE.ISO-8859-1' locale

The new “lower” character class generated by gen-unicode-ctype.py
contains U+00AA ª FEMININE ORDINAL INDICATOR and U+00BA º MASCULINE
ORDINAL INDICATOR.

The test tst-ctype run by “make check” wants them *not* to be lower case.

DerivedCoreProperties.txt lists both as lower case though:

    00AA          ; Lowercase # Lo       FEMININE ORDINAL INDICATOR
    00BA          ; Lowercase # Lo       MASCULINE ORDINAL INDICATOR

That’s why gen-unicode-ctype.py adds them to the “lower” character
class: it adds all characters marked as “Lowercase” in
DerivedCoreProperties.txt to the character class “lower”.
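To illustrate how such lines end up in “lower”, here is a minimal sketch of
collecting properties from text in DerivedCoreProperties.txt format. The
function name and the regular expression are assumptions for this example and
not taken from the script; the two range lines in the sample were added only
to show range handling.

```python
import re

SAMPLE = """\
# Sample in DerivedCoreProperties.txt format
0061..007A    ; Lowercase # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00AA          ; Lowercase # Lo       FEMININE ORDINAL INDICATOR
00BA          ; Lowercase # Lo       MASCULINE ORDINAL INDICATOR
0041..005A    ; Uppercase # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
"""

# "XXXX" or "XXXX..YYYY", then ";", then the property name.
LINE_RE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def parse_derived_properties(text):
    """Map property name -> set of code points; skip comments and blanks."""
    props = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        first = int(match.group(1), 16)
        last = int(match.group(2), 16) if match.group(2) else first
        props.setdefault(match.group(3), set()).update(range(first, last + 1))
    return props


props = parse_derived_properties(SAMPLE)
lower = props["Lowercase"]
print(0xAA in lower, 0xBA in lower)
```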

I wonder what needs to be done here.

Is the test in glibc wrong?

If so, it could be fixed by a patch like this:

$ git show | iconv -f iso-8859-1 -t utf-8
commit 25c913674386011a44b6270579a894b2e8200d25
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Wed Dec 3 10:05:42 2014 +0100

    Fix test case localedata/tst-ctype-de_DE.ISO-8859-1.in
    
    DerivedCoreProperties.txt from Unicode 7.0.0 lists
    the characters U+00AA (ª) and U+00BA (º) as lower case:
    
    00AA          ; Lowercase # Lo       FEMININE ORDINAL INDICATOR
    00BA          ; Lowercase # Lo       MASCULINE ORDINAL INDICATOR

diff --git a/localedata/tst-ctype-de_DE.ISO-8859-1.in b/localedata/tst-ctype-de_DE.ISO-8859-1.in
index f71d76c..e124a52 100644
--- a/localedata/tst-ctype-de_DE.ISO-8859-1.in
+++ b/localedata/tst-ctype-de_DE.ISO-8859-1.in
@@ -1,5 +1,5 @@
 lower    ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
-        000000000000000000000100000000000000000000000000
+        000000000010000000000100001000000000000000000000
 lower   ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
         000000000000000111111111111111111111111011111111
 upper    ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
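Each row of digits in that test file gives the expected result for the
characters on the line above it, one digit per character. The changed row can
be recomputed with a small sketch like the one below. It assumes that the
first “lower” row covers the bytes 0xA0 to 0xCF of ISO-8859-1 (48 characters,
the first being a no-break space); the function name and the hand-written
`lower` set are illustrative only.

```python
def class_bit_row(first_byte, last_byte, members, encoding="iso-8859-1"):
    """Return one '0'/'1' digit per byte: '1' if its character is a member."""
    bits = []
    for byte in range(first_byte, last_byte + 1):
        cp = ord(bytes([byte]).decode(encoding))
        bits.append("1" if cp in members else "0")
    return "".join(bits)


# Lower case characters in this byte range after the change:
# the two ordinal indicators plus U+00B5 MICRO SIGN, which was already set.
lower = {0x00AA, 0x00B5, 0x00BA}
row = class_bit_row(0xA0, 0xCF, lower)
print(row)
```

With only U+00B5 in the set, the same function gives the old row, so the two
new digits correspond to the bytes 0xAA and 0xBA.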

Comment 35 Mike FABIAN 2014-12-03 12:27:20 UTC

Created attachment 7988 [details]
0001-Update-LC_CTYPE-character-class-data-to-Unicode-7.0..patch

Comment 36 Mike FABIAN 2014-12-03 12:27:47 UTC

Created attachment 7989 [details]
0002-Fix-test-case-localedata-tst-ctype-de_DE.ISO-8859-1..patch

Comment 37 Mike FABIAN 2014-12-04 10:33:00 UTC

*** Bug 14010 has been marked as a duplicate of this bug. ***

Comment 38 Sourceware Commits 2015-02-20 22:36:45 UTC

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  4a4839c94a4c93ffc0d5b95c69a08b02a57007f2 (commit)
      from  e4a399dc3dbb3228eb39af230ad11bc42a018c93 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a4839c94a4c93ffc0d5b95c69a08b02a57007f2

commit 4a4839c94a4c93ffc0d5b95c69a08b02a57007f2
Author: Alexandre Oliva <aoliva@redhat.com>
Date:   Fri Feb 20 20:14:59 2015 -0200

    Unicode 7.0.0 update; added generator scripts.
    
    for  localedata/ChangeLog
    
    	[BZ #17588]
    	[BZ #13064]
    	[BZ #14094]
    	[BZ #17998]
    	* unicode-gen/Makefile: New.
    	* unicode-gen/unicode-license.txt: New, from Unicode.
    	* unicode-gen/UnicodeData.txt: New, from Unicode.
    	* unicode-gen/DerivedCoreProperties.txt: New, from Unicode.
    	* unicode-gen/EastAsianWidth.txt: New, from Unicode.
    	* unicode-gen/gen_unicode_ctype.py: New generator, from Mike
    	FABIAN <mfabian@redhat.com>.
    	* unicode-gen/ctype_compatibility.py: New verifier, from
    	Pravin Satpute <psatpute@redhat.com> and Mike FABIAN.
    	* unicode-gen/ctype_compatibility_test_cases.py: New verifier
    	module, from Mike FABIAN.
    	* unicode-gen/utf8_gen.py: New generator, from Pravin Satpute
    	and Mike FABIAN.
    	* unicode-gen/utf8_compatibility.py: New verifier, from Pravin
    	Satpute and Mike FABIAN.
    	* charmaps/UTF-8: Update.
    	* locales/i18n: Update.
    	* gen-unicode-ctype.c: Remove.
    	* tst-ctype-de_DE.ISO-8859-1.in: Adjust, islower now returns
    	true for ordinal indicators.

-----------------------------------------------------------------------

Summary of changes:
 NEWS                                               |   11 +-
 localedata/ChangeLog                               |   27 +
 localedata/charmaps/UTF-8                          |11946 ++++++---
 localedata/gen-unicode-ctype.c                     |  784 -
 localedata/locales/i18n                            | 2652 +-
 localedata/tst-ctype-de_DE.ISO-8859-1.in           |    2 +-
 localedata/unicode-gen/DerivedCoreProperties.txt   |10794 ++++++++
 localedata/unicode-gen/EastAsianWidth.txt          | 2121 ++
 localedata/unicode-gen/Makefile                    |   99 +
 localedata/unicode-gen/UnicodeData.txt             |27268 ++++++++++++++++++++
 localedata/unicode-gen/ctype_compatibility.py      |  546 +
 .../unicode-gen/ctype_compatibility_test_cases.py  |  951 +
 localedata/unicode-gen/gen_unicode_ctype.py        |  751 +
 localedata/unicode-gen/unicode-license.txt         |   50 +
 localedata/unicode-gen/utf8_compatibility.py       |  399 +
 localedata/unicode-gen/utf8_gen.py                 |  286 +
 16 files changed, 53305 insertions(+), 5382 deletions(-)
 delete mode 100644 localedata/gen-unicode-ctype.c
 create mode 100644 localedata/unicode-gen/DerivedCoreProperties.txt
 create mode 100644 localedata/unicode-gen/EastAsianWidth.txt
 create mode 100644 localedata/unicode-gen/Makefile
 create mode 100644 localedata/unicode-gen/UnicodeData.txt
 create mode 100755 localedata/unicode-gen/ctype_compatibility.py
 create mode 100644 localedata/unicode-gen/ctype_compatibility_test_cases.py
 create mode 100755 localedata/unicode-gen/gen_unicode_ctype.py
 create mode 100644 localedata/unicode-gen/unicode-license.txt
 create mode 100755 localedata/unicode-gen/utf8_compatibility.py
 create mode 100755 localedata/unicode-gen/utf8_gen.py

Comment 39 Alexandre Oliva 2015-02-21 20:24:28 UTC

Fixed

Comment 40 Egmont Koblinger 2016-03-22 09:29:44 UTC

Please see bug 19852 for a followup issue.