The Unicode locale data - character map and LC_CTYPE information - should be updated from Unicode 6.1 (the character map is currently based on 6.0, and LC_CTYPE is currently based on 5.0). This should be done with proper automation, and wiki documentation should be added describing how to do future updates. I identified the following tasks at <http://sourceware.org/ml/libc-alpha/2012-05/msg00590.html>:

* Ensure the character type data in localedata/charmaps/i18n can be properly reproduced from Unicode 5.0 data using gen-unicode-ctype.c, adapting gen-unicode-ctype.c as needed to replicate any changes that may have been made not using that program.

* Update the character type data to Unicode 6.1, removing any local hacks from gen-unicode-ctype.c that are no longer needed. (10646:2012, corresponding to Unicode 6.1, appears to be in publication stage so should be out very soon.)

* Ensure the character data in localedata/charmaps/UTF-8 can be reproduced in some automated fashion from Unicode 6.0, locating any previously used automation for this or creating some new automation if any previous automation can't be found.

* Update the character data to Unicode 6.1, removing any local hacks in the automation from the previous step.

* Document thoroughly on the wiki how the automation works and how to do updates to new Unicode versions.
One of the major "local hacks" can be fixed, fixing many other problems at the same time, by switching to using the Unicode "Alphabetic" property (from DerivedCoreProperties.txt) instead of just categories L* for class alpha. Right now there are many languages whose letters are considered non-alphabetic by glibc because they're in category Mn or Mc or even Cf. There are "local hacks" to fix this for maybe one or two languages, but using the right Unicode property would fix it for all languages.
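As a rough illustration of the suggestion above, here is a minimal Python sketch that collects the code points carrying a given derived property. It is not taken from any glibc script; the function name `parse_derived_properties` and the regular expression are assumptions made for this example, and the sample lines are the DerivedCoreProperties.txt entries quoted elsewhere in this bug.

```python
import re

# Matches "2DE0..2DFF ; Alphabetic # ..." as well as "00AA ; Lowercase # ...".
# The pattern is an assumption for this sketch, not the glibc implementation.
_LINE = re.compile(r'^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)')


def parse_derived_properties(lines, wanted):
    """Return the set of code points that carry the property `wanted`."""
    result = set()
    for line in lines:
        match = _LINE.match(line)
        if not match or match.group(3) != wanted:
            continue  # comment, blank line, or a different property
        first = int(match.group(1), 16)
        last = int(match.group(2) or match.group(1), 16)
        result.update(range(first, last + 1))
    return result


# Entries as quoted in this bug report (Unicode 7.0.0).
SAMPLE = [
    "2DE0..2DFF ; Alphabetic # Mn [32] COMBINING CYRILLIC LETTER BE.."
    "COMBINING CYRILLIC LETTER IOTIFIED BIG YUS",
    "00AA ; Lowercase # Lo FEMININE ORDINAL INDICATOR",
    "00BA ; Lowercase # Lo MASCULINE ORDINAL INDICATOR",
]

alpha = parse_derived_properties(SAMPLE, "Alphabetic")
lower = parse_derived_properties(SAMPLE, "Lowercase")
print(len(alpha), sorted(hex(cp) for cp in lower))
```

With this approach a combining letter such as U+2DF6 (category Mn) ends up in the alphabetic set without any per-language special case.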
*** Bug 16969 has been marked as a duplicate of this bug. ***
Rather than Unicode 6.1, the target should be Unicode 6.3. The two files mentioned in this bug are:

1. i18n (LC_CTYPE), which used to be generated by gen-unicode-ctype.c.
2. UTF-8, which looks like a conversion from Unicode to UTF-8; I will find out.

Are there any other files involved in upgrading glibc localedata to Unicode 6.1?
Once the data is updated (maybe once just the character map is updated), __STDC_ISO_10646__ should be updated in include/stdc-predef.h to reflect the publication date of the edition or amendment to ISO 10646 corresponding to the version of Unicode in use. I advise keeping each of the tasks I listed as a separate patch, as it's important to be confident we aren't losing desired local changes in the course of the update (which means the existing files need to be reproduced exactly by some automation before the update is done). Bug 16061 relates to transliteration data, some of which came from Unicode, and bug 14095 to collation data. The same principles apply to those - reproduce the existing files, understanding any local changes in the process, then update to a newer Unicode version - but they are likely to involve much more work in understanding the existing state then updating while preserving any desired local changes.
Yes, backward compatibility is a must. I will write a small script to check that we are not changing the existing maps, so we can be confident before committing.
I have written a script for checking the backward compatibility of the new LC_CTYPE with the old LC_CTYPE.

The script is available at https://github.com/pravins/glibc-i18n

The important thing for us at present is the report generated by the script, i.e. https://raw.githubusercontent.com/pravins/glibc-i18n/master/Report

While doing this I also found that the existing i18n file includes <U0D70>..<U0D75> twice:

% MALAYALAM/
<U0D66>..<U0D75>;<U0D70>..<U0D75>;/

Let me know if anything is missing.

As the next step, I will check the characters of LC_CTYPE 5.0.0 that are missing from LC_CTYPE 6.3.0 and confirm whether these are intentional changes in Unicode or something we are missing.

I will be ready with a patch for updating LC_CTYPE next time.
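The kind of check described here can be sketched in a few lines of Python. This is only an illustration, not the script from the linked repository; the helper names `parse_ranges`, `duplicated` and `missing` are made up for this example, and the toy class data at the end is hypothetical.

```python
import re

# A single <Uxxxx> symbol or a <Uxxxx>..<Uyyyy> range as used in locales/i18n.
_RANGE = re.compile(r'<U([0-9A-F]{4,8})>(?:\.\.<U([0-9A-F]{4,8})>)?')


def parse_ranges(text):
    """Return a list of (first, last) code point pairs found in `text`."""
    return [(int(a, 16), int(b or a, 16)) for a, b in _RANGE.findall(text)]


def duplicated(ranges):
    """Return the code points that are listed more than once."""
    seen, dupes = set(), set()
    for first, last in ranges:
        for cp in range(first, last + 1):
            (dupes if cp in seen else seen).add(cp)
    return dupes


def missing(old, new):
    """Per class, the code points present in `old` but absent from `new`."""
    return {cls: old[cls] - new.get(cls, set()) for cls in old}


# The MALAYALAM line quoted above lists <U0D70>..<U0D75> twice.
malayalam = parse_ranges("<U0D66>..<U0D75>;<U0D70>..<U0D75>;/")
print(sorted(hex(cp) for cp in duplicated(malayalam)))

# Hypothetical toy data: one character dropped from "alpha".
print(missing({"alpha": {0x41, 0x42}}, {"alpha": {0x41}}))
```

Comparing sets per class keeps the report independent of how the ranges happen to be written in either file.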
(In reply to Pravin S from comment #6)
> I have written script for checking backward compabitibility of new LC_CTYPE
> with old LC_CTYPE.
>
> Script is available at https://github.com/pravins/glibc-i18n
>
> Important thing for us presently is report generated by script. i.e.
>
> https://raw.githubusercontent.com/pravins/glibc-i18n/master/Report
>
> While doing this also found in existing i18n file <U0D70>..<U0D75>; included
> twice.
>
> % MALAYALAM/
> <U0D66>..<U0D75>;<U0D70>..<U0D75>;/
>
> Let me know if anything is missing.
>
> In next step, i will check missing characters from LC_CTYPE 5.0.0 with
> LC_CTYPE 6.3.0 and confirm are these intentional changes at Unicode or
> something we are missing.
>
> Will be ready with patch for updating LC_CTYPE next time.

Thanks Pravin!

I think the missing step is to get these scripts checked into glibc's scripts/ directory so that we have them in a central location with some internal comments showing how to run the script. This way we can re-run them at later stages to verify what's missing and stay in sync (say the release manager runs it before a release).

Eventually we want a documented process here: https://sourceware.org/glibc/wiki/Regeneration

Even if it's just "Run this script. Fix all warnings by hand" it would be a good start.
Agreed, I will do that.
(In reply to Rich Felker from comment #1)
> One of the major "local hacks" can be fixed, fixing many other problems at
> the same time, by switching to using the Unicode "Alphabetic" property (from
> DerivedCoreProperties.txt) instead of just categories L* for class alpha.
> Right now there are many languages whose letters are considered
> non-alphabetic by glibc because they're in category Mn or Mc or even Cf.
> There are "local hacks" to fix this for maybe one or two languages, but
> using the right Unicode property would fix it for all languages.

I was almost done, but while updating this I found that around 248 characters had been added to the ALPHA group of the present i18n CTYPE after gen-unicode-ctype.c processing (Unicode 5.1, https://github.com/pravins/glibc-i18n/blob/master/unicode5-1/Report), and I am facing the same issue while upgrading it to Unicode 6.3 (246 characters, https://github.com/pravins/glibc-i18n/blob/master/Report).

While reading http://www.unicode.org/reports/tr44/#Property_List_Table I found this statement:

"Implementations should simply use the derived properties, and should not try to rederive them from lists of simple properties and collections of rules, because of the chances for error and divergence when doing so."

I agree with Rich: we should take the derived properties from DerivedCoreProperties.txt rather than processing raw UnicodeData.txt.

I am writing a script to process the groups from DerivedCoreProperties.txt.
I am working with the latest Unicode standard, so I updated the bug summary.
(In reply to Joseph Myers from comment #0)
>
> * Ensure the character data in localedata/charmaps/UTF-8 can be
>   reproduced in some automated fashion from Unicode 6.0, locating any
>   previously used automation for this or creating some new automation
>   if any previous automation can't be found.

I am also not able to find the previous automation for this.

I could simply pass all Unicode code points through a Python Unicode-to-UTF-8 conversion and format the result as required by the UTF-8 file.

Any hints on how to do this?
(In reply to Pravin S from comment #11)
> (In reply to Joseph Myers from comment #0)
> >
> > * Ensure the character data in localedata/charmaps/UTF-8 can be
> >   reproduced in some automated fashion from Unicode 6.0, locating any
> >   previously used automation for this or creating some new automation
> >   if any previous automation can't be found.
>
> Me too not able to find previous automation for same.
>
> I can simply pass all Unicode to python unicode-to-utf8 and format it as
> required by UTF-8 file.
>
> Any hint on how to do this?

Not really, this is why this problem requires "work" ;-)
Created attachment 7679 [details]
Patch to update UTF-8 CHARMAP to unicode 7.0

I have worked on updating the UTF-8 file to Unicode 7.0. The following points are important before reviewing this patch.

1. The present patch is only for CHARMAP; a patch for updating WIDTH will be available soon.

2. utf8-gen.py: New script to generate the UTF-8 file.

3. The patch was created ignoring whitespace changes (-w).

4. Where the UnicodeData.txt file gives characters as a range, for example:

   3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
   4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;

   the UTF-8 file lists the range in pieces between the First and Last Unicode characters, each piece ending 0x3F after its first code point, for example:

   <U3400>..<U343F> /xe3/x90/x80 <CJK Ideograph Extension A>
   .
   .
   <U4D80>..<U4DB5> /xe4/xb6/x80 <CJK Ideograph Extension A>

   Note: I have no idea why the Hangul syllables AC00..D7A3 were not expanded in the Unicode 5.0 UTF-8 file. For consistency, we are expanding Hangul as well.

5. Names have changed in UnicodeData.txt in some cases. Some characters have <control> as a name, so the "Unicode 1.0 Name" is used instead. Characters U+0080, U+0081, U+0084 and U+0099 have "<control>" as a name and not even a "Unicode 1.0 Name" (10th field) in UnicodeData.txt. We could write code to take their alternate names from NameAliases.txt.
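The range handling in point 4 can be illustrated with a short Python sketch. It is not the attached utf8-gen.py; the names `utf8_bytes` and `range_lines` are invented here, the single-space column layout is a simplification of the real file, and surrogate code points (which Python cannot encode to UTF-8) are not handled.

```python
def utf8_bytes(code_point):
    """Return the UTF-8 encoding in the /xNN notation used by charmaps."""
    return "".join(f"/x{byte:02x}" for byte in chr(code_point).encode("utf-8"))


def range_lines(first, last, name):
    """Split first..last into pieces that each end on a 0x3F boundary."""
    lines = []
    start = first
    while start <= last:
        # Each piece ends where the low six bits are all set, or at `last`.
        end = min(start | 0x3F, last)
        lines.append(
            f"<U{start:04X}>..<U{end:04X}> {utf8_bytes(start)} <{name}>")
        start = end + 1
    return lines


cjk = range_lines(0x3400, 0x4DB5, "CJK Ideograph Extension A")
print(cjk[0])
print(cjk[-1])
```

Splitting on 64-code-point boundaries means every piece shares all UTF-8 bytes except the last one, which is presumably why the existing file uses this layout.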
Created attachment 7715 [details]
Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0

All work on the UTF-8 file is done. I added two scripts:

1. utf8-gen.py: generates the UTF-8 file.
2. utf8-compatibility.py: checks the backward compatibility of the newly generated UTF-8 file.

A report on the backward compatibility of the new UTF-8 file is available at https://raw.githubusercontent.com/pravins/glibc-i18n/master/report-utf8

I am submitting this to libc-alpha; please help with a quick review and push to git.
Created attachment 7720 [details]
Patch to update UTF-8 i18n file (CTYPE) to unicode 7.0

The patch does the following:

* locales/i18n: Updated to Unicode 7.0.0.
* scripts/gen-unicode-ctype.c: Disabled upper, lower, alpha and outdigit classes.
* scripts/ctype-gen.sh: Shell script to generate LC_CTYPE for a new Unicode version.
* scripts/gen-unicode-ctype-dcp.py: New script for generating the locales/i18n upper, lower and alpha ctype from DerivedCoreProperties.txt.
* scripts/ctype-compatibility.py: Script for testing the backward compatibility of LC_CTYPE in locales/i18n.

A backward compatibility report is available at https://raw.githubusercontent.com/pravins/glibc-i18n/master/unicode7-0/ctype-compatibility5_1-to-7_0
Pravin, Is any part of your work ready for 2.21 when it opens?
I am still waiting for someone to review these patches.

The best way would be:

1. Build glibc with the patches.
2. Test the WIDTH and CTYPE functions (do they return proper values?). One could do the same with the existing glibc and compare.
(In reply to Pravin S from comment #14)
> Created attachment 7715 [details]
> Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0
>
> Done with all work with UTF-8 file.
> Added two script:
> 1. utf8-gen.py to generate UTF-8 file
> 2. utf8-compatibility.py : to check backward compatibility of newly
> generated UTF-8 file
> 3. Report of new UTF-8 file backward compatibility is available AT
> https://raw.githubusercontent.com/pravins/glibc-i18n/master/report-utf8
>
> Submitting to glibc-alpha, please help to quick review and push to git.

I checked the scripts Pravin used and the resulting UTF-8 file.

I found only one minor problem:

In some cases, both UnicodeData.txt and EastAsianWidth.txt have information about width. For example, EastAsianWidth.txt has:

    302A..302D;W # Mn [4] IDEOGRAPHIC LEVEL TONE MARK..IDEOGRAPHIC ENTERING TONE MARK

which gives us width 2 for these 4 characters (because of “W”) but UnicodeData.txt has:

    302A;IDEOGRAPHIC LEVEL TONE MARK;Mn;218;NSM;;;;;N;;;;;
    302B;IDEOGRAPHIC RISING TONE MARK;Mn;228;NSM;;;;;N;;;;;
    302C;IDEOGRAPHIC DEPARTING TONE MARK;Mn;232;NSM;;;;;N;;;;;
    302D;IDEOGRAPHIC ENTERING TONE MARK;Mn;222;NSM;;;;;N;;;;;

which would give width 0 (because of “NSM”).

I changed Pravin’s script a bit to prefer the information from EastAsianWidth.txt in case of conflicts.

Pravin has already merged my change into his git repository.
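The precedence rule described in this comment can be sketched as follows. This is a simplified illustration, not the code in Pravin's repository: the function name `char_width` and the two dictionaries are assumptions for this example, treating “F” (fullwidth) the same as “W” is an added assumption, and the real script may apply further rules.

```python
def char_width(code_point, east_asian_width, bidi_class):
    """Return 2, 0 or 1, preferring EastAsianWidth.txt on conflicts."""
    # EastAsianWidth.txt is consulted first, so "W" wins over "NSM".
    if east_asian_width.get(code_point) in ("W", "F"):
        return 2
    # Only then fall back to the bidi class from UnicodeData.txt.
    if bidi_class.get(code_point) == "NSM":
        return 0
    return 1


# Data taken from the lines quoted in this bug report.
EAST_ASIAN_WIDTH = {cp: "W" for cp in range(0x302A, 0x302E)}
BIDI_CLASS = {cp: "NSM" for cp in range(0x302A, 0x302E)}
BIDI_CLASS[0x0901] = "NSM"  # DEVANAGARI SIGN CANDRABINDU

for cp in (0x302A, 0x0901, 0x0041):
    print(hex(cp), char_width(cp, EAST_ASIAN_WIDTH, BIDI_CLASS))
```

Checking the East Asian width before the bidi class is what makes U+302A..U+302D come out as width 2 instead of 0.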
I extended Pravin’s ctype-compatibility.py script to produce more human readable output and added many extra tests.

Joseph Myers> * Ensure the character type data in
Joseph Myers>   localedata/charmaps/i18n can be properly reproduced from
Joseph Myers>   Unicode 5.0 data using gen-unicode-ctype.c, adapting
Joseph Myers>   gen-unicode-ctype.c as needed to replicate any changes
Joseph Myers>   that may have been made not using that program.

When using gen-unicode-ctype.c with UnicodeData.txt-5.0.0 to generate LC_CTYPE, the generated file lacks many characters which apparently have been manually added to glibc’s i18n file:

    alpha: Missing 1238 characters of old ctype in new ctype
    blank: Missing 0 characters of old ctype in new ctype
    cntrl: Missing 0 characters of old ctype in new ctype
    combining: Missing 124 characters of old ctype in new ctype
    combining_level3: Missing 49 characters of old ctype in new ctype
    digit: Missing 0 characters of old ctype in new ctype
    graph: Missing 1571 characters of old ctype in new ctype
    lower: Missing 115 characters of old ctype in new ctype
    print: Missing 1571 characters of old ctype in new ctype
    punct: Missing 335 characters of old ctype in new ctype
    space: Missing 0 characters of old ctype in new ctype
    tolower: Missing 19 characters of old ctype in new ctype
    totitle: Missing 8 characters of old ctype in new ctype
    toupper: Missing 18 characters of old ctype in new ctype
    upper: Missing 100 characters of old ctype in new ctype
    xdigit: Missing 0 characters of old ctype in new ctype

I.e. reproducing the localedata/charmaps/i18n character type data from Unicode 5.0 data using gen-unicode-ctype.c does not work well because glibc’s i18n file apparently has been edited manually a lot already to include newer Unicode data.

Apparently quite a few mistakes have been made by manually editing the i18n file. For example, the report from ctype-compatibility.py also produces for the old i18n file:

    error: 0xa67f ꙿ punct True: 0xa67f CYRILLIC PAYEROK. Not in Unicode 5.0.0. In Unicode 7.0.0. General category Lm (Letter modifier). DerivedCoreProperties.txt says it is “Alphabetic”. Apparently added manually to punct by mistake in glibc’s old LC_CTYPE.

    error: 0xa67f ꙿ alpha False: 0xa67f CYRILLIC PAYEROK. Not in Unicode 5.0.0. In Unicode 7.0.0. General category Lm (Letter modifier). DerivedCoreProperties.txt says it is “Alphabetic”. Apparently added manually to punct by mistake in glibc’s old LC_CTYPE.

Another example:

    error: 0x9f4 ৴ alpha True:
    “09F4;BENGALI CURRENCY NUMERATOR ONE;No;0;L;;;;1/16;N;;;;;”
    “09F5;BENGALI CURRENCY NUMERATOR TWO;No;0;L;;;;1/8;N;;;;;”
    “09F6;BENGALI CURRENCY NUMERATOR THREE;No;0;L;;;;3/16;N;;;;;”
    “09F7;BENGALI CURRENCY NUMERATOR FOUR;No;0;L;;;;1/4;N;;;;;”
    “09F8;BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR;No;0;L;;;;3/4;N;;;;;”
    “09F9;BENGALI CURRENCY DENOMINATOR SIXTEEN;No;0;L;;;;16;N;;;;;”
    “09FA;BENGALI ISSHAR;So;0;L;;;;;N;;;;;”
    According to DerivedCoreProperties.txt (7.0.0) these are *not* “Alphabetic”.

So these have been mistakenly added to “alpha” in the old i18n file of glibc (but gen-unicode-ctype.c correctly puts them into “punct”, i.e. this seems to be another mistake made by manual editing).

Some of the errors reported by ctype-compatibility.py, such as

    error: 0x250 ɐ lower False: Should be lower in Unicode 7.0.0 (was not lower in Unicode 5.0.0).

would be fixed by using gen-unicode-ctype.c with Unicode 7.0.0 input.

There are many more problems like this in the old i18n file; my tests found 133 errors in total:

    ------------------------------------------------------------
    Old file = /local/mfabian/src/glibc/localedata/locales/i18n
    Number of errors in old file = 133
    ------------------------------------------------------------

I’ll attach the full report.
Created attachment 7907 [details] unicode-5.0.0-report-full-output Full report from ctype-compatibility.py when comparing the old i18n file in glibc with the file generated by gen-unicode-ctype.c using UnicodeData.txt from Unicode 5.0.0.
Now when using gen-unicode-ctype.c with UnicodeData.txt-7.0.0 to generate LC_CTYPE, the generated file lacks far fewer characters compared to the old i18n file in glibc:

    alpha: Missing 246 characters of old ctype in new ctype
    blank: Missing 1 characters of old ctype in new ctype
    cntrl: Missing 0 characters of old ctype in new ctype
    combining: Missing 3 characters of old ctype in new ctype
    combining_level3: Missing 5 characters of old ctype in new ctype
    digit: Missing 0 characters of old ctype in new ctype
    graph: Missing 0 characters of old ctype in new ctype
    lower: Missing 20 characters of old ctype in new ctype
    print: Missing 0 characters of old ctype in new ctype
    punct: Missing 16 characters of old ctype in new ctype
    space: Missing 1 characters of old ctype in new ctype
    tolower: Missing 0 characters of old ctype in new ctype
    totitle: Missing 0 characters of old ctype in new ctype
    toupper: Missing 0 characters of old ctype in new ctype
    upper: Missing 0 characters of old ctype in new ctype
    xdigit: Missing 0 characters of old ctype in new ctype

For example, gen-unicode-ctype.c does not put U+0901 into the “alpha” class although it should be there according to DerivedCoreProperties.txt:

    error: 0x901 ँ alpha False: These have general category “Mn” i.e. these are combining characters (both in UnicodeData.txt 5.0.0 and 7.0.0):
    “0901;DEVANAGARI SIGN CANDRABINDU;Mn;0;NSM;;;;;N;;;;;”,
    “0902;DEVANAGARI SIGN ANUSVARA;Mn;0;NSM;;;;;N;;;;;”,
    “0903;DEVANAGARI SIGN VISARGA;Mc;0;L;;;;;N;;;;;”.
    According to DerivedCoreProperties.txt (7.0.0) these are “Alphabetic”.

Apparently this has been edited manually (correctly) in the old i18n file of glibc. So this would be fixed in the automatic generation when using DerivedCoreProperties.txt for “alpha”.

But some of the above seem to be errors in the old i18n file of glibc, for example:

    error: 0x1090 ႐ punct True: MYANMAR SHAN DIGIT ZERO - MYANMAR SHAN DIGIT NINE. These are digits, but because ISO C 99 forbids to put them into digit they should go into alpha.

This is in “punct” in the old i18n file but gen-unicode-ctype.c would put it into “alpha”, which seems better for such digits according to the comments in gen-unicode-ctype.c.

I went through all these “Missing” characters individually, looked them up in UnicodeData.txt and DerivedCoreProperties.txt, checked how they should be classified, and added test cases for them to the ctype-compatibility.py script.

I’ll attach the full report after using gen-unicode-ctype.c with UnicodeData.txt-7.0.0 to generate LC_CTYPE.
Created attachment 7908 [details] unicode-7.0.0-report-full-output Full report from ctype-compatibility.py when comparing the old i18n file in glibc with the file generated by gen-unicode-ctype.c using UnicodeData.txt from Unicode 7.0.0.
Now Pravin’s approach in the patch attached to comment#15 is to comment out the generation of “upper”, “lower” and “alpha” from gen-unicode-ctype.c and add another script gen-unicode-ctype-dcp.py which adds these. But this is a bit problematic.

1) It does not put digits like

    alpha: Missing: ٠ 0x660 ARABIC-INDIC DIGIT ZERO

into “alpha”, which gen-unicode-ctype.c would have done. gen-unicode-ctype.c contains the comment

    /* Consider all the non-ASCII digits as alphabetic.
       ISO C 99 forbids us to have them in category "digit",
       but we want iswalnum to return true on them. */

which sounds reasonable.

2) It does not put characters like

    lower: Missing: ǅ 0x1c5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON

into “lower”. This is actually title case, not lower case, but glibc only has “lower” and “upper”, not “title”, although it has “toupper”, “tolower”, and “totitle”. gen-unicode-ctype.c puts characters which change when “toupper” is applied into “lower” and characters which change when “tolower” is applied into “upper”. Therefore, gen-unicode-ctype.c puts title case characters like ǅ 0x1c5 into *both* “upper” *and* “lower”, which seems reasonable if glibc has no “title”.

3) It does not put some characters like

    upper: Missing: ᾈ 0x1f88 GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI

into “upper”. Surprisingly, “U+1F88 ᾈ GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI” is *not* listed as “Uppercase” in http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt, although U+1F88 seems to be Uppercase according to http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt because it has a tolower mapping to U+1F80:

    1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;Ll;0;L;1F00 0345;;;;N;;;1F88;;1F88
    1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI;Lt;0;L;1F08 0345;;;;N;;;;1F80;

So this might be a bug in DerivedCoreProperties.txt. Generating “upper” and “lower” the way gen-unicode-ctype.c does, i.e. just using UnicodeData.txt and checking whether characters change when mapping them to upper or to lower, does not produce this error. I think the approach gen-unicode-ctype.c uses for “upper” and “lower” is fine; it is not necessary to use DerivedCoreProperties.txt for this.

4) *Many* characters end up being in “alpha” *and* “punct”. For example:

    error: ⷶ 0x2df6 is alpha and punct

gen-unicode-ctype.c has the comment:

    /* alpha restriction: "No character specified for the keywords cntrl,
       digit, punct or space shall be specified." */

This restriction is violated because the second script gen-unicode-ctype-dcp.py used in Pravin’s 2-pass approach does not check whether gen-unicode-ctype.c has already put a character into “punct” before putting it into “alpha”.

The character “ⷶ U+2df6 COMBINING CYRILLIC LETTER A” is “Alphabetic” according to DerivedCoreProperties.txt:

    2DE0..2DFF ; Alphabetic # Mn [32] COMBINING CYRILLIC LETTER BE..COMBINING CYRILLIC LETTER IOTIFIED BIG YUS

So Pravin’s script does rightly put it into “alpha”.

But looking at this, it seems not a good idea to have two independent programs generating the file in 2 independent passes. Verifications like those gen-unicode-ctype.c does:

    /* toupper restriction: "Only characters specified for the keywords
       lower and upper shall be specified. */
    ...
    /* tolower restriction: "Only characters specified for the keywords
       lower and upper shall be specified. */
    ...
    /* alpha restriction: "Characters classified as either upper or lower
       shall automatically belong to this class. */
    ...
    /* alpha restriction: "No character specified for the keywords cntrl,
       digit, punct or space shall be specified." */
    ...
    /* space restriction: "No character specified for the keywords upper,
       lower, alpha, digit, graph or xdigit shall be specified."
       upper, lower, alpha already checked above. */
    ...
    /* cntrl restriction: "No character specified for the keywords upper,
       lower, alpha, digit, punct, graph, print or xdigit shall be
       specified." upper, lower, alpha already checked above. */
    ...

can be done much more easily when using a single program.
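To make the point about a single program concrete, here is a minimal Python sketch of two of the quoted restrictions checked over one in-memory table of classes. It is an illustration only, not code from gen-unicode-ctype.c or any attached script; the function name `check_alpha_restrictions`, the dictionary layout and the wording of the second message are assumptions.

```python
def check_alpha_restrictions(classes):
    """Return error strings for violations of two "alpha" restrictions.

    `classes` maps a class name to a set of code points (assumed layout).
    """
    errors = []
    alpha = classes.get("alpha", set())
    # "No character specified for the keywords cntrl, digit, punct or
    # space shall be specified."
    for other in ("cntrl", "digit", "punct", "space"):
        for cp in sorted(alpha & classes.get(other, set())):
            errors.append(f"error: {chr(cp)} 0x{cp:x} is alpha and {other}")
    # "Characters classified as either upper or lower shall automatically
    # belong to this class."
    for name in ("upper", "lower"):
        for cp in sorted(classes.get(name, set()) - alpha):
            errors.append(f"error: {chr(cp)} 0x{cp:x} is {name} but not alpha")
    return errors


# U+2DF6 in both "alpha" and "punct" reproduces the error quoted above.
table = {"alpha": {0x2DF6, 0x61}, "punct": {0x2DF6}, "lower": {0x61}}
for message in check_alpha_restrictions(table):
    print(message)
```

Because all classes live in one table, each restriction is a plain set operation; with two independent passes, the second pass cannot see what the first one decided.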
So I think we should do either:

1) improve gen-unicode-ctype.c and make it use DerivedCoreProperties.txt for “alpha”

or:

2) rewrite gen-unicode-ctype.c in Python. First a rewrite which produces *exactly* the same output as gen-unicode-ctype.c, then add code to use DerivedCoreProperties.txt for “alpha”.

No matter whether extending the C program or writing a Python program, it should be a single program to be able to verify the restrictions mentioned easily.

It would be nice of course to make the program read in the old i18n file, replace the character classes, and write out a new file which keeps the rest of the original file, so that no manual copy&paste of the generated character classes is necessary.
(In reply to Mike FABIAN from comment #24) > No matter whether extending the C-Program or writing a Python program, > it should be a single program to be able to verify the restrictions > mentioned easily. And as a 2nd pass, after the single program to generate the character class data, use ctype-compatibility.py as a "test-suite".
(In reply to Mike FABIAN from comment #18)
> (In reply to Pravin S from comment #14)
> > Created attachment 7715 [details]
> > Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0
> >
> > Done with all work with UTF-8 file.
> > Added two script:
> > 1. utf8-gen.py to generate UTF-8 file
> > 2. utf8-compatibility.py : to check backward compatibility of newly
> > generated UTF-8 file
> > 3. Report of new UTF-8 file backward compatibility is available AT
> > https://raw.githubusercontent.com/pravins/glibc-i18n/master/report-utf8
> >
> > Submitting to glibc-alpha, please help to quick review and push to git.
>
> I checked the scripts Pravin used and the resulting UTF-8 file.
>
> I found only one minor problem:
>
> In some cases, both UnicodeData.txt and EastAsianWidth.txt have information
> about width. For example, EastAsianWidth.txt has:
>
> 302A..302D;W # Mn [4] IDEOGRAPHIC LEVEL TONE MARK..IDEOGRAPHIC
> ENTERING TONE MARK
>
> which gives us width 2 for these 4 characters (because of “W”) but
> UnicodeData.txt has:
>
> 302A;IDEOGRAPHIC LEVEL TONE MARK;Mn;218;NSM;;;;;N;;;;;
> 302B;IDEOGRAPHIC RISING TONE MARK;Mn;228;NSM;;;;;N;;;;;
> 302C;IDEOGRAPHIC DEPARTING TONE MARK;Mn;232;NSM;;;;;N;;;;;
> 302D;IDEOGRAPHIC ENTERING TONE MARK;Mn;222;NSM;;;;;N;;;;;
>
> which would give width 0 (because of “NSM”).
>
> I changed Pravin’s script a bit to prefer the information from
> EastAsianWidth.txt in case of conflicts.
>
> Pravin has already merged my change into his git repository.

Thanks, Mike, for the review.

This bug is presently tracking two changes, one to the i18n file and the other to the UTF-8 file. Both changes are significant, so for better tracking I created a new bug, https://sourceware.org/bugzilla/show_bug.cgi?id=17588, for the UTF-8 file. I will submit the respective patches there.

The i18n ctype work is still pending.
Created attachment 7931 [details]
gen-unicode-ctype.py

Python rewrite of Bruno Haible’s gen-unicode-ctype.c. This version produces *exactly* the same output as the C program:

    $ gcc -o gen-unicode-ctype gen-unicode-ctype.c
    $ ./gen-unicode-ctype UnicodeData.txt 7.0.0
    $ ./gen-unicode-ctype.py -u UnicodeData.txt -o unicode-new --unicode_version 7.0.0
    $ diff -u unicode unicode-new
    $
Created attachment 7932 [details]
gen-unicode-ctype.py

Improved version of gen-unicode-ctype.py which also parses DerivedCoreProperties.txt and uses it (partly) for is_alpha(), is_lower(), and is_upper().

"Partly" for two reasons:

1)

    # Consider all the non-ASCII digits as alphabetic.
    # ISO C 99 forbids us to have them in category “digit”,
    # but we want iswalnum to return true on them.

These digits are not “Alphabetic” in DerivedCoreProperties.txt, but it seems to make sense to treat them as alpha according to this comment by Bruno.

2) Title case characters are treated as both upper *and* lower.
Created attachment 7933 [details] report-gen-unicode-ctype.py-DerivedCoreProperties-7.0.0
(In reply to Mike FABIAN from comment #29)
> Created attachment 7933 [details]
> report-gen-unicode-ctype.py-DerivedCoreProperties-7.0.0

From this report:

    alpha: Missing: ⒜ 0x249c PARENTHESIZED LATIN SMALL LETTER A
    ...

These are *not* “Alphabetic” in DerivedCoreProperties.txt, therefore it is correct to remove them.

978 characters have been removed from “punct” which are now in “alpha” because of DerivedCoreProperties.txt.

Number of errors in new file = 11. These are only errors like:

    error: 0xe2f ฯ alpha True: FIXME: Theppitak Karoonboonyanan <thep@links.nectec.or.th> says <U0E2F>, <U0E46> should belong to punct. DerivedCoreProperties.txt says it is alpha.
    ...
    error: 0xe4e ๎ alpha False: FIXME: gen-unicode-ctype.c: Theppitak Karoonboonyanan <thep@links.nectec.or.th> says <U0E47>..<U0E4E> are is_alpha. DerivedCoreProperties does *not*.

I wrote mail to Theppitak Karoonboonyanan <thep@links.nectec.or.th> and Bruno. The mail to thep@links.nectec.or.th bounced and I did not get an answer from Bruno.

I think it is better to trust DerivedCoreProperties.txt here, so I don’t think these are errors.

So I think my updated gen-unicode-ctype.py produces the character classes correctly (as far as possible with the limitations caused by glibc and ISO C 99).
I think I should probably do another update to gen-unicode-ctype.py to read in the original “i18n” file of glibc and write out a new one replacing the character classes to avoid having to do cut and paste manually.
(In reply to Mike FABIAN from comment #23)
> 3) it does not put some characters like:
>
> upper: Missing: ᾈ 0x1f88 GREEK CAPITAL LETTER ALPHA WITH PSILI AND
> PROSGEGRAMMENI
>
> into “upper”. Surprisingly,
>
> “U+1F88 ᾈ GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI”
> is *not* listed as “Uppercase” in
> http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt .
>
> Although U+1F80 seems to be Uppercase according to
> http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
> because it has a tolower mapping to U+1F80:
>
> 1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;Ll;0;L;1F00
> 0345;;;;N;;;1F88;;1F88
> 1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND
> PROSGEGRAMMENI;Lt;0;L;1F08 0345;;;;N;;;;1F80;
>
> So this might be a bug in DerivedCoreProperties.txt.

It is not a bug in DerivedCoreProperties.txt; I asked on the Unicode mailing list:

http://www.unicode.org/mail-arch/unicode-ml/y2014-m11/0010.html

So these are actually title case as well. That means, because of the restrictions of ISO C 99, these title case characters should be in both the “upper” and “lower” character classes in LC_CTYPE (my gen-unicode-ctype.py from comment#28 does this).
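The rule this comment arrives at can be sketched in Python as follows. This is a simplified illustration and not the attached gen-unicode-ctype.py: the function name `classify_case` is made up, and adding every character of general category Lt to both classes is an assumption that mirrors the outcome described here rather than the script's exact logic.

```python
def classify_case(unicode_data_lines):
    """Return (upper, lower) sets derived from UnicodeData.txt lines."""
    upper, lower = set(), set()
    for line in unicode_data_lines:
        fields = line.split(";")
        cp = int(fields[0], 16)
        category = fields[2]
        # Fields 12 and 13 hold the simple uppercase and lowercase mappings.
        to_upper = int(fields[12], 16) if fields[12] else cp
        to_lower = int(fields[13], 16) if fields[13] else cp
        # A character that changes under tolower counts as "upper",
        # one that changes under toupper counts as "lower";
        # title case (Lt) goes into both because there is no "title" class.
        if to_lower != cp or category == "Lt":
            upper.add(cp)
        if to_upper != cp or category == "Lt":
            lower.add(cp)
    return upper, lower


# The two UnicodeData.txt lines quoted in this bug report.
LINES = [
    "1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;"
    "Ll;0;L;1F00 0345;;;;N;;;1F88;;1F88",
    "1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI;"
    "Lt;0;L;1F08 0345;;;;N;;;;1F80;",
]

upper, lower = classify_case(LINES)
print(sorted(hex(cp) for cp in upper), sorted(hex(cp) for cp in lower))
```

Note that U+1F88 has no simple uppercase mapping, so without the explicit Lt case it would land only in “upper”.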
Created attachment 7979 [details] gen-unicode-ctype.py New version of gen-unicode-ctype.py which can read the head and tail of the original i18n file. To avoid having to cut and paste the generated LC_CTYPE character classes into the new glibc i18n file, read the old file as well. Copy everything from the old file to the newly generated file except the LC_CTYPE character class data, which are generated from the UnicodeData.txt and DerivedCoreProperties.txt given.
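The head-and-tail idea can be illustrated with a small Python sketch. It is not the attached script: the function name `splice_generated` and the two marker lines are hypothetical, chosen only for this example, and the real i18n file does not necessarily contain such markers.

```python
def splice_generated(old_lines, generated, start_marker, end_marker):
    """Keep everything outside the markers, replace what lies between.

    Raises ValueError if either marker line is absent from `old_lines`.
    """
    start = old_lines.index(start_marker)
    end = old_lines.index(end_marker, start + 1)
    # Head (including the start marker) + new data + tail (from end marker).
    return old_lines[:start + 1] + generated + old_lines[end:]


# Hypothetical markers and toy file content, for illustration only.
START = "% BEGIN generated character classes"
END = "% END generated character classes"
old_file = ["LC_CTYPE", START, "upper /", "   <U0041>", END, "END LC_CTYPE"]
new_classes = ["upper /", "   <U0041>..<U005A>"]

for line in splice_generated(old_file, new_classes, START, END):
    print(line)
```

Keeping the head and tail byte-for-byte is what makes the resulting diff show only the regenerated class data.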
When I generate a new glibc/localedata/locales/i18n file using gen-unicode-ctype.py from comment#33, build glibc with that, and then run the tests with “make check”, I get one failure:

    FAIL: localedata/tst-ctype

Looking at why it fails, I find in ./localedata/tst-ctype.out:

    Locale-specific tests for `lower'
    islower('ª' = '\xaa') is true
    islower('º' = '\xba') is true
    Locale-specific tests for `lower'
    ...
    2 errors for `de_DE.ISO-8859-1' locale

The new “lower” character class generated by gen-unicode-ctype.py contains U+00AA ª FEMININE ORDINAL INDICATOR and U+00BA º MASCULINE ORDINAL INDICATOR.

The test tst-ctype run by “make check” wants them *not* to be lower case.

DerivedCoreProperties.txt lists both as lower case though:

    00AA ; Lowercase # Lo FEMININE ORDINAL INDICATOR
    00BA ; Lowercase # Lo MASCULINE ORDINAL INDICATOR

That’s why gen-unicode-ctype.py adds them to the “lower” character class: it adds all characters found in DerivedCoreProperties.txt marked as “Lowercase” to the character class “lower”.

I wonder what needs to be done here. Is the test in glibc wrong? If so, it could be fixed by a patch like this:

    $ git show | iconv -f iso-8859-1 -t utf-8
    commit 25c913674386011a44b6270579a894b2e8200d25
    Author: Mike FABIAN <mfabian@redhat.com>
    Date: Wed Dec 3 10:05:42 2014 +0100

        Fix test case localedata/tst-ctype-de_DE.ISO-8859-1.in

        DerivedCoreProperties.txt from Unicode 7.0.0 lists the characters
        U+00AA (ª) and U+00BA (º) as lower case:

        00AA ; Lowercase # Lo FEMININE ORDINAL INDICATOR
        00BA ; Lowercase # Lo MASCULINE ORDINAL INDICATOR

    diff --git a/localedata/tst-ctype-de_DE.ISO-8859-1.in b/localedata/tst-ctype-de_DE.ISO-8859-1.in
    index f71d76c..e124a52 100644
    --- a/localedata/tst-ctype-de_DE.ISO-8859-1.in
    +++ b/localedata/tst-ctype-de_DE.ISO-8859-1.in
    @@ -1,5 +1,5 @@
     lower ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
    - 000000000000000000000100000000000000000000000000
    + 000000000010000000000100001000000000000000000000
     lower ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
     000000000000000111111111111111111111111011111111
     upper ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
Created attachment 7988 [details] 0001-Update-LC_CTYPE-character-class-data-to-Unicode-7.0..patch
Created attachment 7989 [details] 0002-Fix-test-case-localedata-tst-ctype-de_DE.ISO-8859-1..patch
*** Bug 14010 has been marked as a duplicate of this bug. ***
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, master has been updated
       via  4a4839c94a4c93ffc0d5b95c69a08b02a57007f2 (commit)
      from  e4a399dc3dbb3228eb39af230ad11bc42a018c93 (commit)

Those revisions listed above that are new to this repository have not appeared on any other notification email; so we list those revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a4839c94a4c93ffc0d5b95c69a08b02a57007f2

commit 4a4839c94a4c93ffc0d5b95c69a08b02a57007f2
Author: Alexandre Oliva <aoliva@redhat.com>
Date: Fri Feb 20 20:14:59 2015 -0200

    Unicode 7.0.0 update; added generator scripts.

    for localedata/ChangeLog

    [BZ #17588]
    [BZ #13064]
    [BZ #14094]
    [BZ #17998]
    * unicode-gen/Makefile: New.
    * unicode-gen/unicode-license.txt: New, from Unicode.
    * unicode-gen/UnicodeData.txt: New, from Unicode.
    * unicode-gen/DerivedCoreProperties.txt: New, from Unicode.
    * unicode-gen/EastAsianWidth.txt: New, from Unicode.
    * unicode-gen/gen_unicode_ctype.py: New generator, from Mike
    FABIAN <mfabian@redhat.com>.
    * unicode-gen/ctype_compatibility.py: New verifier, from
    Pravin Satpute <psatpute@redhat.com> and Mike FABIAN.
    * unicode-gen/ctype_compatibility_test_cases.py: New verifier
    module, from Mike FABIAN.
    * unicode-gen/utf8_gen.py: New generator, from Pravin Satpute
    and Mike FABIAN.
    * unicode-gen/utf8_compatibility.py: New verifier, from Pravin
    Satpute and Mike FABIAN.
    * charmaps/UTF-8: Update.
    * locales/i18n: Update.
    * gen-unicode-ctype.c: Remove.
    * tst-ctype-de_DE.ISO-8859-1.in: Adjust, islower now returns
    true for ordinal indicators.

-----------------------------------------------------------------------

Summary of changes:
 NEWS | 11 +-
 localedata/ChangeLog | 27 +
 localedata/charmaps/UTF-8 |11946 ++++++---
 localedata/gen-unicode-ctype.c | 784 -
 localedata/locales/i18n | 2652 +-
 localedata/tst-ctype-de_DE.ISO-8859-1.in | 2 +-
 localedata/unicode-gen/DerivedCoreProperties.txt |10794 ++++++++
 localedata/unicode-gen/EastAsianWidth.txt | 2121 ++
 localedata/unicode-gen/Makefile | 99 +
 localedata/unicode-gen/UnicodeData.txt |27268 ++++++++++++++++++++
 localedata/unicode-gen/ctype_compatibility.py | 546 +
 .../unicode-gen/ctype_compatibility_test_cases.py | 951 +
 localedata/unicode-gen/gen_unicode_ctype.py | 751 +
 localedata/unicode-gen/unicode-license.txt | 50 +
 localedata/unicode-gen/utf8_compatibility.py | 399 +
 localedata/unicode-gen/utf8_gen.py | 286 +
 16 files changed, 53305 insertions(+), 5382 deletions(-)
 delete mode 100644 localedata/gen-unicode-ctype.c
 create mode 100644 localedata/unicode-gen/DerivedCoreProperties.txt
 create mode 100644 localedata/unicode-gen/EastAsianWidth.txt
 create mode 100644 localedata/unicode-gen/Makefile
 create mode 100644 localedata/unicode-gen/UnicodeData.txt
 create mode 100755 localedata/unicode-gen/ctype_compatibility.py
 create mode 100644 localedata/unicode-gen/ctype_compatibility_test_cases.py
 create mode 100755 localedata/unicode-gen/gen_unicode_ctype.py
 create mode 100644 localedata/unicode-gen/unicode-license.txt
 create mode 100755 localedata/unicode-gen/utf8_compatibility.py
 create mode 100755 localedata/unicode-gen/utf8_gen.py
Fixed
Please see bug 19852 for a followup issue.