Forked from #14094. It is good to have separate bugs for the UTF-8 and i18n file updates. Tracking changes and issues will be clearer in the long term.

*************************************************************

Joseph Myers 2012-05-10 20:27:32 UTC

The Unicode locale data - character map and LC_CTYPE information - should be updated from Unicode 6.1 (the character map is currently based on 6.0, and LC_CTYPE is currently based on 5.0). This should be done with proper automation and wiki documentation being added of how to do future updates.

I identified the following tasks at <http://sourceware.org/ml/libc-alpha/2012-05/msg00590.html>:

* Ensure the character type data in localedata/charmaps/i18n can be properly reproduced from Unicode 5.0 data using gen-unicode-ctype.c, adapting gen-unicode-ctype.c as needed to replicate any changes that may have been made not using that program.

* Update the character type data to Unicode 6.1, removing any local hacks from gen-unicode-ctype.c that are no longer needed. (10646:2012, corresponding to Unicode 6.1, appears to be in publication stage so should be out very soon.)

* Ensure the character data in localedata/charmaps/UTF-8 can be reproduced in some automated fashion from Unicode 6.0, locating any previously used automation for this or creating some new automation if any previous automation can't be found.

* Update the character data to Unicode 6.1, removing any local hacks in the automation from the previous step.

* Document thoroughly on the wiki how the automation works and how to do updates to new Unicode versions.

Comment 1 Rich Felker 2012-05-11 03:25:47 UTC

One of the major "local hacks" can be fixed, fixing many other problems at the same time, by switching to using the Unicode "Alphabetic" property (from DerivedCoreProperties.txt) instead of just categories L* for class alpha. Right now there are many languages whose letters are considered non-alphabetic by glibc because they're in category Mn or Mc or even Cf. There are "local hacks" to fix this for maybe one or two languages, but using the right Unicode property would fix it for all languages.

*******************************************************
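As an illustration of the suggestion in comment 1, here is a minimal sketch of collecting the code points that carry the "Alphabetic" property from text in the DerivedCoreProperties.txt format. The function name `parse_property` and the four sample lines are assumptions made for this example; it is not the code that ended up in glibc.

```python
import re

# A few hand-copied sample lines in the DerivedCoreProperties.txt format.
# A real run would read the complete file instead (assumption: the format
# is "START[..END] ; Property # comment").
SAMPLE = """\
0041..005A    ; Alphabetic # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0345          ; Alphabetic # Mn       COMBINING GREEK YPOGEGRAMMENI
0903          ; Alphabetic # Mc       DEVANAGARI SIGN VISARGA
00AD          ; Default_Ignorable_Code_Point # Cf       SOFT HYPHEN
"""

LINE_RE = re.compile(r'^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)')

def parse_property(text, prop):
    """Return the set of code points that carry the property `prop`."""
    result = set()
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match or match.group(3) != prop:
            continue  # comment line, blank line, or a different property
        start = int(match.group(1), 16)
        end = int(match.group(2), 16) if match.group(2) else start
        result.update(range(start, end + 1))
    return result

alphabetic = parse_property(SAMPLE, "Alphabetic")
# The Mn and Mc marks are included although their category is not L*.
print(sorted(hex(cp) for cp in alphabetic if cp > 0x5A))
```

With this approach, class alpha would follow the property directly, so marks in categories Mn or Mc that Unicode considers alphabetic would no longer need per-language special cases.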
Created attachment 7926 [details]
Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0

1. utf8-gen.py: generates the UTF-8 file.
2. utf8-compatibility.py: checks the backward compatibility of the newly generated UTF-8 file.
3. A report on the backward compatibility of the new UTF-8 file is available at https://raw.githubusercontent.com/pravins/glibc-i18n/master/report-utf8
Created attachment 7958 [details]
Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0

Mike reviewed it earlier and made updates to the glibc-i18n git repository: https://github.com/pravins/glibc-i18n

I have updated the patch based on those improvements.

The latest report on backward compatibility is available at https://raw.githubusercontent.com/pravins/glibc-i18n/master/report-utf8

Note: Please file word Analysis, it is done after the report is generated to make sure the changes are correct.

Mike, please review the patch and give your comments.
(In reply to Pravin S from comment #2)

To check whether the newly generated UTF-8 file is correct, I ran the updated version of the utf8-compatibility.py script like this:

python3 utf8-compatibility.py -o ../glibc/localedata/charmaps/UTF-8 -n UTF-8 -u unicode7-0/UnicodeData.txt -e unicode7-0/EastAsianWidth.txt -c

Report on CHARMAP:
This character might be missing in the generated charmap: <U9F80>..<U9FC3>
************************************************************
Report on WIDTH:
Total changed characters in newly generated WIDTH: 88827
changed width: 0x00ad : 1->0 eaw=A category=Cf bidi=BN name=SOFT HYPHEN
...
changed width: 0xa960 : 1->2 eaw=W category=Lo bidi=L name=HANGUL CHOSEONG TIKEUT-MIEUM
...
many such lines
...

Now I look at these lines. For example, the change mentioned above, where the width of a character changes from 1 to 2 and the character has East Asian Width “W” and category “Lo”, is certainly correct. This character was not in the old UTF-8 file. Only characters with width 0 and 2 are listed in the file; 1 is the default width, so every character not in the UTF-8 file gets width 1.

As this change looks correct, I remove all lines like this from my Emacs buffer with:

“M-x flush-lines RET 1->2 eaw=W category=Lo”

Removing lines with obviously correct changes like this quickly reduces the number of lines to look at, and after a while I have only these left:

changed width: 0x00ad : 1->0 eaw=A category=Cf bidi=BN name=SOFT HYPHEN
changed width: 0x3248 : 2->1 eaw=A category=No bidi=L name=CIRCLED NUMBER TEN ON BLACK SQUARE
changed width: 0x3249 : 2->1 eaw=A category=No bidi=L name=CIRCLED NUMBER TWENTY ON BLACK SQUARE
changed width: 0x324a : 2->1 eaw=A category=No bidi=L name=CIRCLED NUMBER THIRTY ON BLACK SQUARE
changed width: 0x324b : 2->1 eaw=A category=No bidi=L name=CIRCLED NUMBER FORTY ON BLACK SQUARE
changed width: 0x324c : 2->1 eaw=A category=No bidi=L name=CIRCLED NUMBER FIFTY ON BLACK SQUARE
changed width: 0x324d : 2->1 eaw=A category=No bidi=L name=CIRCLED NUMBER SIXTY ON BLACK SQUARE
changed width: 0x324e : 2->1 eaw=A category=No bidi=L name=CIRCLED NUMBER SEVENTY ON BLACK SQUARE
changed width: 0x324f : 2->1 eaw=A category=No bidi=L name=CIRCLED NUMBER EIGHTY ON BLACK SQUARE

The change for the characters with eaw=A (East Asian Width “Ambiguous”), where the width changed from 2 to 1, is also correct, I think. The UTF-8 file is a generic file, not one written specifically for an East Asian locale, so the “Ambiguous” characters should not have width 2.

Then only the soft hyphen remains, which puzzles me a bit:

changed width: 0x00ad : 1->0 eaw=A category=Cf bidi=BN name=SOFT HYPHEN

Our script gives width 0 to this character because of category=Cf. But the display width of the soft hyphen depends on whether it is in the middle of a line (where it is invisible) or happens to be at the end of a line, where it should be visible (and doesn’t it have a width greater than zero if it is visible?). Still, giving width 0 to the soft hyphen in the UTF-8 file seems the right thing to me.
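The manual filtering described above could also be scripted. The following is a minimal sketch, assuming the report line format shown in this comment; the helper names `is_expected` and `remaining` and the two rules are hypothetical, chosen to mirror the two cases judged correct above (the first rule is slightly broader than the flush-lines pattern because it ignores the category).

```python
import re

# Three report lines copied from this comment, used as sample input.
REPORT = """\
changed width: 0x00ad : 1->0 eaw=A category=Cf bidi=BN name=SOFT HYPHEN
changed width: 0xa960 : 1->2 eaw=W category=Lo bidi=L name=HANGUL CHOSEONG TIKEUT-MIEUM
changed width: 0x3248 : 2->1 eaw=A category=No bidi=L name=CIRCLED NUMBER TEN ON BLACK SQUARE
"""

LINE_RE = re.compile(
    r'changed width: (0x[0-9a-f]+) : (\d)->(\d) '
    r'eaw=(\w+) category=(\w+) bidi=(\w+) name=(.*)')

def is_expected(old, new, eaw):
    """Hypothetical rules for changes that need no further review."""
    if (old, new) == (1, 2) and eaw in ('W', 'F'):
        return True   # wide or fullwidth characters newly listed with width 2
    if (old, new) == (2, 1) and eaw == 'A':
        return True   # ambiguous characters fall back to the default width
    return False

def remaining(report):
    """Return (code point, name) pairs for changes that still need review."""
    rest = []
    for line in report.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        code, old, new, eaw, _category, _bidi, name = match.groups()
        if not is_expected(int(old), int(new), eaw):
            rest.append((int(code, 16), name))
    return rest

print(remaining(REPORT))
```

On the sample input only the soft hyphen is left over, which matches the outcome of the manual review.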
Here is another one where I have a little bit of doubt left:

changed width: 0x1929 : 0->1 eaw=N category=Mc bidi=L name=LIMBU SUBJOINED LETTER YA

Why is this combining character listed with width 0 in the current UTF-8 file? In our newly generated UTF-8 file it has width 1 (because it is removed from that file). The comment in the existing UTF-8 file in glibc says:

% Character width according to Unicode 5.0.0.
% - Default width is 1.
% - Double-width characters have width 2; generated from
%        "grep '^[^;]*;[WF]' EastAsianWidth.txt"
%   and  "grep '^[^;]*;[^WF]' EastAsianWidth.txt"
% - Non-spacing characters have width 0; generated from PropList.txt or
%   "grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
% - Format control characters have width 0; generated from
%   "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
% - Zero width characters have width 0; generated from
%   "grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt"

This does *not* mention combining characters as needing width 0, and these grep patterns do not match some combining characters. The combining characters with category=Mn get width 0 because they also have bidi=NSM, for example:

changed width: 0x1a1b : 1->0 eaw=N category=Mn bidi=NSM name=BUGINESE VOWEL SIGN AE

but the combining characters with category=Mc are not matched by the above grep patterns, because they do *not* have bidi=NSM. That seems correct, considering they have a positive advance width:

Mn  Nonspacing_Mark  a nonspacing combining mark (zero advance width)
Mc  Spacing_Mark     a spacing combining mark (positive advance width)
Me  Enclosing_Mark   an enclosing combining mark

(http://www.unicode.org/reports/tr44)

But how did these get into the existing UTF-8 file in glibc? It looks like the existing UTF-8 file in glibc was edited manually and not just created using the grep patterns in the comment.
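To make the point about the grep patterns concrete, here is a small sketch that applies the three zero-width patterns from the file comment to sample lines in the UnicodeData.txt format. The category and bidi fields are taken from the report lines quoted in this thread; the remaining fields of the sample lines are written from memory and should be treated as illustrative. The function name `zero_width_by_grep` is an assumption for this example.

```python
import re

# The three "width 0" grep patterns quoted in the UTF-8 file comment.
ZERO_WIDTH_PATTERNS = [
    re.compile(r'^[^;]*;[^;]*;[^;]*;[^;]*;NSM;'),  # bidi class NSM
    re.compile(r'^[^;]*;[^;]*;Cf;'),               # format control characters
    re.compile(r'^[^;]*;ZERO WIDTH '),             # names starting with ZERO WIDTH
]

# Sample lines in the UnicodeData.txt format (illustrative, see lead-in).
SAMPLE = {
    0x1929: "1929;LIMBU SUBJOINED LETTER YA;Mc;0;L;;;;;N;;;;;",
    0x1A1B: "1A1B;BUGINESE VOWEL SIGN AE;Mn;0;NSM;;;;;N;;;;;",
    0x00AD: "00AD;SOFT HYPHEN;Cf;0;BN;;;;;N;;;;;",
}

def zero_width_by_grep(line):
    """True if any of the documented grep patterns matches the line."""
    return any(pattern.match(line) for pattern in ZERO_WIDTH_PATTERNS)

for code_point, line in sorted(SAMPLE.items()):
    print('U+{:04X} zero width: {}'.format(code_point, zero_width_by_grep(line)))
```

The Mc character U+1929 is not matched by any pattern, while the Mn character U+1A1B (via bidi=NSM) and the soft hyphen (via category Cf) are, which is consistent with the observation that the width 0 entry for U+1929 cannot have come from these patterns.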
localedata/ChangeLog entry from the patch from comment#2:

> * scripts/utf8-gen.py: New script for generating UTF-8 CHARMAP from
> latest UnicodeData.txt.
>
> * scripts/utf-compatibility.py: New script for testing backward

- The script is actually called “utf8-compatibility.py”, not “utf-compatibility.py”.
- The patch puts the scripts “utf8-gen.py” and “utf8-compatibility.py” into the “localedata/” directory, not the “scripts/” directory.
Created attachment 7969 [details] Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0 Good catch Mike. Latest patch attached.
When I try to apply the latest patch https://sourceware.org/bugzilla/attachment.cgi?id=7969 I get:

$ git am bug-17588-13064.patch
Applying: updated UTF-8 (charmap and width) to Unicode 7.0
/local/mfabian/src/glibc/.git/rebase-apply/patch:20: trailing whitespace.
* localedata/utf8-gen.py: New script for generating UTF-8 CHARMAP from
/local/mfabian/src/glibc/.git/rebase-apply/patch:35221: trailing whitespace.
# Contributed by
error: patch failed: localedata/charmaps/UTF-8:134
error: localedata/charmaps/UTF-8: patch does not apply
Patch failed at 0001 updated UTF-8 (charmap and width) to Unicode 7.0
The copy of the patch that failed is found in:
   /local/mfabian/src/glibc/.git/rebase-apply/patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Applying it ignoring whitespace works:

$ git am --ignore-space-change bug-17588-13064.patch
Applying: updated UTF-8 (charmap and width) to Unicode 7.0
/local/mfabian/src/glibc/.git/rebase-apply/patch:20: trailing whitespace.
* localedata/utf8-gen.py: New script for generating UTF-8 CHARMAP from
/local/mfabian/src/glibc/.git/rebase-apply/patch:35221: trailing whitespace.
# Contributed by
warning: 2 lines add whitespace errors.

But then we get a very inconsistent use of white space, for example:

@@ -2192,6 +2256,7 @@ CHARMAP
 <U097D> /xe0/xa5/xbd DEVANAGARI LETTER GLOTTAL STOP
 <U097E> /xe0/xa5/xbe DEVANAGARI LETTER DDDA
 <U097F> /xe0/xa5/xbf DEVANAGARI LETTER BBA
+<U0980> /xe0/xa6/x80 BENGALI ANJI
 <U0981> /xe0/xa6/x81 BENGALI SIGN CANDRABINDU
 <U0982> /xe0/xa6/x82 BENGALI SIGN ANUSVARA
 <U0983> /xe0/xa6/x83 BENGALI SIGN VISARGA

It is probably better to always use only a single space after the UTF-8 byte sequence. That would make some lines change only in white space, for example

<U0000>     /x00         NULL

would change to

<U0000>     /x00 NULL

but the end result looks more consistent.
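The single-space convention proposed above can be sketched as follows. Both helpers, `charmap_line` and `normalize_spacing`, are hypothetical names for this example and are not taken from utf8-gen.py. For simplicity the sketch also uses a single space after the `<U....>` symbol, which is an assumption; the real file may pad that first column.

```python
def charmap_line(code_point, name):
    """Format one CHARMAP line with single spaces between the fields.

    Surrogate code points (U+D800..U+DFFF) are not handled here because
    Python refuses to encode them as UTF-8.
    """
    utf8 = chr(code_point).encode('utf-8')
    hex_bytes = ''.join('/x{:02x}'.format(byte) for byte in utf8)
    return '<U{:04X}> {} {}'.format(code_point, hex_bytes, name)

def normalize_spacing(line):
    """Collapse the runs of spaces between the three fields of a line."""
    symbol, hex_bytes, name = line.split(None, 2)
    return '{} {} {}'.format(symbol, hex_bytes, name)

print(charmap_line(0x0980, 'BENGALI ANJI'))
print(normalize_spacing('<U0000>     /x00         NULL'))
```

Generating every line through one formatter means old and new entries cannot differ in spacing, at the cost of a one-time whitespace-only diff for the existing lines.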
Created attachment 7980 [details]
Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0

I agree with you, Mike. I created the earlier patch ignoring whitespace changes, thinking it would be easier to review. Thank you for pointing out that applying such a patch creates inconsistency in the final UTF-8 file.

Yes, there is no reason to use multiple spaces after the UTF-8 hex field. I created a new patch without ignoring whitespace.
I built glibc with the patch from comment#8. It produces some FAILs in “make check”:

FAIL: localedata/cs_CZ.UTF-8/LC_CTYPE
... similar FAILs ...

Shortly after starting “make check” one sees:

./charmaps/UTF-8:42734: unknown character `U00009FCD'
... similar messages ...

All the above problems are caused by ranges of reserved code points which are listed in EastAsianWidth.txt like this:

9FCD..9FFF;W # Cn [51] <reserved-9FCD>..<reserved-9FFF>

These code points are not in UnicodeData.txt. Therefore, they are not generated into the CHARMAP section of glibc’s UTF-8 file, and it causes the above problems if they are generated into the WIDTH section of glibc’s UTF-8 file.

This can be fixed by not generating reserved code points into the WIDTH section, i.e. by ignoring the reserved code points mentioned in EastAsianWidth.txt.

Patch for utf8-gen.py:

diff --git a/utf8-gen.py b/utf8-gen.py
index 57875b6..20b68bb 100755
--- a/utf8-gen.py
+++ b/utf8-gen.py
@@ -218,6 +218,8 @@ if __name__ == "__main__":
     write_comments(outfile, 1)
     elines = []
     for line in easta_file.readlines():
+        if re.match(r'.*<reserved-.+>\.\.<reserved-.+>.*', line):
+            continue
         if re.match(r'^[^;]*;[WF]', line):
             elines.append(line.strip())
     process_width(outfile, flines, elines)
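For readers who want to try the filter outside of utf8-gen.py, here is a self-contained sketch using the same two regular expressions as the patch above. The function name `wide_lines` and the sample text are assumptions for this example; the first sample line is the one quoted in this comment, and the other two are simplified single-character entries.

```python
import re

# Sample text in the EastAsianWidth.txt format (see lead-in for provenance).
EAW_SAMPLE = """\
9FCD..9FFF;W # Cn [51] <reserved-9FCD>..<reserved-9FFF>
A960;W # Lo HANGUL CHOSEONG TIKEUT-MIEUM
00AD;A # Cf SOFT HYPHEN
"""

RESERVED_RE = re.compile(r'.*<reserved-.+>\.\.<reserved-.+>.*')
WIDE_RE = re.compile(r'^[^;]*;[WF]')

def wide_lines(text):
    """Return the W/F lines, skipping ranges of reserved code points."""
    elines = []
    for line in text.splitlines():
        if RESERVED_RE.match(line):
            continue  # reserved ranges have no entry in UnicodeData.txt
        if WIDE_RE.match(line):
            elines.append(line.strip())
    return elines

print(wide_lines(EAW_SAMPLE))
```

Only the U+A960 line survives: the reserved range is skipped before the width test, and the soft hyphen is neither W nor F.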
Created attachment 7987 [details]
Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0

2014-12-01  Pravin Satpute  <psatpute@redhat.com>

    [BZ #17588 #13064]
    * charmaps/UTF-8: Updated UTF-8 CHARMAP and WIDTH to Unicode 7.0.0.
    * localedata/utf8-gen.py: New script for generating UTF-8 CHARMAP from latest UnicodeData.txt.
    * localedata/utf8-compatibility.py: New script for testing backward compatibility of newly generated UTF-8 file.

    Reviewed and improved by Mike FABIAN <mfabian@redhat.com>

------------------------------------------------------------------------------

Yes, I was also able to reproduce the same issues when building glibc with the patch. This patch fixes those issues.
Created attachment 8009 [details]
Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0

2014-12-12  Pravin Satpute  <psatpute@redhat.com>

    [BZ #17588 #13064]
    * charmaps/UTF-8: Updated UTF-8 CHARMAP and WIDTH to Unicode 7.0.0.
    * localedata/utf8_gen.py: New script for generating UTF-8 CHARMAP from latest UnicodeData.txt.
    * localedata/utf8_compatibility.py: New script for testing backward compatibility of newly generated UTF-8 file.

    Reviewed and improved by Mike FABIAN <mfabian@redhat.com>

*******************************************************************************

In this patch Mike fixed the pylint warnings raised by glibc/scripts/pylint.
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, master has been updated
       via  4a4839c94a4c93ffc0d5b95c69a08b02a57007f2 (commit)
      from  e4a399dc3dbb3228eb39af230ad11bc42a018c93 (commit)

Those revisions listed above that are new to this repository have not appeared on any other notification email; so we list those revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a4839c94a4c93ffc0d5b95c69a08b02a57007f2

commit 4a4839c94a4c93ffc0d5b95c69a08b02a57007f2
Author: Alexandre Oliva <aoliva@redhat.com>
Date:   Fri Feb 20 20:14:59 2015 -0200

    Unicode 7.0.0 update; added generator scripts.

    for localedata/ChangeLog

    [BZ #17588]
    [BZ #13064]
    [BZ #14094]
    [BZ #17998]
    * unicode-gen/Makefile: New.
    * unicode-gen/unicode-license.txt: New, from Unicode.
    * unicode-gen/UnicodeData.txt: New, from Unicode.
    * unicode-gen/DerivedCoreProperties.txt: New, from Unicode.
    * unicode-gen/EastAsianWidth.txt: New, from Unicode.
    * unicode-gen/gen_unicode_ctype.py: New generator, from Mike FABIAN <mfabian@redhat.com>.
    * unicode-gen/ctype_compatibility.py: New verifier, from Pravin Satpute <psatpute@redhat.com> and Mike FABIAN.
    * unicode-gen/ctype_compatibility_test_cases.py: New verifier module, from Mike FABIAN.
    * unicode-gen/utf8_gen.py: New generator, from Pravin Satpute and Mike FABIAN.
    * unicode-gen/utf8_compatibility.py: New verifier, from Pravin Satpute and Mike FABIAN.
    * charmaps/UTF-8: Update.
    * locales/i18n: Update.
    * gen-unicode-ctype.c: Remove.
    * tst-ctype-de_DE.ISO-8859-1.in: Adjust, islower now returns true for ordinal indicators.
-----------------------------------------------------------------------

Summary of changes:
 NEWS                                               |    11 +-
 localedata/ChangeLog                               |    27 +
 localedata/charmaps/UTF-8                          | 11946 ++++++---
 localedata/gen-unicode-ctype.c                     |   784 -
 localedata/locales/i18n                            |  2652 +-
 localedata/tst-ctype-de_DE.ISO-8859-1.in           |     2 +-
 localedata/unicode-gen/DerivedCoreProperties.txt   | 10794 ++++++++
 localedata/unicode-gen/EastAsianWidth.txt          |  2121 ++
 localedata/unicode-gen/Makefile                    |    99 +
 localedata/unicode-gen/UnicodeData.txt             | 27268 ++++++++++++++++++++
 localedata/unicode-gen/ctype_compatibility.py      |   546 +
 .../unicode-gen/ctype_compatibility_test_cases.py  |   951 +
 localedata/unicode-gen/gen_unicode_ctype.py        |   751 +
 localedata/unicode-gen/unicode-license.txt         |    50 +
 localedata/unicode-gen/utf8_compatibility.py       |   399 +
 localedata/unicode-gen/utf8_gen.py                 |   286 +
 16 files changed, 53305 insertions(+), 5382 deletions(-)
 delete mode 100644 localedata/gen-unicode-ctype.c
 create mode 100644 localedata/unicode-gen/DerivedCoreProperties.txt
 create mode 100644 localedata/unicode-gen/EastAsianWidth.txt
 create mode 100644 localedata/unicode-gen/Makefile
 create mode 100644 localedata/unicode-gen/UnicodeData.txt
 create mode 100755 localedata/unicode-gen/ctype_compatibility.py
 create mode 100644 localedata/unicode-gen/ctype_compatibility_test_cases.py
 create mode 100755 localedata/unicode-gen/gen_unicode_ctype.py
 create mode 100644 localedata/unicode-gen/unicode-license.txt
 create mode 100755 localedata/unicode-gen/utf8_compatibility.py
 create mode 100755 localedata/unicode-gen/utf8_gen.py
Fixed