Bug 17588

Summary: Update UTF-8 charmap and width to Unicode 7.0.0
Product: glibc Reporter: Pravin S <pravin.d.s>
Component: localedata Assignee: Pravin S <pravin.d.s>
Status: RESOLVED FIXED    
Severity: normal CC: aoliva, libc-locales, maiku.fabian, pravin.d.s
Priority: P2 Flags: fweimer: security-
Version: unspecified   
Target Milestone: ---   
Host: Target:
Build: Last reconfirmed:
Bug Depends on:    
Bug Blocks: 14094    
Attachments: Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0
Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0
Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0
Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0
Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0
Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0

Description Pravin S 2014-11-12 10:11:18 UTC
Forked from #14094. It is good to have separate bugs for the UTF-8 and i18n file updates. Tracking changes and issues will be clearer in the long term.
*************************************************************
 Joseph Myers 2012-05-10 20:27:32 UTC

The Unicode locale data - character map and LC_CTYPE information - should be updated from Unicode 6.1 (the character map is currently based on 6.0, and LC_CTYPE is currently based on 5.0).  This should be done with proper automation and wiki documentation being added of how to do future updates.  I identified the following tasks at <http://sourceware.org/ml/libc-alpha/2012-05/msg00590.html>:

* Ensure the character type data in localedata/charmaps/i18n can be
  properly reproduced from Unicode 5.0 data using gen-unicode-ctype.c,
  adapting gen-unicode-ctype.c as needed to replicate any changes that
  may have been made not using that program.

* Update the character type data to Unicode 6.1, removing any local
  hacks from gen-unicode-ctype.c that are no longer needed.
  (10646:2012, corresponding to Unicode 6.1, appears to be in
  publication stage so should be out very soon.)

* Ensure the character data in localedata/charmaps/UTF-8 can be
  reproduced in some automated fashion from Unicode 6.0, locating any
  previously used automation for this or creating some new automation
  if any previous automation can't be found.

* Update the character data to Unicode 6.1, removing any local hacks
  in the automation from the previous step.

* Document thoroughly on the wiki how the automation works and how to
  do updates to new Unicode versions.

Comment 1 Rich Felker 2012-05-11 03:25:47 UTC

One of the major "local hacks" can be fixed, fixing many other problems at the same time, by switching to using the Unicode "Alphabetic" property (from DerivedCoreProperties.txt) instead of just categories L* for class alpha. Right now there are many languages whose letters are considered non-alphabetic by glibc because they're in category Mn or Mc or even Cf. There are "local hacks" to fix this for maybe one or two languages, but using the right Unicode property would fix it for all languages.
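
As a rough illustration of the suggestion above, the following sketch collects the code points carrying a given property from lines in the DerivedCoreProperties.txt layout. The function name `parse_property` and the sample lines are assumptions made for this example; it is not the code that glibc uses.

```python
def parse_property(lines, wanted="Alphabetic"):
    """Collect the code points that carry the property `wanted`."""
    code_points = set()
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != wanted:
            continue
        start, _, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first  # single code point or range
        code_points.update(range(first, last + 1))
    return code_points


# Illustrative lines in the style of DerivedCoreProperties.txt (assumed sample).
sample = [
    "0041..005A    ; Alphabetic # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "0345          ; Alphabetic # Mn       COMBINING GREEK YPOGEGRAMMENI",
    "002B          ; Math # Sm       PLUS SIGN",
]
alpha = parse_property(sample)
```

With this approach a category Mn character such as U+0345 ends up in class alpha because the property says so, without a per-language special case.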
*******************************************************
Comment 1 Pravin S 2014-11-12 11:22:32 UTC
Created attachment 7926 [details]
Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0

1. utf8-gen.py: generates the UTF-8 file
2. utf8-compatibility.py: checks backward compatibility of the newly generated UTF-8 file
3. A report on the backward compatibility of the new UTF-8 file is available at https://raw.githubusercontent.com/pravins/glibc-i18n/master/report-utf8
Comment 2 Pravin S 2014-11-21 06:27:23 UTC
Created attachment 7958 [details]
Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0

Mike reviewed it earlier and made updates to the glibc-i18n git repository: https://github.com/pravins/glibc-i18n

I have updated the patch based on those improvements.

The latest report on backward compatibility is available at https://raw.githubusercontent.com/pravins/glibc-i18n/master/report-utf8

Note: Please find the word "Analysis" in the report; the analysis is done after the report is generated to make sure the changes are correct.

Mike, please review the patch and give your comments.
Comment 3 Mike FABIAN 2014-11-21 07:35:17 UTC
(In reply to Pravin S from comment #2)
> Created attachment 7958 [details]
> Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0
> 
> Mike reviewed it earlier and made updates to the glibc-i18n git repository:
> https://github.com/pravins/glibc-i18n
> 
> I have updated the patch based on those improvements.
> 
> The latest report on backward compatibility is available at
> https://raw.githubusercontent.com/pravins/glibc-i18n/master/report-utf8
> 
> Note: Please find the word "Analysis" in the report; the analysis is done
> after the report is generated to make sure the changes are correct.
> 
> Mike, please review the patch and give your comments.

To check whether the new generated UTF-8 file is correct,
I ran the utf8-compatibility.py script (updated version) like this:

python3 utf8-compatibility.py -o ../glibc/localedata/charmaps/UTF-8 -n UTF-8  -u unicode7-0/UnicodeData.txt -e unicode7-0/EastAsianWidth.txt -c
Report on CHARMAP:
This character might be missing in the generated charmap:  <U9F80>..<U9FC3>
************************************************************

Report on WIDTH:
Total changed characters in newly generated WIDTH:  88827
changed width: 0x00ad : 1->0 eaw=A category=Cf bidi=BN  name=SOFT HYPHEN
...
changed width: 0xa960 : 1->2 eaw=W category=Lo bidi=L   name=HANGUL CHOSEONG TIKEUT-MIEUM
...
many such lines
...

Now I look at these lines. For example, the above-mentioned change,
where the width of a character changes from 1 to 2 and the character has
East Asian Width “W” and category “Lo”, is certainly correct.
(This character was not in the old UTF-8 file. Only characters with
width 0 and 2 are in the file; 1 is the default width, so every character
not in the UTF-8 file gets the default width 1.)
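
The default-width convention can be pictured as a table lookup with a fallback. This is only a sketch: `char_width` and the two small tables are hypothetical stand-ins for the WIDTH section of the old and new UTF-8 files, not data taken from them.

```python
DEFAULT_WIDTH = 1  # width of every character not listed in the WIDTH section


def char_width(code_point, width_table):
    """Return the listed width, or the default width 1 if not listed."""
    return width_table.get(code_point, DEFAULT_WIDTH)


# Hypothetical excerpts: only widths 0 and 2 are ever stored explicitly.
old_table = {0x4E00: 2}
new_table = {0x4E00: 2, 0xA960: 2}

print(char_width(0xA960, old_table), "->", char_width(0xA960, new_table))
# prints: 1 -> 2
```

A character that is newly listed with width 2 therefore shows up in the report as a 1->2 change, even though it never had an explicit entry before.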

As this change looks correct, I remove all lines like this from my Emacs
buffer with:

    “M-x flush-lines RET 1->2 eaw=W category=Lo”
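
For anyone not using Emacs, the same filtering step could be sketched in Python as follows. The helper name `flush_lines` is made up for this example; the two report lines are copied from the output above.

```python
import re


def flush_lines(lines, pattern):
    """Drop every line matching `pattern`, similar to Emacs flush-lines."""
    regex = re.compile(pattern)
    return [line for line in lines if not regex.search(line)]


report = [
    "changed width: 0x00ad : 1->0 eaw=A category=Cf bidi=BN  name=SOFT HYPHEN",
    "changed width: 0xa960 : 1->2 eaw=W category=Lo bidi=L   name=HANGUL CHOSEONG TIKEUT-MIEUM",
]
remaining = flush_lines(report, r"1->2 eaw=W category=Lo")
```

Repeating the call with further patterns narrows the report down to the lines that still need a manual look.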

Removing lines with obviously correct changes like this quickly
reduces the number of lines to look at, and after a while only these remain:

changed width: 0x00ad : 1->0 eaw=A category=Cf bidi=BN  name=SOFT HYPHEN
changed width: 0x3248 : 2->1 eaw=A category=No bidi=L   name=CIRCLED NUMBER TEN ON BLACK SQUARE
changed width: 0x3249 : 2->1 eaw=A category=No bidi=L   name=CIRCLED NUMBER TWENTY ON BLACK SQUARE
changed width: 0x324a : 2->1 eaw=A category=No bidi=L   name=CIRCLED NUMBER THIRTY ON BLACK SQUARE
changed width: 0x324b : 2->1 eaw=A category=No bidi=L   name=CIRCLED NUMBER FORTY ON BLACK SQUARE
changed width: 0x324c : 2->1 eaw=A category=No bidi=L   name=CIRCLED NUMBER FIFTY ON BLACK SQUARE
changed width: 0x324d : 2->1 eaw=A category=No bidi=L   name=CIRCLED NUMBER SIXTY ON BLACK SQUARE
changed width: 0x324e : 2->1 eaw=A category=No bidi=L   name=CIRCLED NUMBER SEVENTY ON BLACK SQUARE
changed width: 0x324f : 2->1 eaw=A category=No bidi=L   name=CIRCLED NUMBER EIGHTY ON BLACK SQUARE

The change for the characters with eaw=A (East Asian Width
“Ambiguous”) where the width changed from 2 to 1 is also correct, I think.
The UTF-8 file is a generic file, not especially for an East Asian locale,
so the “Ambiguous” characters should not have width 2.

Then only the soft hyphen remains which puzzles me a bit:

changed width: 0x00ad : 1->0 eaw=A category=Cf bidi=BN  name=SOFT HYPHEN

Our script gives width 0 to this character because of category=Cf.

But the display width of the soft hyphen depends on whether it is
in the middle of a line (invisible then) or happens to be at the end
of a line where it should be visible (and doesn’t it have a width greater
than zero if it is visible?).
But still giving width 0 to the soft hyphen in the UTF-8 file seems the
right thing to me.
Comment 4 Mike FABIAN 2014-11-21 12:36:35 UTC
Here is another one where I have a little bit of doubt left:

changed width: 0x1929 : 0->1 eaw=N category=Mc bidi=L   name=LIMBU SUBJOINED LETTER YA

Why is this combining character listed with width 0 in the current UTF-8 file?

In our newly generated UTF-8 file it has width 1 (because it is removed from that file).

The comment in the existing UTF-8 file in glibc says:

% Character width according to Unicode 5.0.0.
% - Default width is 1.
% - Double-width characters have width 2; generated from
%        "grep '^[^;]*;[WF]' EastAsianWidth.txt"
%   and  "grep '^[^;]*;[^WF]' EastAsianWidth.txt"
% - Non-spacing characters have width 0; generated from PropList.txt or
%   "grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
% - Format control characters have width 0; generated from
%   "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
% - Zero width characters have width 0; generated from
%   "grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt"

This does *not* mention combining characters as needing width 0;
these grep patterns do not include some combining characters.

The combining characters with category=Mn get width 0 because they
also have bidi=NSM, for example:

changed width: 0x1a1b : 1->0 eaw=N category=Mn bidi=NSM name=BUGINESE VOWEL SIGN AE

but the combining characters with category=Mc are not matched by
the above grep patterns, because they do *not* have bidi=NSM.
That seems correct, considering they have a positive advance width:

Mn 	Nonspacing_Mark  a nonspacing combining mark (zero advance width)
Mc 	Spacing_Mark 	 a spacing combining mark (positive advance width)
Me 	Enclosing_Mark 	 an enclosing combining mark

(http://www.unicode.org/reports/tr44)

But how did these get into the existing UTF-8 file in glibc?

Looks like the existing UTF-8 file in glibc was edited manually
and not just created using the grep patterns in the comment.
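
To make the effect of the grep patterns in that comment concrete, here is a minimal sketch that applies the same rules to single UnicodeData.txt lines. The function name `width_from_rules` is hypothetical, checking the width 0 rules before the W/F rule is an assumption, and the sample lines use the category and bidi values from the report lines quoted in this bug, with the remaining fields filled in by assumption.

```python
def width_from_rules(unicode_data_line, east_asian_width):
    """Apply the rules from the UTF-8 file comment to one character."""
    fields = unicode_data_line.split(";")
    name, category, bidi = fields[1], fields[2], fields[4]
    # Width 0: non-spacing (bidi NSM), format control (Cf), ZERO WIDTH names.
    if bidi == "NSM" or category == "Cf" or name.startswith("ZERO WIDTH "):
        return 0
    # Width 2: East Asian Width W (wide) or F (fullwidth).
    if east_asian_width in ("W", "F"):
        return 2
    return 1  # default width, including eaw=A (ambiguous)


samples = [
    ("00AD;SOFT HYPHEN;Cf;0;BN;;;;;N;;;;;", "A"),
    ("1929;LIMBU SUBJOINED LETTER YA;Mc;0;L;;;;;N;;;;;", "N"),
    ("1A1B;BUGINESE VOWEL SIGN AE;Mn;0;NSM;;;;;N;;;;;", "N"),
    ("A960;HANGUL CHOSEONG TIKEUT-MIEUM;Lo;0;L;;;;;N;;;;;", "W"),
]
widths = [width_from_rules(line, eaw) for line, eaw in samples]
```

Under these rules the Mc character U+1929 gets the default width 1, since none of the three width 0 patterns match it.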
Comment 5 Mike FABIAN 2014-11-21 16:49:05 UTC
localedata/Changelog entry from the patch from comment#2:

>    * scripts/utf8-gen.py: New script for generating UTF-8 CHARMAP from
>    latest UnicodeData.txt.
> 
>    * scripts/utf-compatibility.py: New script for testing backward

- The script is actually called “utf8-compatibility.py”, not
  “utf-compatibility.py”
- The patch puts the scripts “utf8-gen.py” and “utf8-compatibility.py”
  into the “localedata/” directory, not the “scripts/” directory.
Comment 6 Pravin S 2014-11-24 16:34:14 UTC
Created attachment 7969 [details]
Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0

Good catch, Mike. The latest patch is attached.
Comment 7 Mike FABIAN 2014-12-01 11:49:33 UTC
When I try to apply the latest patch

https://sourceware.org/bugzilla/attachment.cgi?id=7969

I get:

    $ git am bug-17588-13064.patch
    Applying: updated UTF-8 (charmap and width) to Unicode 7.0
    /local/mfabian/src/glibc/.git/rebase-apply/patch:20: trailing whitespace.
            * localedata/utf8-gen.py: New script for generating UTF-8 CHARMAP from 
    /local/mfabian/src/glibc/.git/rebase-apply/patch:35221: trailing whitespace.
    # Contributed by 
    error: patch failed: localedata/charmaps/UTF-8:134
    error: localedata/charmaps/UTF-8: patch does not apply
    Patch failed at 0001 updated UTF-8 (charmap and width) to Unicode 7.0
    The copy of the patch that failed is found in:
       /local/mfabian/src/glibc/.git/rebase-apply/patch
    When you have resolved this problem, run "git am --continue".
    If you prefer to skip this patch, run "git am --skip" instead.
    To restore the original branch and stop patching, run "git am --abort".

Applying  it ignoring whitespace works:

    $ git am --ignore-space-change bug-17588-13064.patch
    Applying: updated UTF-8 (charmap and width) to Unicode 7.0
    /local/mfabian/src/glibc/.git/rebase-apply/patch:20: trailing whitespace.
            * localedata/utf8-gen.py: New script for generating UTF-8 CHARMAP from 
    /local/mfabian/src/glibc/.git/rebase-apply/patch:35221: trailing whitespace.
    # Contributed by 
    warning: 2 lines add whitespace errors.

But then we get a very inconsistent use of white space, for example:


    @@ -2192,6 +2256,7 @@ CHARMAP
     <U097D>     /xe0/xa5/xbd DEVANAGARI LETTER GLOTTAL STOP
     <U097E>     /xe0/xa5/xbe DEVANAGARI LETTER DDDA
     <U097F>     /xe0/xa5/xbf DEVANAGARI LETTER BBA
    +<U0980>     /xe0/xa6/x80         BENGALI ANJI
     <U0981>     /xe0/xa6/x81 BENGALI SIGN CANDRABINDU
     <U0982>     /xe0/xa6/x82 BENGALI SIGN ANUSVARA
     <U0983>     /xe0/xa6/x83 BENGALI SIGN VISARGA

It is probably better to always use only a single space after
the UTF-8 byte sequence. That would make some lines change
only in white space, for example

<U0000>     /x00         NULL

would change to

<U0000>     /x00 NULL

but the end result looks more consistent.
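
A generator that always writes a single space after the byte sequence could look roughly like this. It is a sketch only: `charmap_line` is a hypothetical helper, the column padding is inferred from the sample lines above, and only BMP code points outside the surrogate range are handled.

```python
def charmap_line(code_point, name):
    """Format one CHARMAP line with a single space before the name."""
    symbol = "<U{:04X}>".format(code_point)
    utf8_bytes = "".join(
        "/x{:02x}".format(byte) for byte in chr(code_point).encode("utf-8")
    )
    # Pad the symbol to 11 columns, as in the sample lines, then one space
    # before the byte sequence and one space before the character name.
    return "{:<11s} {} {}".format(symbol, utf8_bytes, name)


line_null = charmap_line(0x0000, "NULL")
line_anji = charmap_line(0x0980, "BENGALI ANJI")
```

Generating every line through one formatting function avoids mixing old and new spacing styles in the same file.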
Comment 8 Pravin S 2014-12-01 11:54:46 UTC
Created attachment 7980 [details]
Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0

I agree with you, Mike.

I created the earlier patch ignoring whitespace, thinking it would be easier to review. Thank you for pointing out that applying such a patch creates inconsistency in the final UTF-8 file.

Yes, there is no reason to use multiple spaces after the UTF-8 hex field.

I have created a new patch without ignoring whitespace.
Comment 9 Mike FABIAN 2014-12-03 07:17:25 UTC
I built glibc with the patch from comment#8.

It produces some FAILs in “make check”:

    FAIL: localedata/cs_CZ.UTF-8/LC_CTYPE
    ... similar FAILs ...

Shortly after starting “make check” one sees:

    ./charmaps/UTF-8:42734: unknown character `U00009FCD'
    ... similar messages ...

All the above problems are caused by ranges of reserved code points
which are listed in EastAsianWidth.txt like this:

    9FCD..9FFF;W     # Cn    [51] <reserved-9FCD>..<reserved-9FFF>

and these code points are not in UnicodeData.txt.

Therefore, they are not generated into the CHARMAP section
of glibc’s UTF-8 file and it causes the above problems if they
are generated into the WIDTH section of glibc’s UTF-8 file.

This can be fixed by not generating reserved code points into
the WIDTH section, i.e. by ignoring the reserved code points
mentioned in EastAsianWidth.txt. Patch for utf8-gen.py:

diff --git a/utf8-gen.py b/utf8-gen.py
index 57875b6..20b68bb 100755
--- a/utf8-gen.py
+++ b/utf8-gen.py
@@ -218,6 +218,8 @@ if __name__ == "__main__":
         write_comments(outfile, 1)
         elines = []
         for line in easta_file.readlines():
+                if re.match(r'.*<reserved-.+>\.\.<reserved-.+>.*', line):
+                        continue
                 if re.match(r'^[^;]*;[WF]', line):
                         elines.append(line.strip())
         process_width(outfile, flines, elines)
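
The filtering done by this patch can be tried in isolation with a small standalone sketch. The two regular expressions are the ones from the diff; the function name `wide_lines` and the second and third sample lines are assumptions made for this example.

```python
import re

RESERVED_RANGE = re.compile(r".*<reserved-.+>\.\.<reserved-.+>.*")
WIDE_OR_FULLWIDTH = re.compile(r"^[^;]*;[WF]")


def wide_lines(east_asian_width_lines):
    """Keep W/F lines from EastAsianWidth.txt, skipping reserved ranges."""
    kept = []
    for line in east_asian_width_lines:
        if RESERVED_RANGE.match(line):
            continue  # reserved code points are not in UnicodeData.txt
        if WIDE_OR_FULLWIDTH.match(line):
            kept.append(line.strip())
    return kept


sample = [
    "9FCD..9FFF;W     # Cn    [51] <reserved-9FCD>..<reserved-9FFF>",
    "A960..A97C;W     # Lo    [29] HANGUL CHOSEONG TIKEUT-MIEUM..HANGUL CHOSEONG SSANGYEORINHIEUH",
    "00AD;A           # Cf         SOFT HYPHEN",
]
kept = wide_lines(sample)
```

Only the A960..A97C line survives, so the WIDTH section no longer refers to code points that are missing from the CHARMAP section.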
Comment 10 Pravin S 2014-12-03 11:49:19 UTC
Created attachment 7987 [details]
Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0

2014-12-01  Pravin Satpute  <psatpute@redhat.com>

        [BZ #17588 #13064]
        * charmaps/UTF-8: Updated UTF-8 CHARMAP and WIDTH to Unicode 7.0.0.

        * localedata/utf8-gen.py: New script for generating UTF-8 CHARMAP from 
        latest UnicodeData.txt.

        * localedata/utf8-compatibility.py: New script for testing backward
        compatibility of newly generated UTF-8 file.
       Reviewed and improved by Mike FABIAN <mfabian@redhat.com>

------------------------------------------------------------------------------

Yes, I was also able to reproduce the same issues while building glibc with the patch. This patch fixes those issues.
Comment 11 Pravin S 2014-12-12 11:31:37 UTC
Created attachment 8009 [details]
Patch to update UTF-8 CHARMAP and WIDTH to unicode 7.0

2014-12-12  Pravin Satpute  <psatpute@redhat.com>

        [BZ #17588 #13064]
        * charmaps/UTF-8: Updated UTF-8 CHARMAP and WIDTH to Unicode 7.0.0.

        * localedata/utf8_gen.py: New script for generating UTF-8 CHARMAP from 
        latest UnicodeData.txt.

        * localedata/utf8_compatibility.py: New script for testing backward
        compatibility of newly generated UTF-8 file.
       Reviewed and improved by Mike FABIAN <mfabian@redhat.com>

*******************************************************************************
In this patch, Mike fixed the pylint warnings raised by glibc/scripts/pylint.
Comment 12 Sourceware Commits 2015-02-20 22:36:44 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  4a4839c94a4c93ffc0d5b95c69a08b02a57007f2 (commit)
      from  e4a399dc3dbb3228eb39af230ad11bc42a018c93 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4a4839c94a4c93ffc0d5b95c69a08b02a57007f2

commit 4a4839c94a4c93ffc0d5b95c69a08b02a57007f2
Author: Alexandre Oliva <aoliva@redhat.com>
Date:   Fri Feb 20 20:14:59 2015 -0200

    Unicode 7.0.0 update; added generator scripts.
    
    for  localedata/ChangeLog
    
    	[BZ #17588]
    	[BZ #13064]
    	[BZ #14094]
    	[BZ #17998]
    	* unicode-gen/Makefile: New.
    	* unicode-gen/unicode-license.txt: New, from Unicode.
    	* unicode-gen/UnicodeData.txt: New, from Unicode.
    	* unicode-gen/DerivedCoreProperties.txt: New, from Unicode.
    	* unicode-gen/EastAsianWidth.txt: New, from Unicode.
    	* unicode-gen/gen_unicode_ctype.py: New generator, from Mike
    	FABIAN <mfabian@redhat.com>.
    	* unicode-gen/ctype_compatibility.py: New verifier, from
    	Pravin Satpute <psatpute@redhat.com> and Mike FABIAN.
    	* unicode-gen/ctype_compatibility_test_cases.py: New verifier
    	module, from Mike FABIAN.
    	* unicode-gen/utf8_gen.py: New generator, from Pravin Satpute
    	and Mike FABIAN.
    	* unicode-gen/utf8_compatibility.py: New verifier, from Pravin
    	Satpute and Mike FABIAN.
    	* charmaps/UTF-8: Update.
    	* locales/i18n: Update.
    	* gen-unicode-ctype.c: Remove.
    	* tst-ctype-de_DE.ISO-8859-1.in: Adjust, islower now returns
    	true for ordinal indicators.

-----------------------------------------------------------------------

Summary of changes:
 NEWS                                               |   11 +-
 localedata/ChangeLog                               |   27 +
 localedata/charmaps/UTF-8                          |11946 ++++++---
 localedata/gen-unicode-ctype.c                     |  784 -
 localedata/locales/i18n                            | 2652 +-
 localedata/tst-ctype-de_DE.ISO-8859-1.in           |    2 +-
 localedata/unicode-gen/DerivedCoreProperties.txt   |10794 ++++++++
 localedata/unicode-gen/EastAsianWidth.txt          | 2121 ++
 localedata/unicode-gen/Makefile                    |   99 +
 localedata/unicode-gen/UnicodeData.txt             |27268 ++++++++++++++++++++
 localedata/unicode-gen/ctype_compatibility.py      |  546 +
 .../unicode-gen/ctype_compatibility_test_cases.py  |  951 +
 localedata/unicode-gen/gen_unicode_ctype.py        |  751 +
 localedata/unicode-gen/unicode-license.txt         |   50 +
 localedata/unicode-gen/utf8_compatibility.py       |  399 +
 localedata/unicode-gen/utf8_gen.py                 |  286 +
 16 files changed, 53305 insertions(+), 5382 deletions(-)
 delete mode 100644 localedata/gen-unicode-ctype.c
 create mode 100644 localedata/unicode-gen/DerivedCoreProperties.txt
 create mode 100644 localedata/unicode-gen/EastAsianWidth.txt
 create mode 100644 localedata/unicode-gen/Makefile
 create mode 100644 localedata/unicode-gen/UnicodeData.txt
 create mode 100755 localedata/unicode-gen/ctype_compatibility.py
 create mode 100644 localedata/unicode-gen/ctype_compatibility_test_cases.py
 create mode 100755 localedata/unicode-gen/gen_unicode_ctype.py
 create mode 100644 localedata/unicode-gen/unicode-license.txt
 create mode 100755 localedata/unicode-gen/utf8_compatibility.py
 create mode 100755 localedata/unicode-gen/utf8_gen.py
Comment 13 Alexandre Oliva 2015-02-21 00:06:12 UTC
Fixed