Bug 19852

Summary: charmaps/UTF-8: incorrect wcwidth for U+3099 and U+309A
Product: glibc Reporter: Egmont Koblinger <egmont>
Component: localedata Assignee: Mike FABIAN <maiku.fabian>
Status: RESOLVED FIXED    
Severity: normal CC: aoliva, libc-locales, maiku.fabian, mfabian, tg
Priority: P2 Flags: fweimer: security-
Version: 2.23   
Target Milestone: 2.27   
See Also: https://sourceware.org/bugzilla/show_bug.cgi?id=14094
https://sourceware.org/bugzilla/show_bug.cgi?id=19919
https://sourceware.org/bugzilla/show_bug.cgi?id=4335
Host: Target:
Build: Last reconfirmed:

Description Egmont Koblinger 2016-03-22 09:12:18 UTC
(After running setlocale() with en_US.UTF-8 or something similar)

wcwidth() for U+3099 and U+309A (and presumably a few others) returns:

· 0 in glibc up to 2.21,
· 2 in glibc 2.22 & 2.23.

Quoting from Unicode 8.0:

http://unicode.org/reports/tr11/

"ED4. East Asian Wide (W): All other characters that are always wide."

"6.2 Combining Marks [...] nonspacing marks used only with wide characters are given a W."

http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt

"3099..309A;W     # Mn     [2] COMBINING [...]"

According to these, I believe the correct return value would be 0 (it's a non-spacing mark).
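
The two properties in question can be checked with a short Python sketch. It uses the standard `unicodedata` module, whose tables come from whatever Unicode version the Python build ships and not from glibc, so it only illustrates the database properties, not what wcwidth() returns. The helper name `describe` is made up for this example.

```python
import unicodedata


def describe(cp):
    """Return (name, general category, East Asian Width) for a code point."""
    ch = chr(cp)
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.east_asian_width(ch),
    )


for cp in (0x3099, 0x309A):
    name, category, eaw = describe(cp)
    # Both marks are non-spacing (Mn) and also East Asian Wide (W).
    print(f"U+{cp:04X} {name}: category={category}, east_asian_width={eaw}")
```

Both properties hold at once: the mark is listed as W because it is used with wide characters, and it is a non-spacing mark, which is why a width of 0 is expected.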

Markus Kuhn's wcwidth (https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c) also returns 0.

This was originally reported against VTE (the terminal emulation widget behind gnome-terminal and others), where it causes incorrect rendering: https://bugzilla.gnome.org/show_bug.cgi?id=762052. The conclusion there (beginning at comment 22) was the same: wcwidth() should return 0.
Comment 1 Egmont Koblinger 2016-03-22 09:27:14 UTC
Forwarding VTE maintainer's observation here:

The bug was introduced in glibc commit 4a4839c94a4c93ffc0d5b95c69a08b02a57007f2. It is caused by a bug in the Unicode generation scripts; see https://sourceware.org/bugzilla/show_bug.cgi?id=14094#c18, where the problem was mentioned but the wrong choice was made. The script needs to be smarter.
Comment 2 Mike Frysinger 2016-04-22 05:06:27 UTC
Isn't the issue fundamentally that the official Unicode data is wrong? If so, once this is fixed at unicode.org, glibc will pick up the fix automatically?

they have a form for it:
  http://unicode.org/reporting.html
Comment 3 Egmont Koblinger 2016-04-22 09:33:51 UTC
I cannot tell whether it's a bug or an unfortunate design in the Unicode database, sorry.

At least, even if it's a Unicode bug, glibc used to contain a workaround for this bug which was accidentally removed and probably should be restored for the time being.
Comment 4 Mike Frysinger 2016-04-22 19:14:06 UTC
i think we should get this clarified/documented before we continue to stumble blindly hoping for the best :)

seems like bug 4335 is also related ...
Comment 5 Egmont Koblinger 2016-04-22 19:20:54 UTC
(In reply to Mike Frysinger from comment #4)

> seems like bug 4335 is also related ...

Not too much, I think.

That one is about defining locales where ambiguous width characters take up 2 cells instead of 1.

This one is about the width of combining accents themselves that are intended to be applied on top of double wide (not ambiguous but clearly double wide) characters.
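
To make the distinction concrete, here is a small Python sketch comparing an ambiguous-width character with a wide base character and one of the combining marks from this bug. The sample characters and the `classify` helper are picked for illustration only, and the values come from Python's bundled `unicodedata` tables, not from glibc.

```python
import unicodedata

# Illustrative samples: one ambiguous-width character (the topic of bug 4335)
# and two unambiguously wide ones (the topic of this bug).
SAMPLES = {
    "ambiguous": "\u00a1",       # INVERTED EXCLAMATION MARK
    "wide base": "\u3042",       # HIRAGANA LETTER A
    "wide combining": "\u3099",  # COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
}


def classify(ch):
    """Return (East Asian Width, general category) for one character."""
    return unicodedata.east_asian_width(ch), unicodedata.category(ch)


for label, ch in SAMPLES.items():
    eaw, category = classify(ch)
    print(f"{label}: U+{ord(ch):04X} east_asian_width={eaw} category={category}")
```

Only the first sample has East Asian Width A (ambiguous); the combining mark is W like its base characters, but its Mn category is what should give it a width of 0.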
Comment 6 Thorsten Glaser 2017-07-11 14:23:24 UTC
I’ve filed https://sourceware.org/bugzilla/show_bug.cgi?id=21750 noting _all_ differences from Markus Kuhn’s xterm code (updated for Unicode 10) to the current glibc localedata.

For this particular problem, the fix is easy (interestingly enough, I had a similar bug in MirBSD when redoing the wcwidth code): read EastAsianWidth before, not after, UnicodeData, so the NSM bidi class overrides the width set by the former.
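
The ordering described above can be sketched in Python as follows. This is not the code from utf8_gen.py: the function names and the tiny inline samples are hypothetical, and the UnicodeData records are abbreviated to their first five fields. It assumes, as in the comment above, that the NSM bidi class is what marks a character as zero width.

```python
# Abbreviated, illustrative excerpts of the two input files (assumption:
# only the fields used here are kept).
EAW_SAMPLE = ["3042;W", "3099..309A;W"]
UNICODEDATA_SAMPLE = [
    "3042;HIRAGANA LETTER A;Lo;0;L",
    "3099;COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK;Mn;8;NSM",
    "309A;COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK;Mn;8;NSM",
]


def parse_range(field):
    """Turn '3099..309A' or '3042' into a range of code points."""
    start, _, end = field.partition("..")
    first = int(start, 16)
    last = int(end, 16) if end else first
    return range(first, last + 1)


def build_widths(eaw_lines, unicodedata_lines):
    """Return {code point: width}, letting UnicodeData override EastAsianWidth."""
    widths = {}
    # Pass 1: EastAsianWidth. Wide (W) and Fullwidth (F) get width 2.
    for line in eaw_lines:
        code_field, eaw = line.split(";")[:2]
        if eaw.strip() in ("W", "F"):
            for cp in parse_range(code_field):
                widths[cp] = 2
    # Pass 2: UnicodeData. Runs last, so a non-spacing mark ends up with
    # width 0 even if pass 1 marked it as wide.
    for line in unicodedata_lines:
        fields = line.split(";")
        if fields[4] == "NSM":
            widths[int(fields[0], 16)] = 0
    return widths


widths = build_widths(EAW_SAMPLE, UNICODEDATA_SAMPLE)
for cp in sorted(widths):
    print(f"U+{cp:04X} -> {widths[cp]}")
```

Swapping the two passes reproduces the bug: the W entry for 3099..309A would then overwrite the 0 set from UnicodeData.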
Comment 7 Sourceware Commits 2017-08-17 09:07:13 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  bb6274ee1293a6bc76d9d7c889783303de181295 (commit)
       via  c14b84baae83bfb73f7cd00ba7c24964ad1c712c (commit)
       via  7a79e321c6f85b204036c33d85f6b2aa794e7c76 (commit)
       via  267ee5d7ab57591a6b1bc2d2a010c88188427063 (commit)
       via  41b6f0ce85d98c62739b04863e8c38a1f4154e80 (commit)
       via  580be3035d2e0f479c4ac955bf719b0bf936f5cf (commit)
      from  038d1cafafb3094a9fbebd35f4aa8d0ebae0e55b (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bb6274ee1293a6bc76d9d7c889783303de181295

commit bb6274ee1293a6bc76d9d7c889783303de181295
Author: Akhilesh Kumar <akhilesh.k@samsung.com>
Date:   Wed Aug 16 15:33:58 2017 +0530

    Fix abmon for bem_ZM
    
    Until now the abbreviated month names were in English.
    
    	[BZ #21960]
    	* locales/bem_ZM (LC_TIME): Fix abmon, make it agree with CLDR.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c14b84baae83bfb73f7cd00ba7c24964ad1c712c

commit c14b84baae83bfb73f7cd00ba7c24964ad1c712c
Author: Akhilesh Kumar <akhilesh.k@samsung.com>
Date:   Wed Aug 16 18:01:53 2017 +0530

    Fix country name for xh_ZA
    
    	[BZ #21959]
    	* locales/xh_ZA (LC_ADDRESS): Fix country name.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7a79e321c6f85b204036c33d85f6b2aa794e7c76

commit 7a79e321c6f85b204036c33d85f6b2aa794e7c76
Author: Thorsten Glaser <tg@mirbsd.de>
Date:   Fri Jul 14 14:02:50 2017 +0200

    Refresh generated charmap data and ChangeLog
    
    	[BZ #21750]
    	* charmaps/UTF-8: Refresh.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=267ee5d7ab57591a6b1bc2d2a010c88188427063

commit 267ee5d7ab57591a6b1bc2d2a010c88188427063
Author: Thorsten Glaser <tg@mirbsd.de>
Date:   Fri Jul 14 14:02:46 2017 +0200

    Resolve some historically special cases of ambiguous width
    
    [BZ #21750]
    * unicode-gen/utf8_gen.py (U+00AD): Set width to 1.
    * unicode-gen/utf8_gen.py (U+1160..U+11FF): Set width to 0.
    * unicode-gen/utf8_gen.py (U+3248..U+324F): Set width to 2.
    * unicode-gen/utf8_gen.py (U+4DC0..U+4DFF): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=41b6f0ce85d98c62739b04863e8c38a1f4154e80

commit 41b6f0ce85d98c62739b04863e8c38a1f4154e80
Author: Thorsten Glaser <tg@mirbsd.de>
Date:   Fri Jul 14 14:02:44 2017 +0200

    Handle more cases of combining characters
    
    [BZ #21750]
    * unicode-gen/utf8_gen.py: Treat category Me and Mn as combining.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=580be3035d2e0f479c4ac955bf719b0bf936f5cf

commit 580be3035d2e0f479c4ac955bf719b0bf936f5cf
Author: Thorsten Glaser <tg@mirbsd.de>
Date:   Fri Jul 14 14:02:37 2017 +0200

    UnicodeData has precedence over EastAsianWidth
    
    [BZ #19852]
    [BZ #21750]
    * unicode-gen/utf8_gen.py: Process EastAsianWidth lines before
      UnicodeData lines so the latter have precedence; remove hack
      to group output by EastAsianWidth ranges.

-----------------------------------------------------------------------

Summary of changes:
 localedata/ChangeLog               |   24 +
 localedata/charmaps/UTF-8          |111468 +++++++++++++++++++++++++++++++++++-
 localedata/locales/bem_ZM          |   25 +-
 localedata/locales/xh_ZA           |    5 +-
 localedata/unicode-gen/utf8_gen.py |   38 +-
 5 files changed, 111400 insertions(+), 160 deletions(-)
Comment 8 Mike FABIAN 2017-08-17 13:52:23 UTC
FIXED thanks to Thorsten Glaser.