Summary: charmaps/UTF-8: incorrect wcwidth for U+3099 and U+309A
Product: glibc
Component: localedata
Severity: normal
Reporter: Egmont Koblinger <egmont>
Assignee: Mike FABIAN <maiku.fabian>
CC: aoliva, libc-locales, maiku.fabian, mfabian, tg
Description Egmont Koblinger 2016-03-22 09:12:18 UTC
(After running setlocale() with en_US.UTF-8 or something similar) wcwidth() for U+3099 and U+309A (and presumably a few others) returns:
· 0 in glibc up to 2.21,
· 2 in glibc 2.22 & 2.23.

Quoting from Unicode 8.0:

http://unicode.org/reports/tr11/
"ED4. East Asian Wide (W): All other characters that are always wide."
"6.2 Combining Marks [...] nonspacing marks used only with wide characters are given a W."

http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt
"3099..309A;W # Mn  COMBINING [...]"

According to these, I believe the correct return value would be 0 (it's a non-spacing mark). Markus Kuhn's wcwidth (https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c) also returns 0.

This was originally reported against VTE (the terminal emulation widget behind gnome-terminal and others), where it causes incorrect rendering: https://bugzilla.gnome.org/show_bug.cgi?id=762052. The conclusion there (beginning at comment:22) was the same: it should return 0.
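The two properties that conflict here can be inspected with Python's standard unicodedata module. The following is a minimal sketch, not glibc's implementation: expected_width is a hypothetical helper that encodes only the rule argued above (a nonspacing mark gets width 0 even if EastAsianWidth.txt lists it as W) and ignores control characters and other special cases.

```python
import unicodedata

def expected_width(ch):
    """Hypothetical helper: cell width under the rule argued in this report."""
    # Nonspacing marks (general category Mn) take no cell of their own,
    # even when their East Asian Width property is W.
    if unicodedata.category(ch) == "Mn":
        return 0
    # Wide (W) and Fullwidth (F) characters take two cells.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    # Everything else is treated as one cell in this simplified sketch.
    return 1

# U+3099 and U+309A are the combining marks from this report;
# U+304B (HIRAGANA LETTER KA) is an ordinary wide base character.
for cp in (0x3099, 0x309A, 0x304B):
    ch = chr(cp)
    print("U+%04X" % cp, unicodedata.category(ch),
          unicodedata.east_asian_width(ch), expected_width(ch))
```

Both combining marks are reported as category Mn with East Asian Width W, which is exactly the combination that needs a precedence rule.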
Comment 1 Egmont Koblinger 2016-03-22 09:27:14 UTC
Forwarding the VTE maintainer's observation here: The bug was introduced in glibc commit 4a4839c94a4c93ffc0d5b95c69a08b02a57007f2. It's due to a bug in the unicode generation scripts, see https://sourceware.org/bugzilla/show_bug.cgi?id=14094#c18 where the problem was mentioned but the wrong choice was made; the script needs to be smarter.
Comment 2 Mike Frysinger 2016-04-22 05:06:27 UTC
isn't the issue fundamentally that the official unicode's data is wrong ? so once this is fixed in unicode.org, glibc will roll the fix automatically ? they have a form for it: http://unicode.org/reporting.html
Comment 3 Egmont Koblinger 2016-04-22 09:33:51 UTC
I cannot tell if it's a bug or an unfortunate design in the Unicode database, sorry. In any case, even if it's a Unicode bug, glibc used to contain a workaround for it, which was accidentally removed and should probably be restored for the time being.
Comment 4 Mike Frysinger 2016-04-22 19:14:06 UTC
i think we should get this clarified/documented before we continue to stumble blindly hoping for the best :) seems like bug 4335 is also related ...
Comment 5 Egmont Koblinger 2016-04-22 19:20:54 UTC
(In reply to Mike Frysinger from comment #4)
> seems like bug 4335 is also related ...

Not too much, I think. That one is about defining locales where ambiguous width characters take up 2 cells instead of 1. This one is about the width of combining accents themselves that are intended to be applied on top of double wide (not ambiguous but clearly double wide) characters.
Comment 6 Thorsten Glaser 2017-07-11 14:23:24 UTC
I’ve filed https://sourceware.org/bugzilla/show_bug.cgi?id=21750 noting _all_ differences from Markus Kuhn’s xterm code (updated for Unicode 10) to the current glibc localedata. For this particular problem, the fix is easy (interestingly enough, I had a similar bug in MirBSD when redoing the wcwidth code): read EastAsianWidth before, not after, UnicodeData, so the NSM bidi class overrides the width set by the former.
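The ordering described above can be sketched as a two-pass table build. This is an illustration under stated assumptions, not the code in utf8_gen.py: build_width_table and the two sample lists are hypothetical, and the sample lines are abbreviated to the leading fields of the real file formats (EastAsianWidth.txt: range;property, UnicodeData.txt: code;name;category;combining class;bidi class).

```python
# Abbreviated sample lines in the style of the Unicode data files
# (illustrative only; the real files have more fields and many more entries).
EAST_ASIAN_WIDTH = [
    "3041..3096;W",
    "3099..309A;W",
]
UNICODE_DATA = [
    "304B;HIRAGANA LETTER KA;Lo;0;L",
    "3099;COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK;Mn;8;NSM",
    "309A;COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK;Mn;8;NSM",
]

def build_width_table(east_asian_lines, unicode_data_lines):
    """Hypothetical sketch: later passes overwrite earlier ones."""
    widths = {}
    # Pass 1: East Asian Width. Wide and Fullwidth ranges get width 2.
    for line in east_asian_lines:
        code_range, prop = line.split(";")[:2]
        if prop in ("W", "F"):
            start, _, end = code_range.partition("..")
            first = int(start, 16)
            last = int(end, 16) if end else first
            for cp in range(first, last + 1):
                widths[cp] = 2
    # Pass 2: UnicodeData. Because it runs second, the NSM bidi class
    # overrides whatever width pass 1 assigned.
    for line in unicode_data_lines:
        fields = line.split(";")
        if fields[4] == "NSM":
            widths[int(fields[0], 16)] = 0
    return widths

table = build_width_table(EAST_ASIAN_WIDTH, UNICODE_DATA)
for cp in (0x304B, 0x3099, 0x309A):
    print("U+%04X width %d" % (cp, table[cp]))
```

Swapping the two passes reproduces the reported behaviour: the W entry for 3099..309A would be written last and the marks would end up with width 2.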
Comment 7 firstname.lastname@example.org 2017-08-17 09:07:13 UTC
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, master has been updated
  via bb6274ee1293a6bc76d9d7c889783303de181295 (commit)
  via c14b84baae83bfb73f7cd00ba7c24964ad1c712c (commit)
  via 7a79e321c6f85b204036c33d85f6b2aa794e7c76 (commit)
  via 267ee5d7ab57591a6b1bc2d2a010c88188427063 (commit)
  via 41b6f0ce85d98c62739b04863e8c38a1f4154e80 (commit)
  via 580be3035d2e0f479c4ac955bf719b0bf936f5cf (commit)
  from 038d1cafafb3094a9fbebd35f4aa8d0ebae0e55b (commit)

Those revisions listed above that are new to this repository have not appeared on any other notification email; so we list those revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bb6274ee1293a6bc76d9d7c889783303de181295
commit bb6274ee1293a6bc76d9d7c889783303de181295
Author: Akhilesh Kumar <email@example.com>
Date: Wed Aug 16 15:33:58 2017 +0530

    Fix abmon for bem_ZM

    Until now the abbreviated month names were in English.

    [BZ #21960]
    * locales/bem_ZM (LC_TIME): Fix abmon, make it agree with CLDR.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c14b84baae83bfb73f7cd00ba7c24964ad1c712c
commit c14b84baae83bfb73f7cd00ba7c24964ad1c712c
Author: Akhilesh Kumar <firstname.lastname@example.org>
Date: Wed Aug 16 18:01:53 2017 +0530

    Fix country name for xh_ZA

    [BZ #21959]
    * locales/xh_ZA (LC_ADDRESS): Fix country name.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7a79e321c6f85b204036c33d85f6b2aa794e7c76
commit 7a79e321c6f85b204036c33d85f6b2aa794e7c76
Author: Thorsten Glaser <email@example.com>
Date: Fri Jul 14 14:02:50 2017 +0200

    Refresh generated charmap data and ChangeLog

    [BZ #21750]
    * charmaps/UTF-8: Refresh.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=267ee5d7ab57591a6b1bc2d2a010c88188427063
commit 267ee5d7ab57591a6b1bc2d2a010c88188427063
Author: Thorsten Glaser <firstname.lastname@example.org>
Date: Fri Jul 14 14:02:46 2017 +0200

    Resolve some historically special cases of ambiguous width

    [BZ #21750]
    * unicode-gen/utf8_gen.py (U+00AD): Set width to 1.
    * unicode-gen/utf8_gen.py (U+1160..U+11FF): Set width to 0.
    * unicode-gen/utf8_gen.py (U+3248..U+324F): Set width to 2.
    * unicode-gen/utf8_gen.py (U+4DC0..U+4DFF): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=41b6f0ce85d98c62739b04863e8c38a1f4154e80
commit 41b6f0ce85d98c62739b04863e8c38a1f4154e80
Author: Thorsten Glaser <email@example.com>
Date: Fri Jul 14 14:02:44 2017 +0200

    Handle more cases of combining characters

    [BZ #21750]
    * unicode-gen/utf8_gen.py: Treat category Me and Mn as combining.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=580be3035d2e0f479c4ac955bf719b0bf936f5cf
commit 580be3035d2e0f479c4ac955bf719b0bf936f5cf
Author: Thorsten Glaser <firstname.lastname@example.org>
Date: Fri Jul 14 14:02:37 2017 +0200

    UnicodeData has precedence over EastAsianWidth

    [BZ #19852]
    [BZ #21750]
    * unicode-gen/utf8_gen.py: Process EastAsianWidth lines before
    UnicodeData lines so the latter have precedence; remove hack to
    group output by EastAsianWidth ranges.

-----------------------------------------------------------------------

Summary of changes:
 localedata/ChangeLog               |    24 +
 localedata/charmaps/UTF-8          |111468 +++++++++++++++++++++++++++++++++++-
 localedata/locales/bem_ZM          |    25 +-
 localedata/locales/xh_ZA           |     5 +-
 localedata/unicode-gen/utf8_gen.py |    38 +-
 5 files changed, 111400 insertions(+), 160 deletions(-)
Comment 8 Mike FABIAN 2017-08-17 13:52:23 UTC
FIXED thanks to Thorsten Glaser.