21750 – column width of characters incompatible with classical wcwidth

Status:	RESOLVED FIXED

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	localedata
Version:	2.26

Importance:	P2 normal
Target Milestone:	2.27
Assignee:	Mike FABIAN

URL:
Keywords:

Depends on:
Blocks:	22073 22074

Reported:	2017-07-11 14:18 UTC by Thorsten Glaser
Modified:	2017-09-14 16:38 UTC
CC List:	3 users

See Also:
Host:
Target:
Build:
Last reconfirmed:	2017-08-16 00:00:00

Flags:	fweimer: security-

Attachments
tarball of “git am”able patches (14.17 KB, patch) 2017-07-14 12:04 UTC, Thorsten Glaser

Description Thorsten Glaser 2017-07-11 14:18:39 UTC

I’ve compared the new autogenerated column width from localedata/unicode-gen/utf8_gen.py with the results of the classical wcwidth() implementation from xterm (adjusted to Unicode 10.0.0) and found a few divergences. I also found bugs on my own side (MirBSD, which uses something based on xterm’s data system-wide), which I have fixed.

1. U+00AD is forced to width 1 in xterm, autodetected as combining in glibc

Rationale for forcing it to 1 is likely that U+0000‥U+00FF are latin1, which, when displayed as 8bit on terminals, had no combining characters at all.

Change Request to glibc: force U+00AD to width 1.

2. The UCD has three codepoints that are Me/Mn category but not NSM bidi class: U+0CBF U+0CC6 U-00011C3F

This is likely a bug in UCD but can be fixed by glibc treating Me/Mn the same as Cf/NSM, which I do.

Change Request to glibc: handle Me/Mn category the same as NSM bidi class.

3. Hangul Jamo medial vowels and final consonants are set to 0 by xterm so they combine on top of the preceding initial ones: U+1160‥U+11FF

Change Request to glibc: force U+1160‥U+11FF to width 0.

4. During parsing, EastAsianWidth data overrides UCD data, more specifically the NSM property.

This leads to U+302A‥U+302D and – see also https://sourceware.org/bugzilla/show_bug.cgi?id=19852 – U+3099 and U+309A being treated as width 2.

Change Request to glibc: read EAW before UCD so the NSM overrides EAW here.

5. Ambiguous circled numbers and neutral hexagrams changed width

xterm used to set those to width 2, likely because they are ideographs and not unlike zodiac signs and emoji (which, I notice, have been set to width 2 in UCD nowadays)

Change Request to glibc: force U+3248‥U+324F and U+4DC0‥U+4DFF to width 2.

Note: I’ve initially reported the surprising change to Debian as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=826256 but have redone the research today (against 2.24 in Debian and git master commit 2a91300176a5991d9825eba085e502196a3f47cd in glibc) against Unicode 10, double-checked *all* differences against MirBSD code and fixed a few bugs there after making it possible to compare the results (considering glibc only puts actually assigned codepoints into the localedata/charmaps/UTF-8 file).

Rationale for requesting the change in glibc is so that all systems I have access to use the same width data, preventing display artifacts and glitches up to making an editor somewhat unusable with heavy Unicode (I have test files containing the entire Unicode range). Thank you for listening.

If necessary, I will provide patches (to utf8_gen.py most likely) when asked.
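
The five change requests above can be read as an ordering of rules. The following is a minimal sketch of that ordering, not the code in utf8_gen.py: it assumes Python’s `unicodedata` module as a stand-in for parsing UnicodeData.txt and EastAsianWidth.txt, and the function name `requested_width` is hypothetical. Results for recently added codepoints depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def requested_width(codepoint):
    """Column width for one codepoint under the rules requested in this report.

    Hypothetical helper for illustration only; control characters and
    unassigned codepoints are not modelled.
    """
    char = chr(codepoint)
    # Requests 1, 3 and 5: historical special cases override everything else.
    if codepoint == 0x00AD:
        return 1
    if 0x1160 <= codepoint <= 0x11FF:
        return 0
    if 0x3248 <= codepoint <= 0x324F or 0x4DC0 <= codepoint <= 0x4DFF:
        return 2
    # Requests 2 and 4: Cf/Me/Mn category or NSM bidi class wins over
    # East Asian Width, so U+3099 ends up with width 0 instead of 2.
    if (unicodedata.category(char) in ("Cf", "Me", "Mn")
            or unicodedata.bidirectional(char) == "NSM"):
        return 0
    if unicodedata.east_asian_width(char) in ("W", "F"):
        return 2
    return 1
```

Checking the combining test before the East Asian Width test is what request 4 amounts to: a codepoint that is both NSM and wide is reported as width 0.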

Comment 1 Troy Korjuslommi 2017-07-12 11:01:43 UTC

Excuse my ignorance, but isn't U+00AD (soft hyphen) usually invisible,
i.e. zero columns? If an app breaks up words at end-of-line, it can use
the soft hyphens as helpers to detect the correct locations. The app can
then add a visible hyphen to the end of the line. (If the app also reads
from the terminal, then it can e.g. ignore visible hyphens when preceded
by a soft hyphen, or use some other mechanism to mark the character as
for terminal display only).

I am not suggesting a change, if xterm etc. multitude of apps are
already handling soft hyphens in some other manner, just wondering.

Troy
 

  

Comment 2 Thorsten Glaser 2017-07-12 13:38:57 UTC

(In reply to Troy Korjuslommi from comment #1)
> Excuse my ignorance, but isn't U+00AD (soft hyphen) usually invisible,
> i.e. zero columns? If an app breaks up words at end-of-line, it can use
> the soft hyphens as helpers to detect the correct locations. The app can

Yes, in theory. This codepoint could be used in the *input data* to
determine soft breaks. However (see below) they should *not* output
those to a terminal emulator (GUIs that handle this themselves are
likely fine).

> I am not suggesting a change, if xterm etc. multitude of apps are
> already handling soft hyphens in some other manner, just wondering.

However, as with U+0060 (the grave accent 「`」), terminal emulators
have treated every non-control character of both ASCII (for U+0060)
and 8-bit codepages like ISO 8859-1 (for U+00AD) as having a constant
width of 1 (for SBCS), and xterm’s wcwidth() code had special handling
to force U+00AD to 1:

/*
 […]
 *    - SOFT HYPHEN (U+00AD) has a column width of 1.
 […]
 */
[…]
  /* generated by "uniset +cat=Me +cat=Mn +cat=Cf -00AD +1160-11FF +200B c" */

Source: http://www.mirbsd.org/cvs.cgi/X11/xc/programs/xterm/wcwidth.c?rev=1.1.103.1;content-type=text%2Fplain


So you’d want to output U+0060 U+0008 U+0061 (` + backspace + a) to get à on a (printed) terminal (or in code that emulates one). Similarly, strip soft hyphens from the output (or manifest them as regular ones) before outputting soft-wrapped text: the terminal emulator will not soft-wrap either, it will break at the end of the line. So when preparing something for terminal output, you’d convert U+00AD to some kind of hyphen (hyphen-minus or U+2010 perhaps) followed by a line break(⚠).
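
The conversion described above could look roughly like the following sketch. It is an illustration under simplifying assumptions, not code from xterm or glibc: the function name `wrap_at_soft_hyphens` is hypothetical, every remaining character is assumed to occupy one cell, the hyphen used is hyphen-minus, and soft hyphens are the only break opportunities considered (spaces are treated like any other character).

```python
SOFT_HYPHEN = "\u00ad"


def wrap_at_soft_hyphens(text, columns):
    """Break `text` at soft hyphens so that lines fit into `columns` cells.

    Soft hyphens that are used become "-" plus a line break; unused ones
    are dropped, so no U+00AD ever reaches the terminal.
    """
    fragments = text.split(SOFT_HYPHEN)
    lines = []
    current = ""
    for index, fragment in enumerate(fragments):
        is_last = index == len(fragments) - 1
        # Keep one cell free for a hyphen unless this is the final fragment.
        limit = columns if is_last else columns - 1
        if current and len(current) + len(fragment) > limit:
            lines.append(current + "-")
            current = fragment
        else:
            current += fragment
    lines.append(current)
    return "\n".join(lines)
```

A fragment that is longer than the line on its own is passed through unchanged; the terminal would then hard-wrap it at the right margin.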


I noticed the incompatibilities especially when the hexagrams, one of which I use for UI purposes, changed width. I then tried to discover all of them in order to harmonise the width assumptions made by the various programs on all systems I use, with classical xterm wcwidth.c as the base, since those widths are the domain of a fixed-cell terminal emulator more than of anything else (which can use its own data, if necessary).

I do volunteer to provide patches, here and elsewhere, so that, with the same UCD version as input, we get consistent output (and I’ve sanity-checked the output I got before opening this report).

Comment 3 Thorsten Glaser 2017-07-14 12:04:02 UTC

Created attachment 10257 [details]
tarball of “git am”able patches

I’ve done the patches and compared the output, which Looks Good To Me™. Please apply.

Comment 4 Mike FABIAN 2017-08-16 13:50:04 UTC

(In reply to Thorsten Glaser from comment #3)
> Created attachment 10257 [details]
> tarball of “git am”able patches
> 
> I’ve done the patches and compared the output, which Looks Good To Me™.
> Please apply.

The following is a chatlog of the discussion I had with Thorsten about these patches.
I also think the patches are OK, but I wanted to make sure because they
make the width data in glibc deviate a bit from the width data in Unicode.

So here is the chatlog:

<mfabian> mira|AO: Hello! The patch in
          https://sourceware.org/bugzilla/show_bug.cgi?id=21750#c3 is yours,
          isn’t it? [17年08月15日 15:50:08]
<mira|AO> mfabian: yes [17年08月15日 16:13:39]
<mira|AO> I swapped out charmaps/UTF-8.gz locally on my Debian system,
          regenerated the locales, everything works as it should ☺
                                                          [17年08月15日 16:15:15]
<mfabian> mira|AO: Yes, the patch looks good! [17年08月15日 16:23:29]
<mfabian> I am just not quite sure about the 4 exceptions that deliberately
          deviate from the data in UnicodeData.txt and EastAsianWidth.txt,
          namely: [17年08月15日 16:24:21]
<mfabian> +       * unicode-gen/utf8_gen.py (U+00AD): Set width to 1.
                                                          [17年08月15日 16:24:24]
<mfabian> +       * unicode-gen/utf8_gen.py (U+1160..U+11FF): Set width to 0.
<mfabian> +       * unicode-gen/utf8_gen.py (U+3248..U+324F): Set width to 2.
<mfabian> +       * unicode-gen/utf8_gen.py (U+4DC0..U+4DFF): Likewise.
<mfabian> 
<mfabian> The first two are like that because that is how it was in Markus
          Kuhn’s wcwidth ... [17年08月15日 16:24:46]
<mfabian> https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c [17年08月15日 16:24:56]
<mfabian> But then why are U+1160 ... U+11FF explicitly listed as "N" in
          EastAsianWidth.txt? [17年08月15日 16:26:02]
<mfabian> Setting U+3248..U+324F to width 2 feels right to me, because the
          glyphs for them really are square in most of the fonts I have.
                                                          [17年08月15日 16:26:54]
<mfabian> EastAsianWidth.txt has “3248..324F;A     # No     [8] CIRCLED NUMBER
          TEN ON BLACK SQUARE..CIRCLED NUMBER EIGHTY ON BLACK SQUARE”, i.e.
          “ambiguous”, meaning normally width 1 but width 2 in CJK legacy
          locales ... [17年08月15日 16:28:09]
<mfabian> But if they really are square in most fonts, setting the width to 2
          seems OK to me. [17年08月15日 16:28:34]
<mfabian> The hexagrams “4DC0..4DFF;N     # So    [64] HEXAGRAM FOR THE
          CREATIVE HEAVEN..HEXAGRAM FOR BEFORE COMPLETION” seem stranger to
          me; the “Deja Vu Sans” font, for example, contains them, and there
          their width is somewhere between single width and double width.
                                                          [17年08月15日 16:29:45]
<mfabian> So I definitely like
          0001-UnicodeData-has-precedence-over-EastAsianWidth.patch, and
          0002-Handle-more-cases-of-combining-characters.patch as well.
                                                          [17年08月15日 16:30:26]
<mira|AO> mfabian: all four are like that [17年08月15日 16:30:35]
<mfabian> But I am unsure about
          0003-Resolve-some-historically-special-cases-of-ambiguous.patch
          ... [17年08月15日 16:30:40]
<mira|AO> the hexagrams simply do not work at all in narrow (look at 9x18,
          that is what I base my font work on) [17年08月15日 16:31:11]
<mira|AO> 1160‥11FF, hang on, I am just checking [17年08月15日 16:31:25]
<mfabian> So we would deviate from the Unicode Standard there for “historical”
          reasons? [17年08月15日 16:31:26]
<mira|AO> oh, I see [17年08月15日 16:31:31]
<mfabian> Or is the Unicode Standard wrong there? [17年08月15日 16:31:38]
<mira|AO> that is because those are components of Korean characters
                                                          [17年08月15日 16:31:40]
<mira|AO> namely [17年08月15日 16:31:44]
<mfabian> Should we report that as a bug against the Unicode Standard?
                                                          [17年08月15日 16:31:56]
<mira|AO> you always have a choseong (1100‥115F), a jungseong and a
          jongseong, and together they make up one character
                                                          [17年08月15日 16:32:20]
<mira|AO> i.e. they combine with each other [17年08月15日 16:32:24]
<mira|AO> individually they are all Korean characters (or fragments of them),
          but mgk has 1160‥11FF as combining because that is how they are
          supposed to be rendered when they directly follow one another
                                                          [17年08月15日 16:32:52]
<mfabian> Right, but why are they listed as "N" in EastAsianWidth?
                                                          [17年08月15日 16:33:00]
<mira|AO> 16:31⎜«mfabian» Should we report that as a bug against the Unicode
          Standard?    ← no, Unicode properties and fixed-width terminal
          column size do not coincide [17年08月15日 16:33:20]
<mira|AO> hm, hang on [17年08月15日 16:33:31]
<mfabian> But EastAsianWidth.txt is actually meant for exactly that, for
          terminals. [17年08月15日 16:33:55]
<mira|AO> no, that is just it, it is not [17年08月15日 16:34:02]
<mira|AO> it is meant for weird old Asian DBCS and their mapping
                                                          [17年08月15日 16:34:11]
<mira|AO> otherwise there would be no ambiguous, for example
                                                          [17年08月15日 16:34:19]
<mira|AO> so yes, it is _one_ data source, but for wcwidth it is not the sole
          source of truth [17年08月15日 16:34:35]
<mira|AO> right, hang on, let me summarise [17年08月15日 16:35:01]
<mira|AO> 00AD: makes no sense on terminals, they break at the end of the
          line, not at soft hyphens; if a text editor uses it internally,
          fine, but otherwise the first 256 characters are latin1, and latin1
          always has wcwidth=1 for non-control characters, so an explicit
          deviation Unicode <-> wcwidth [17年08月15日 16:35:54]
<mfabian> Yes, the "A" characters are actually width 2 in “weird old Asian
          DBCS”, otherwise width 1. [17年08月15日 16:35:56]
<mira|AO> 1160-11FF: see above [17年08月15日 16:36:00]
<mira|AO> you can put 1160-11FF to Unicode as a query, if you like
                                                          [17年08月15日 16:36:11]
<mira|AO> 3248-324F: probably a Unicode bug [17年08月15日 16:36:27]
<mira|AO> 4DC0-4DFF: I would suggest to Unicode that they change it; as of 9
          (IIRC) they have finally given the emoji width 2 as well; in
          halfwidth they make no sense rendering-wise, nearly all pixel fonts
          are too small for that [17年08月15日 16:37:04]
<mira|AO> and by their very nature they are Asian characters, too
                                                          [17年08月15日 16:37:16]
<mira|AO> I only noticed because I use U+3000 (as a space) and ䷀ in my font
          editor, since they are exactly square [17年08月15日 16:37:32]
<mfabian> Right, the emoji really make no sense in width 1.
                                                          [17年08月15日 16:37:33]
<mira|AO> yep, plus e.g. KDE konsole rendered them in 2 anyway, which led to
          great effects when mixing ASCII/emoji on one line…
                                                          [17年08月15日 16:38:07]
<mira|AO> so, I would change all four in glibc as “wcwidth, historical
          reasons”, and put the bottom three to Unicode
                                                          [17年08月15日 16:38:32]
<mira|AO> back then I went to the trouble of working out what mgk had changed
          relative to _his_ Unicode base version, and of documenting that (in
          the code) [17年08月15日 16:39:03]
<mira|AO> I also asked him once whether he wanted to update it, but he said
          he saw no point in that, because then it would no longer be uniform
                                                          [17年08月15日 16:39:44]
<mfabian> I am currently seeing ䷀ U+4DC0 here in IRC as single width ("DejaVu
          Sans Mono" font) [17年08月15日 16:39:53]
<mira|AO> but the rendering breaks then, and so I took this on
                                                          [17年08月15日 16:39:56]
<mira|AO> I have Fixed Misc 9x18 / 18x18, hence [17年08月15日 16:40:05]
<mira|AO> but take e.g. the xterm default font 6x13
                                                          [17年08月15日 16:40:22]
<mira|AO> there you simply do not have enough pixels for some of the hexagrams
                                                          [17年08月15日 16:40:35]
<mfabian> I do not know whether the old pixel fonts are still relevant today ...
                                                          [17年08月15日 16:40:52]
<mira|AO> of course they are [17年08月15日 16:41:03]
<mira|AO> that is what wcwidth is designed for, after all
                                                          [17年08月15日 16:41:09]
<mira|AO> and in terminals they are what counts [17年08月15日 16:41:16]
<mfabian> Well, I have been using only scalable fonts for years now, in
          terminals too. [17年08月15日 16:41:37]
<mira|AO> only with those do you get stuff like line-drawing characters ┌─┐
          and so on right [17年08月15日 16:41:38]
<mira|AO> hm, but many people do not [17年08月15日 16:41:47]
<mira|AO> I do a lot with UTF-8 text files [17年08月15日 16:41:53]
<mira|AO> for a long time only xterm could display them correctly; lately
          others have partly caught up, but xterm is the measure of all
          things there [17年08月15日 16:42:14]
<mira|AO> FixedMisc has >40k characters [17年08月15日 16:42:24]
<mira|AO> that helps [17年08月15日 16:42:26]
<mfabian> And for DejaVu Sans Mono, double width for the hexagrams would then
          be rather wrong. But I also do not know how relevant that font
          is ... [17年08月15日 16:42:29]
<mira|AO> plus you can use it in e.g. grub, syslinux, …
                                                          [17年08月15日 16:42:34]
<mira|AO> I believe they also work in fullwidth (not double width) in dejavu
          (that is Bitstream, right?) [17年08月15日 16:42:57]
<mira|AO> just quickly started an xterm with it, visually it would work
                                                          [17年08月15日 16:43:48]
<mira|AO> in FixedMisc, too, they are somewhat less tall than a capital
          letter, but square [17年08月15日 16:44:01]
<mira|AO> as one expects of Asian characters (for easier legibility; many
          Asians have real trouble with proportional type, I have been told)
                                                          [17年08月15日 16:44:26]
<mira|AO> well, and last but not least there is still the compatibility
          argument… [17年08月15日 16:44:35]
<mfabian> http://imgur.com/a/VhDKb [17年08月15日 16:45:10]
<mira|AO> ugh, that looks broken [17年08月15日 16:45:48]
<mira|AO> one moment [17年08月15日 16:45:56]
<mfabian> Obviously single width, though. [17年08月15日 16:45:58]
<mfabian> Here in the symbola font it is also rather not double width:
          http://imgur.com/a/RLV7A [17年08月15日 16:46:14]
<mfabian> Double width characters really ought to be square.
                                                          [17年08月15日 16:46:45]
<mira|AO> ugh, I hate websites [17年08月15日 16:48:04]
<mira|AO> with most TTFs it does not matter anyway, they are all somewhere in
          between… [17年08月15日 16:48:27]
<mfabian> Well, most characters that are definitely double width, such as
          kanji, really are square. [17年08月15日 16:49:05]
<mira|AO> http://i.imgur.com/JvrNiVN.png [17年08月15日 16:49:10]
<mfabian> Yes, in your font the hexagram really is square.
                                                          [17年08月15日 16:49:35]
<mira|AO> on https://en.wikipedia.org/wiki/List_of_hexagrams_of_the_I_Ching
          they are, too [17年08月15日 16:49:59]
<mira|AO> and probably traditionally as well [17年08月15日 16:50:07]
<mfabian> But since the hexagrams are rather on the wide side even in the
          fonts where they are not square, double width probably still works
          better in terminals. [17年08月15日 16:50:23]
<mira|AO> exactly [17年08月15日 16:50:29]
<mira|AO> plus, compatibility… [17年08月15日 16:50:35]
<mfabian> On the Wikipedia page the hexagrams are not square for me, probably
          because I have different fonts. [17年08月15日 16:51:10]
<mfabian> So I think I will simply push your patch as it is, and then ask
          Unicode what they think about it. [17年08月15日 16:51:38]
<mira|AO> no, in the images on the right [17年08月15日 16:52:01]
<mfabian> Yes, OK, the images on the right are almost square.
                                                          [17年08月15日 16:52:36]
<mira|AO> the width on the page follows directly from the width in the
          (proportional) fonts your graphical browser renders them with, or
          from wcwidth in the underlying terminal with lynx and the like
                                                          [17年08月15日 16:52:44]
<mira|AO> I would advise font designers, where applicable, to make them
          square; in proportional fonts they may then still be a bit narrower
          than exactly double the width of e.g. a W, but that is in the
          nature of things [17年08月15日 16:53:25]
<mira|AO> with non-proportional vector fonts this kind of thing is always
          tricky anyway, but the font designers can manage it well enough
                                                          [17年08月15日 16:53:41]
<mira|AO> in http://unicode.org/charts/PDF/U4DC0.pdf they are marginally
          narrower than square, but clearly wider than halfwidth
                                                          [17年08月15日 16:54:40]
<mfabian> I will push this to glibc master, but then it will only appear in
          the 2.27 release. [17年08月15日 16:54:44]
<mira|AO> okay [17年08月15日 16:54:48]
<mira|AO> that already helps, at least [17年08月15日 16:54:52]
<mira|AO> what one could still do afterwards: [17年08月15日 16:55:00]
<mira|AO> go through the width data (I scripted that locally in shell) and
          merge consecutive equal widths into ranges [17年08月15日 16:55:23]
<mfabian> I could build that into the generator script as well ...
                                                          [17年08月15日 16:55:57]
<mira|AO> but I did not want to bloat the patch, and afterwards I still have
          to do another round through the FSF copyright papers (I do already
          have them for other FSF software, but you know how they are…), and
          with the small changes in the patch they are still more or less
          trivial [17年08月15日 16:55:58]
<mira|AO> or that, yes [17年08月15日 16:56:02]
* mira|AO just knows Korn Shell better than Python [17年08月15日 16:56:15]
<mfabian> I will push your patch as it is for now, and if I then have a bit
          of time left I will also do the merging of the ranges.
                                                          [17年08月15日 16:56:42]
<mira|AO> for comparison: [17年08月15日 16:57:00]
<mfabian> In any case, many thanks for the patches! [17年08月15日 16:57:09]
<mira|AO>
          http://www.mirbsd.org/~tg/Debs/dists/jessie/wtf/Pkgs/mirabilos-support/mirabilos-support_40_all.tar.gz
          contains, in the tarball under
          mirabilos-support-40/examples/UTF-8.gz, what came out at my end
                                                          [17年08月15日 16:57:13]
<mira|AO> my pleasure, and thanks for being accommodating
                                                          [17年08月15日 16:57:28]
<mira|AO> it is not always the case that outsiders get listened to ;-)
                                                          [17年08月15日 16:57:40]
<mira|AO> hm, what else we could do is compare the result of wcwidth() across
          all characters, but AFAICT they should afterwards be the same on
          MirBSD-current and glibc (for all characters that are defined; you
          return nothing for undefined ones because of the test suite, after
          all) [17年08月15日 16:58:36]
<mira|AO> we currently still have the EAW ranges predefined as width 2, but
          with the planned code change no longer (then I will switch from
          binary search to indexed table lookups) [17年08月15日 16:59:18]
<mira|AO> in case you want to test locally with roughly the same font I have:
                                                          [17年08月15日 17:01:02]
<mira|AO> uxterm -fn -misc-fixed-medium-r-normal--18-120-100-100-c-90-iso10646-1
          -fw -misc-fixed-medium-r-normal-ko-18-120-100-100-c-180-iso10646-1
                                                          [17年08月15日 17:01:41]
<mira|AO> from xfonts-base [17年08月15日 17:01:50]
<mfabian> Thanks! [17年08月15日 17:02:54]
<mira|AO> Thank you too! [17年08月15日 17:03:01]
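
The follow-up idea from the end of the chatlog, merging consecutive codepoints of equal width into ranges and looking them up by binary search, might be sketched as follows. This is an illustration only, not the shell script or the generator change mentioned in the chat; the names `merge_ranges` and `lookup` and the default width of 1 for unlisted codepoints are assumptions.

```python
import bisect


def merge_ranges(width_by_codepoint):
    """Collapse {codepoint: width} into sorted (first, last, width) tuples."""
    ranges = []
    for codepoint in sorted(width_by_codepoint):
        width = width_by_codepoint[codepoint]
        if ranges and ranges[-1][1] == codepoint - 1 and ranges[-1][2] == width:
            # Adjacent codepoint with the same width: extend the last range.
            ranges[-1] = (ranges[-1][0], codepoint, width)
        else:
            ranges.append((codepoint, codepoint, width))
    return ranges


def lookup(ranges, codepoint, default=1):
    """Binary-search the merged ranges; unlisted codepoints get `default`."""
    starts = [first for first, _last, _width in ranges]
    index = bisect.bisect_right(starts, codepoint) - 1
    if index >= 0 and ranges[index][1] >= codepoint:
        return ranges[index][2]
    return default
```

An indexed table, as mentioned for MirBSD, would trade this logarithmic search for a constant-time array access at the cost of more memory.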

Comment 5 Mike FABIAN 2017-08-16 14:13:49 UTC

Summary of the chatlog in the last comment

https://sourceware.org/bugzilla/show_bug.cgi?id=21750#c4

in English:

Thorsten and I agree on setting the width of U+3248..U+324F
to 2 because the glyphs for these characters are square in
most fonts.

(I also asked on the
Unicode mailing list now whether this could be a bug in
the Unicode data: http://www.unicode.org/mail-arch/unicode-ml/y2017-m08/0007.html
But even if it is not a bug, setting these to 2 seems to be
much better for users of terminals and that is what wcwidth
in glibc is mostly used for after all).

We also agree on setting the width of the hexagrams U+4DC0..U+4DFF
to 2. Their glyphs are considerably wider than single width in most
fonts. In some classic Xorg fonts they are fully double width. In most
scalable fonts they are somewhat narrower than double width but
considerably wider than single width. So marking them as width 1 would
cause problems in terminals; even if they are not fully double width,
it makes sense to mark them as width 2 because they certainly
won’t fit in a single character cell in a terminal.

We also agree that the Hangul Jamo U+1160‥U+11FF are sort
of "combining characters" although they are not marked as such
in the Unicode data. But they are fragments of Hangul characters
which combine. So it seems correct to mark them as width 0.
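
A small worked example of why width 0 is consistent for the medial and final jamo: a syllable written as three conjoining jamo normalizes to one precomposed syllable, and the widths under the rule above add up to the width of that single wide character. This is a sketch using Python’s `unicodedata`; the helper `jamo_width` is hypothetical, and the width 2 for the initial consonants is taken from their "W" entry in EastAsianWidth.txt.

```python
import unicodedata


def jamo_width(char):
    """Width of one conjoining jamo under the rule agreed in this bug."""
    codepoint = ord(char)
    if 0x1100 <= codepoint <= 0x115F:   # choseong (initial consonants)
        return 2
    if 0x1160 <= codepoint <= 0x11FF:   # jungseong and jongseong
        return 0
    raise ValueError("not a conjoining jamo in U+1100..U+11FF")


# HANGUL CHOSEONG HIEUH + JUNGSEONG A + JONGSEONG NIEUN
syllable_as_jamo = "\u1112\u1161\u11ab"
precomposed = unicodedata.normalize("NFC", syllable_as_jamo)
total_width = sum(jamo_width(char) for char in syllable_as_jamo)
```

Here `precomposed` is the single codepoint U+D55C, and `total_width` is 2, the same number of cells a terminal reserves for the precomposed syllable.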

Comment 6 Mike FABIAN 2017-08-16 14:17:00 UTC

And we also agree that setting the width of the soft hyphen U+00AD
to 0 as in Unicode does not seem helpful for terminal
applications, and as wcwidth is mostly important for terminal
applications, it makes sense to keep the width of U+00AD set to
1 as it "historically" always was in wcwidth.

Comment 7 Egmont Koblinger 2017-08-16 15:28:04 UTC

(In reply to Mike FABIAN from comment #5)

> [...] setting these to 2 seems to be
> much better for users of terminals and that is what wcwidth
> in glibc is mostly used for after all).

Guys,

With huge thanks and great respect for your work on addressing these issues, please allow me to firmly oppose deviating from the Unicode database.

The width is probably indeed primarily used by terminal emulators and apps running inside them. They, however, use all kinds of sources for this data, not just glibc's wcwidth().

For example, VTE-based emulators (such as GNOME Terminal) rely on glib's g_unichar_iswide(), see [1]. Alas I don't have any usage metrics, but the poll at [2] suggests that VTE's usage share amongst terminal emulators on Linux might be somewhere in the ballpark of 50%.

As for apps, if my memory is correct, I believe Vim uses its own built-in database rather than wcwidth(). So does the Joe text editor [3] (okay, it's a really marginal one), and presumably many more apps.

Let alone all other non-glibc based systems with their own wcwidth() implementation that one might ssh to/from.

For apps inside terminal emulators to work correctly, it's crucial that all the relevant components agree on the width. This has caused quite a headache when Unicode 9.0 changed the width of plenty of codepoints, see e.g. the bugreport with animgif at [4] (and tons of duplicates in other bugzillas and stackoverflow forums), but this is going to fade away as eventually everyone's upgrading their Unicode version.

You cannot, however, reasonably assume that other folks out there, i.e. terminal emulators as well as applications that don't rely on wcwidth() but some other data source, or those other data sources such as glib and probably a whole lot more, are all going to apply your modifications. And then again we haven't talked about ssh'ing to/from non-glibc systems.

If a certain glyph does not fit in its designated character cell, most terminal emulators will overflow it to the next cell. A slight overflow happens at way more codepoints than the ones debated now, e.g. in case of VTE and a not too large font, even the antialiasing of English letters such as 'W', 'm' overflows to the next cell. Of course I understand that the overflow in case of U+3248 "㉈" and friends is way more prominent, potentially causing the given and the subsequent glyph not to be readable at all, which is indeed bad.

But causing the entire canvas's contents to fall apart is even worse. And that's what typically happens when players of the game disagree on the width, as seen e.g. again at [4].

If you'd really like to see these particular codepoints becoming double wide (which I'm also in favor of), I firmly believe this change should be made in the Unicode database first, so that eventually everyone implementing a wcwidth()-like method gets that update; rather than just glibc, resulting in a long-term disagreement between parties and in turn inevitable corruption of the entire terminal window in quite a few terminal emulators and apps.

[1] https://bugzilla.gnome.org/show_bug.cgi?id=772890
[2] https://opensource.com/life/15/11/top-open-source-terminal-emulators
[3] https://sourceforge.net/p/joe-editor/bugs/363/
[4] https://github.com/powerline/powerline/issues/1652

Comment 8 Thorsten Glaser 2017-08-16 18:18:44 UTC

Hi Egmont,

only a short response because we have FrOSCon/FrogLabs preparations and workshop until Monday:

We’re not strictly speaking deviating from UCD because UCD does *not* define wcwidth.

Terminal emulators use wcwidth, especially xterm uses ONLY it *and* defines it.

Applications such as editors in the terminal (cf. jupp) use wcwidth or carry their own data which is prepared the same way as wcwidth (often they use a copy of xterm's code).

You speak of compatibility and breaking. Strictly speaking, the switch glibc recently (two or three majors, I think) did to regenerated data *did* break applications, and this bugreport is 100% returning the glibc data to the way it was before in the places the previous change introduced bugs, while still keeping it up-to-date with recent Unicode.

So, therefore, with this patch applied, less things will break than without.

Outlyers like libglib (used by only one of the multitude of terminal emulators) can then import the data (and mechanism used to generate) from here.

Other systems use the old wcwidth code from xterm, with which this one (with my patches applied) is compatible for all chars that were not changed in or added to Unicode. That is the maximum compatibility, and it is an easily achievable goal that should be achieved.

Comment 9 Sourceware Commits 2017-08-17 09:07:13 UTC

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  bb6274ee1293a6bc76d9d7c889783303de181295 (commit)
       via  c14b84baae83bfb73f7cd00ba7c24964ad1c712c (commit)
       via  7a79e321c6f85b204036c33d85f6b2aa794e7c76 (commit)
       via  267ee5d7ab57591a6b1bc2d2a010c88188427063 (commit)
       via  41b6f0ce85d98c62739b04863e8c38a1f4154e80 (commit)
       via  580be3035d2e0f479c4ac955bf719b0bf936f5cf (commit)
      from  038d1cafafb3094a9fbebd35f4aa8d0ebae0e55b (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bb6274ee1293a6bc76d9d7c889783303de181295

commit bb6274ee1293a6bc76d9d7c889783303de181295
Author: Akhilesh Kumar <akhilesh.k@samsung.com>
Date:   Wed Aug 16 15:33:58 2017 +0530

    Fix abmon for bem_ZM
    
    Until now the abbreviated month names were in English.
    
    	[BZ #21960]
    	* locales/bem_ZM (LC_TIME): Fix abmon, make it agree with CLDR.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c14b84baae83bfb73f7cd00ba7c24964ad1c712c

commit c14b84baae83bfb73f7cd00ba7c24964ad1c712c
Author: Akhilesh Kumar <akhilesh.k@samsung.com>
Date:   Wed Aug 16 18:01:53 2017 +0530

    Fix country name for xh_ZA
    
    	[BZ #21959]
    	* locales/xh_ZA (LC_ADDRESS): Fix country name.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7a79e321c6f85b204036c33d85f6b2aa794e7c76

commit 7a79e321c6f85b204036c33d85f6b2aa794e7c76
Author: Thorsten Glaser <tg@mirbsd.de>
Date:   Fri Jul 14 14:02:50 2017 +0200

    Refresh generated charmap data and ChangeLog
    
    	[BZ #21750]
    	* charmaps/UTF-8: Refresh.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=267ee5d7ab57591a6b1bc2d2a010c88188427063

commit 267ee5d7ab57591a6b1bc2d2a010c88188427063
Author: Thorsten Glaser <tg@mirbsd.de>
Date:   Fri Jul 14 14:02:46 2017 +0200

    Resolve some historically special cases of ambiguous width
    
    [BZ #21750]
    * unicode-gen/utf8_gen.py (U+00AD): Set width to 1.
    * unicode-gen/utf8_gen.py (U+1160..U+11FF): Set width to 0.
    * unicode-gen/utf8_gen.py (U+3248..U+324F): Set width to 2.
    * unicode-gen/utf8_gen.py (U+4DC0..U+4DFF): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=41b6f0ce85d98c62739b04863e8c38a1f4154e80

commit 41b6f0ce85d98c62739b04863e8c38a1f4154e80
Author: Thorsten Glaser <tg@mirbsd.de>
Date:   Fri Jul 14 14:02:44 2017 +0200

    Handle more cases of combining characters
    
    [BZ #21750]
    * unicode-gen/utf8_gen.py: Treat category Me and Mn as combining.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=580be3035d2e0f479c4ac955bf719b0bf936f5cf

commit 580be3035d2e0f479c4ac955bf719b0bf936f5cf
Author: Thorsten Glaser <tg@mirbsd.de>
Date:   Fri Jul 14 14:02:37 2017 +0200

    UnicodeData has precedence over EastAsianWidth
    
    [BZ #19852]
    [BZ #21750]
    * unicode-gen/utf8_gen.py: Process EastAsianWidth lines before
      UnicodeData lines so the latter have precedence; remove hack
      to group output by EastAsianWidth ranges.

-----------------------------------------------------------------------

Summary of changes:
 localedata/ChangeLog               |   24 +
 localedata/charmaps/UTF-8          |111468 +++++++++++++++++++++++++++++++++++-
 localedata/locales/bem_ZM          |   25 +-
 localedata/locales/xh_ZA           |    5 +-
 localedata/unicode-gen/utf8_gen.py |   38 +-
 5 files changed, 111400 insertions(+), 160 deletions(-)
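
The precedence described in the commit messages above (EastAsianWidth first, UnicodeData on top of it, then the historical special cases) can be sketched roughly as follows. This is not the code from utf8_gen.py: the names `sketch_width` and `SPECIAL_CASES` are made up for illustration, Python's `unicodedata` module stands in for the UCD files (so results for recently added characters depend on the interpreter's Unicode version), and control characters are ignored. Treating Cf and bidi class NSM as zero width is an assumption taken from the wording of the original report.

```python
import unicodedata

# Special cases listed in the commit for BZ #21750: (first, last, width).
SPECIAL_CASES = [
    (0x00AD, 0x00AD, 1),  # soft hyphen
    (0x1160, 0x11FF, 0),  # Hangul Jamo medial vowels and final consonants
    (0x3248, 0x324F, 2),  # circled numbers on black squares
    (0x4DC0, 0x4DFF, 2),  # second range named in the commit
]


def sketch_width(cp):
    """Simplified column width for code point cp (illustration only)."""
    ch = chr(cp)
    # Highest precedence: the explicitly listed special cases.
    for first, last, width in SPECIAL_CASES:
        if first <= cp <= last:
            return width
    # Next: UnicodeData. Combining marks and format characters
    # take no column of their own.
    if (unicodedata.category(ch) in ("Me", "Mn", "Cf")
            or unicodedata.bidirectional(ch) == "NSM"):
        return 0
    # Lowest precedence: EastAsianWidth. Wide and Fullwidth take two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


for cp in (0x0041, 0x00AD, 0x0301, 0x1160, 0x3248, 0x4E00):
    print("U+{:04X} -> {}".format(cp, sketch_width(cp)))
```

Without the first table, U+00AD would fall into the Cf branch and come out as width 0, which is the divergence from xterm that item 1 of the original report describes.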

Comment 10 Mike FABIAN 2017-08-17 13:46:12 UTC

FIXED.

Comment 11 Andreas Schwab 2017-08-18 06:27:13 UTC

Thorsten Glaser does not have an assignment for glibc on file, we cannot accept his contributions until this is sorted out.

Comment 12 Egmont Koblinger 2017-08-18 10:23:17 UTC

(In reply to Thorsten Glaser from comment #8)

> We’re not strictly speaking deviating from UCD because UCD does *not* define
> wcwidth.

Well, it defines the East_Asian_Width property from which you derive wcwidth using a couple of generic rules plus a few exceptions to them.

You've just (re?)added 3248..324F and a few other ranges to these exceptions, which in my eyes means that yes, you are deviating from Unicode.

> Terminal emulators use wcwidth, especially xterm uses ONLY it *and* defines
> it.
> 
> Applications such as editors in the terminal (cf. jupp) use wcwidth or carry
> their own data which is prepared the same way as wcwidth (often they use a
> copy of xterm's code).

To be more precise, xterm and a few others copy Markus Kuhn's implementation. I don't think anyone copies from xterm.

This defines the 3248..324F range as ambiguous (I've checked the most recent xterm-330 and a randomly chosen ~4 year old xterm-300; a randomly picked even older xterm-260 is different, which suggests that xterm caught up with the changes long ago), which, by default, means it is 1 cell wide in xterm (unless -cjk_width is specified, in which case it and all other ambiguous ones are turned into double)...

> You speak of compatibility and breaking. Strictly speaking, the switch glibc
> recently (two or three majors, I think) did to regenerated data *did* break
> applications, and this bugreport is 100% returning the glibc data to the way
> it was before in the places the previous change introduced bugs, while still
> keeping it up-to-date with recent Unicode.
> 
> So, therefore, with this patch applied, less things will break than without.

... so I absolutely don't get why less things would be broken now. As far as I can see, with this patch you have just further broken the handling of these codepoints by deviating from Unicode and from xterm.

> Outlyers like libglib (used by only one of the multitude of terminal
> emulators) can then import the data (and mechanism used to generate) from
> here.

You really don't seriously expect that two glibc maintainers decide over a chat that they add a few exceptions to the generic rules, and "outlyers" (like glib, maybe Qt, maybe Java, maybe some other "giant" pieces of (perhaps commercial) software, maybe other libc implementations of other Unices (like Mac), maybe a whole lot more) will follow; do you??

(And on a side note... IMHO submitting a change right after someone brings up some concerns, not even giving time for a reasonable discussion, isn't really a polite thing... Especially since recently it took me about 2 years and about 10-15 pings that were left unanswered to get through a well unittested locale change, I can't understand why this hurry now.)

Comment 13 Sourceware Commits 2017-08-18 13:17:08 UTC

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  486afa6d27156665959e59b86e7aad18c3832cbe (commit)
      from  a3fe6a20bf81ef6a97a761dac9050517e7fd7a1f (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=486afa6d27156665959e59b86e7aad18c3832cbe

commit 486afa6d27156665959e59b86e7aad18c3832cbe
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Fri Aug 18 13:41:34 2017 +0200

    Use the range notation in charmaps/UTF-8 for all ranges of neighbouring characters with the same width
    
    	[BZ #21750]
    	* charmaps/UTF-8: Use the range notation for all ranges
    	of neighbouring characters with the same width.

-----------------------------------------------------------------------

Summary of changes:
 localedata/ChangeLog      |    6 +
 localedata/charmaps/UTF-8 |113545 +--------------------------------------------
 2 files changed, 300 insertions(+), 113251 deletions(-)

Comment 14 Mike Frysinger 2017-09-03 16:33:27 UTC

this bug report has a lot of things in it.  i think each request in the original post should be split out into sep reports.  other than overall discussion about keeping things in sync, it's impossible to follow discussion about specific codepoints.

wrt U+00AD: https://www.cs.tut.fi/~jkorpela/shy.html

Comment 15 Mike Frysinger 2017-09-03 21:03:41 UTC

i've forked soft hyphen (U+00AD) into bug 22073 and Hangul Jamo into bug 22074.  feel free to take follow ups for those topics to those respective bugs so the discussion can stay focused and not get cluttered up.

i haven't looked into the other codepoints raised in the original comment, so if they aren't resolved, feel free to fork them out too.

Comment 16 Mike FABIAN 2017-09-04 09:51:18 UTC

(In reply to Mike Frysinger from comment #15)
> i've forked soft hyphen (U+00AD) into bug 22073 and Hangul Jamo into bug
> 22074.  feel free to take follow ups for those topics to those respective
> bugs so the discussion can stay focused and not get cluttered up.
> 
> i haven't looked into the other codepoints raised in the original comment,
> so if they aren't resolved, feel free to fork them out too.

For the code points

3248..324F;A # No [8] CIRCLED NUMBER TEN ON BLACK SQUARE..CIRCLED NUMBER EIGHTY ON BLACK SQUARE 

I asked on the unicode mailing list:

http://www.unicode.org/mail-arch/unicode-ml/y2017-m08/0007.html

And the response makes me think that we are free to use wcwidth 2 for
these in glibc if that fits our “context” best:

http://www.unicode.org/mail-arch/unicode-ml/y2017-m08/0023.html

> "A" means, you get to decide whether to treat these as "W" or "N" based on context.
>
> There's really not strong need to change an "A" towards "W", because
> "A" doesn't get in your way if you decided that "W" works better for
> you.
>
> Remember that all the EAW properties are supposed to be "resolved"
> down to W or N. For some, like Na that resolution is deterministic,
> for A it is context/application dependent, but when you finally
> process your data, only W(ide) or N(arrow) remain after resolution.
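
The resolution step described in that reply can be sketched like this. The function name `resolve_eaw` and the `ambiguous_is_wide` switch are made up for the example (the switch plays a role comparable to xterm's -cjk_width mentioned in comment 12), and Python's `unicodedata` module stands in for EastAsianWidth.txt:

```python
import unicodedata


def resolve_eaw(ch, ambiguous_is_wide=False):
    """Resolve the East_Asian_Width of ch down to "W" or "N"."""
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        # Wide and Fullwidth always resolve to wide.
        return "W"
    if eaw == "A":
        # Ambiguous: the caller decides based on its context.
        return "W" if ambiguous_is_wide else "N"
    # Na, H and N always resolve to narrow.
    return "N"


print(resolve_eaw("\u3248"))                          # narrow context
print(resolve_eaw("\u3248", ambiguous_is_wide=True))  # wide context
```

Only the "A" branch depends on the caller, which is why two implementations can both follow the Unicode data and still disagree on these code points.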

Comment 17 Sourceware Commits 2017-09-06 11:14:40 UTC

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  2ae5be041d9ea89cdd0f37734d72051e8f773947 (commit)
       via  af83ed5c4647bda196fc1a7efebbe8019aa83f4a (commit)
      from  4f3647e46e3f645c6516faa299efc6e89d520d7b (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2ae5be041d9ea89cdd0f37734d72051e8f773947

commit 2ae5be041d9ea89cdd0f37734d72051e8f773947
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Wed Sep 6 11:19:33 2017 +0200

    Improve utf8_gen.py to set the width for characters with Prepended_Concatenation_Mark property to 1
    
    	[BZ #22070]
    	* localedata/unicode-gen/utf8_gen.py: Set the width for
    	characters with Prepended_Concatenation_Mark property to 1
    	* localedata/charmaps/UTF-8: Updated using the improved script.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=af83ed5c4647bda196fc1a7efebbe8019aa83f4a

commit af83ed5c4647bda196fc1a7efebbe8019aa83f4a
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Fri Aug 18 10:12:29 2017 +0200

    Write all ranges of neighbouring characters with the same width using the range notation in charmaps/UTF-8
    
    Writing ranges of neighbouring characters with the same width like this
    
        <U000E0100>...<U000E01EF>	0
    
    in charmaps/UTF-8 is more efficient than writing many single character lines
    like:
    
        <U000E0100>	0
        <U000E0101>	0
        ...
    
    	[BZ #21750]
    	* unicode-gen/utf8_gen.py: Write all ranges of neighbouring characters
    	with the same width using the range notation in charmaps/UTF-8.

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                           |   14 +
 localedata/charmaps/UTF-8           |   10 +-
 localedata/unicode-gen/Makefile     |    4 +-
 localedata/unicode-gen/PropList.txt | 1618 +++++++++++++++++++++++++++++++++++
 localedata/unicode-gen/utf8_gen.py  |   84 ++-
 5 files changed, 1704 insertions(+), 26 deletions(-)
 create mode 100644 localedata/unicode-gen/PropList.txt
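
The range notation from the second commit above can be illustrated with a small sketch that collapses neighbouring code points of equal width into one line. This is not the code from utf8_gen.py; `ucs_symbol`, `format_run` and `collapse_ranges` are names chosen for the example, and the symbol format (4 hex digits in the BMP, 8 above it) is an assumption modelled on the `<U000E0100>` lines quoted in the commit message.

```python
def ucs_symbol(cp):
    """Format a code point like the symbols in charmaps/UTF-8 (assumed format)."""
    if cp < 0x10000:
        return "<U{:04X}>".format(cp)
    return "<U{:08X}>".format(cp)


def format_run(first, last, width):
    """Format one run as a single-character line or as a range line."""
    if first == last:
        return "{}\t{}".format(ucs_symbol(first), width)
    return "{}...{}\t{}".format(ucs_symbol(first), ucs_symbol(last), width)


def collapse_ranges(widths):
    """Turn a {code point: width} mapping into a list of output lines,
    merging neighbouring code points that share the same width."""
    lines = []
    run_start = run_end = run_width = None
    for cp, width in sorted(widths.items()):
        if run_start is not None and cp == run_end + 1 and width == run_width:
            run_end = cp  # extend the current run
            continue
        if run_start is not None:
            lines.append(format_run(run_start, run_end, run_width))
        run_start = run_end = cp
        run_width = width
    if run_start is not None:
        lines.append(format_run(run_start, run_end, run_width))
    return lines


# The 240 variation selectors from the commit message collapse to one line.
selectors = {0xE0100 + i: 0 for i in range(0xF0)}
for line in collapse_ranges(selectors):
    print(line)
```

A run is broken either by a gap in the code points or by a change in width, so isolated characters still get a line of their own.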

Comment 18 Mike FABIAN 2017-09-14 13:45:33 UTC

(In reply to Mike Frysinger from comment #15)
> i've forked soft hyphen (U+00AD) into bug 22073 and Hangul Jamo into bug
> 22074.  feel free to take follow ups for those topics to those respective
> bugs so the discussion can stay focused and not get cluttered up.
> 
> i haven't looked into the other codepoints raised in the original comment,
> so if they aren't resolved, feel free to fork them out too.

I think there is nothing more to do in this bug here, 
therefore I close it as FIXED.

(Copyright assignment by Thorsten Glaser is underway).

Comment 19 Thorsten Glaser 2017-09-14 16:38:47 UTC

I submitted it on Wed, 6 Sep 2017 15:15:38 +0000 (UTC)