I’ve compared the new autogenerated column width from
localedata/unicode-gen/utf8_gen.py with the results of the classical wcwidth()
implementation from xterm (adjusted to Unicode 10.0.0) and found a few
divergences (and bugs on my (MirBSD, which uses something based on xterm’s data
system-wide) side, which I fixed).

1. U+00AD is forced to width 1 in xterm, autodetected as combining in glibc

Rationale for forcing it to 1 is likely that U+0000‥U+00FF are latin1, which,
when displayed as 8bit on terminals, had no combining characters at all.

Change Request to glibc: force U+00AD to width 1.

2. The UCD has three codepoints that are Me/Mn category but not NSM bidi class:
U+0CBF U+0CC6 U-00011C3F

This is likely a bug in UCD but can be fixed by glibc treating Me/Mn the same
as Cf/NSM, which I do.

Change Request to glibc: handle Me/Mn category the same as NSM bidi class.

3. Hangul Jamo medial vowels and final consonants are set to 0 by xterm so they
combine on top of the preceding initial ones: U+1160‥U+11FF

Change Request to glibc: force U+1160‥U+11FF to width 0.

4. During parsing, EastAsianWidth data overrides UCD data, more specifically
the NSM property.

This leads to U+302A‥U+302D and – see also
https://sourceware.org/bugzilla/show_bug.cgi?id=19852 – U+3099 and U+309A being
treated as width 2.

Change Request to glibc: read EAW before UCD so the NSM overrides EAW here.

5. Ambiguous circled numbers and neutral hexagrams changed width

xterm used to set those to width 2, likely because they are ideographs and not
unlike zodiac signs and emoji (which, I notice, have been set to width 2 in UCD
nowadays)

Change Request to glibc: force U+3248‥U+324F and U+4DC0‥U+4DFF to width 2.
Note: I’ve initially reported the surprising change to Debian as
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=826256 but have redone the
research today (against 2.24 in Debian and git master commit
2a91300176a5991d9825eba085e502196a3f47cd in glibc) against Unicode 10,
double-checked *all* differences against MirBSD code and fixed a few bugs there
after making it possible to compare the results (considering glibc only puts
actually assigned codepoints into the localedata/charmaps/UTF-8 file).

Rationale for requesting the change in glibc is so that all systems I have
access to use the same width data, preventing display artifacts and glitches up
to making an editor somewhat unusable with heavy Unicode (I have test files
containing the entire Unicode range). Thank you for listening.

If necessary, I will provide patches (to utf8_gen.py most likely) when asked.
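The five change requests above can be read as one lookup rule. The following is only a minimal sketch of that rule, assuming Python's standard unicodedata module as a stand-in for the UCD files; the names FORCED and sketch_width are hypothetical and this is not the code in utf8_gen.py. It ignores control characters and unassigned codepoints, which a real wcwidth() treats separately.

```python
import unicodedata

# Ranges the report asks glibc to force, following classical xterm wcwidth.
# (first codepoint, last codepoint, column width)
FORCED = [
    (0x00AD, 0x00AD, 1),  # SOFT HYPHEN
    (0x1160, 0x11FF, 0),  # Hangul Jamo medial vowels and final consonants
    (0x3248, 0x324F, 2),  # circled numbers on black square
    (0x4DC0, 0x4DFF, 2),  # hexagrams
]


def sketch_width(cp):
    """Illustrative column width for one codepoint (hypothetical helper)."""
    # Requests 1, 3 and 5: the forced exceptions are checked first.
    for first, last, width in FORCED:
        if first <= cp <= last:
            return width
    ch = chr(cp)
    # Requests 2 and 4: combining properties win over East Asian Width,
    # and categories Me/Mn count even without the NSM bidi class.
    if (unicodedata.category(ch) in ("Me", "Mn", "Cf")
            or unicodedata.bidirectional(ch) == "NSM"):
        return 0
    # Only now is East Asian Width consulted.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1
```

Checking the combining properties before East Asian Width is what makes U+3099 come out as width 0 in this sketch even though it is listed as wide.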
Excuse my ignorance, but isn't U+00AD (soft hyphen) usually invisible,
i.e. zero columns? If an app breaks up words at end-of-line, it can use
the soft hyphens as helpers to detect the correct locations. The app can
then add a visible hyphen to the end of the line. (If the app also reads
from the terminal, then it can e.g. ignore visible hyphens when preceded
by a soft hyphen, or use some other mechanism to mark the character as
for terminal display only).

I am not suggesting a change, if xterm etc. multitude of apps are
already handling soft hyphens in some other manner, just wondering.

Troy

On Tue, 2017-07-11 at 14:18 +0000, tg at mirbsd dot de wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=21750
>
>             Bug ID: 21750
>            Summary: column width of characters incompatible with classical
>                     wcwidth
>            Product: glibc
>            Version: 2.26
>             Status: UNCONFIRMED
>           Severity: normal
>           Priority: P2
>          Component: localedata
>           Assignee: unassigned at sourceware dot org
>           Reporter: tg at mirbsd dot de
>                 CC: libc-locales at sourceware dot org
>   Target Milestone: ---
>
> [...]
(In reply to Troy Korjuslommi from comment #1)
> Excuse my ignorance, but isn't U+00AD (soft hyphen) usually invisible,
> i.e. zero columns? If an app breaks up words at end-of-line, it can use
> the soft hyphens as helpers to detect the correct locations. The app can

Yes, in theory. This codepoint could be used in the *input data* to determine
soft breaks. However (see below), apps should *not* output those to a terminal
emulator (GUIs that handle this themselves are likely fine).

> I am not suggesting a change, if xterm etc. multitude of apps are
> already handling soft hyphens in some other manner, just wondering.

Similar to U+0060 (the grave accent 「`」), however, terminal emulators have
been treating both ASCII (for U+0060) and 8-bit codepages like ISO 8859-1
(for U+00AD) as having a constant width of 1 for each (non-control) character
(for SBCS), and xterm’s wcwidth() code had special handling to force U+00AD
to 1:

/* […]
 *    - SOFT HYPHEN (U+00AD) has a column width of 1.
[…]
 */
[…]
  /* generated by "uniset +cat=Me +cat=Mn +cat=Cf -00AD +1160-11FF +200B c" */

Source:
http://www.mirbsd.org/cvs.cgi/X11/xc/programs/xterm/wcwidth.c?rev=1.1.103.1;content-type=text%2Fplain

So you’d want to output U+0060 U+0008 U+0061 (` + backspace + a) to get à on a
(printed) terminal (or in code that uses such to emulate them), and similarly,
strip soft hyphens from the output (or manifest them as regular ones) before
outputting a soft-wrapped text. This is mostly because the terminal emulator
will also not soft-wrap; it’ll break at the end of the line. So you’d convert
U+00AD to some kind of hyphen (hyphen-minus or U+2010 perhaps) followed by a
line break (⚠) if preparing something for terminal output.
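The advice above (use soft hyphens as break hints in the input, but never send them to the terminal) could be sketched roughly as follows. This is an illustration only: wrap_with_soft_hyphens is a hypothetical helper, it assumes every remaining character occupies exactly one column, and the choice of hyphen-minus as the visible hyphen is just one of the options mentioned.

```python
SHY = "\u00ad"  # SOFT HYPHEN


def wrap_with_soft_hyphens(text, columns, hyphen="-"):
    """Greedy word wrap that breaks at soft hyphens and never outputs them."""
    lines, current = [], ""
    for word in text.split(" "):
        pieces = word.split(SHY)  # break opportunities inside the word
        while pieces:
            sep = " " if current else ""
            whole = "".join(pieces)
            if len(current) + len(sep) + len(whole) <= columns:
                # The rest of the word fits; soft hyphens are simply dropped.
                current += sep + whole
                pieces = []
                continue
            # Find the longest prefix that fits with a visible hyphen.
            fit = 0
            for n in range(len(pieces) - 1, 0, -1):
                head = "".join(pieces[:n]) + hyphen
                if len(current) + len(sep) + len(head) <= columns:
                    fit = n
                    break
            if fit:
                # Manifest the soft hyphen as a real one, then break the line.
                lines.append(current + sep + "".join(pieces[:fit]) + hyphen)
                current, pieces = "", pieces[fit:]
            elif current:
                lines.append(current)  # start the word on a fresh line
                current = ""
            else:
                lines.append(whole)    # too long even alone: emit as is
                pieces = []
    if current:
        lines.append(current)
    return lines
```

For example, wrapping "hyphen\u00adation is use\u00adful" to 10 columns breaks the first word after "hyphen-" and prints "useful" without any trace of its soft hyphen.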
I noticed the incompatibilities especially when the hexagrams, one of which
I’m using for UI purposes, changed width. I then tried to discover all of them
in order to harmonise the width assumptions made by the various programs I
have access to, on all systems I use, with classical xterm wcwidth.c as the
base, since those widths are the domain of a fixed-cell terminal emulator more
than of anything else (which can use its own data, if necessary).

I do volunteer to provide patches, here and elsewhere, so that, with the same
UCD version as input, we get consistent output (and I’ve sanity-checked the
output I got before opening this report).
Created attachment 10257
tarball of “git am”able patches

I’ve done the patches and compared the output, which Looks Good To Me™.
Please apply.
(In reply to Thorsten Glaser from comment #3)
> Created attachment 10257
> tarball of “git am”able patches
>
> I’ve done the patches and compared the output, which Looks Good To Me™.
> Please apply.

The following is a chatlog of the discussion I had with Thorsten
about these patches. I also think the patches are OK, but I wanted
to make sure because they make the width data in glibc deviate a bit from
the width data in Unicode. So here is the chatlog:

<mfabian> mira|AO: Hello! The patch in https://sourceware.org/bugzilla/show_bug.cgi?id=21750#c3 is from you, isn't it? [17年08月15日 15:50:08]
<mira|AO> mfabian: yes [17年08月15日 16:13:39]
<mira|AO> I swapped in charmaps/UTF-8.gz locally on Debian, regenerated the locales, everything works as it should ☺ [17年08月15日 16:15:15]
<mfabian> mira|AO: Yes, the patch looks good! [17年08月15日 16:23:29]
<mfabian> I'm just not quite sure about the 4 exceptions that deliberately deviate from the data in UnicodeData.txt and EastAsianWidth.txt, namely: [17年08月15日 16:24:21]
<mfabian> + * unicode-gen/utf8_gen.py (U+00AD): Set width to 1. [17年08月15日 16:24:24]
<mfabian> + * unicode-gen/utf8_gen.py (U+1160..U+11FF): Set width to 0.
<mfabian> + * unicode-gen/utf8_gen.py (U+3248..U+324F): Set width to 2.
<mfabian> + * unicode-gen/utf8_gen.py (U+4DC0..U+4DFF): Likewise.
<mfabian>
<mfabian> The first two are that way because that is how it was in Markus Kuhn’s wcwidth ... [17年08月15日 16:24:46]
<mfabian> https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c [17年08月15日 16:24:56]
<mfabian> But then why are U+1160 ... U+11FF explicitly listed as "N" in EastAsianWidth.txt? [17年08月15日 16:26:02]
<mfabian> Setting U+3248..U+324F to width 2 feels right to me, because the glyphs for them are actually square in most of the fonts I have. [17年08月15日 16:26:54]
<mfabian> EastAsianWidth.txt has “3248..324F;A # No [8] CIRCLED NUMBER TEN ON BLACK SQUARE..CIRCLED NUMBER EIGHTY ON BLACK SQUARE”, i.e. “ambiguous”, meaning normally width 1 but width 2 in CJK legacy locales ... [17年08月15日 16:28:09]
<mfabian> But if they are actually square in most fonts, it seems OK to me to set the width to 2. [17年08月15日 16:28:34]
<mfabian> With the hexagrams “4DC0..4DFF;N # So [64] HEXAGRAM FOR THE CREATIVE HEAVEN..HEXAGRAM FOR BEFORE COMPLETION” it seems odder to me; they are in the “Deja Vu Sans” font, for example, and there their width is somewhere between single width and double width. [17年08月15日 16:29:45]
<mfabian> So I definitely like 0001-UnicodeData-has-precedence-over-EastAsianWidth.patch, and 0002-Handle-more-cases-of-combining-characters.patch too. [17年08月15日 16:30:26]
<mira|AO> mfabian: all four are like that [17年08月15日 16:30:35]
<mfabian> But I'm unsure about 0003-Resolve-some-historically-special-cases-of-ambiguous.patch ... [17年08月15日 16:30:40]
<mira|AO> the hexagrams simply don't work at all in narrow (look at 9x18, that's what I base my font work on) [17年08月15日 16:31:11]
<mira|AO> 1160‥11FF, hang on, I'm just looking [17年08月15日 16:31:25]
<mfabian> We would be deviating from the Unicode Standard there for “historical” reasons? [17年08月15日 16:31:26]
<mira|AO> oh, I see [17年08月15日 16:31:31]
<mfabian> Or is the Unicode Standard wrong there? [17年08月15日 16:31:38]
<mira|AO> that is because these are components of Korean characters [17年08月15日 16:31:40]
<mira|AO> namely [17年08月15日 16:31:44]
<mfabian> Should we report that as a bug against the Unicode Standard? [17年08月15日 16:31:56]
<mira|AO> you always have a choseong (1100‥115F), a jungseong and a jongseong, and together they make up one character [17年08月15日 16:32:20]
<mira|AO> i.e. they combine with each other [17年08月15日 16:32:24]
<mira|AO> individually these are all Korean characters (or fragments of them), but mgk has 1160‥11FF as combining because that is how they are supposed to be rendered when they directly follow one another [17年08月15日 16:32:52]
<mfabian> Right, but why are they listed as "N" in EastAsianWidth? [17年08月15日 16:33:00]
<mira|AO> 16:31⎜«mfabian» Should we report that as a bug against the Unicode Standard? ← no, Unicode properties and fixed-width terminal column size are not congruent [17年08月15日 16:33:20]
<mira|AO> hm, hang on [17年08月15日 16:33:31]
<mfabian> But EastAsianWidth.txt is actually meant for exactly that, for terminals. [17年08月15日 16:33:55]
<mira|AO> no, that's just it, it isn't [17年08月15日 16:34:02]
<mira|AO> it is meant for weird old Asian DBCS and their mapping [17年08月15日 16:34:11]
<mira|AO> otherwise e.g. ambiguous wouldn't exist [17年08月15日 16:34:19]
<mira|AO> so yes, it is _one_ data source, but for wcwidth not the only source of truth [17年08月15日 16:34:35]
<mira|AO> right, hang on, let me summarise [17年08月15日 16:35:01]
<mira|AO> 00AD: makes no sense on terminals, they break at the end of the line, not at soft hyphens; if a text editor uses it internally, fine, but otherwise the first 256 characters are latin1, and latin1 always has wcwidth=1 for non-control characters, hence an explicit deviation Unicode <-> wcwidth [17年08月15日 16:35:54]
<mfabian> Yes, the "A" characters are really width 2 in “weird old Asian DBCS”, otherwise width 1. [17年08月15日 16:35:56]
<mira|AO> 1160-11FF: see above [17年08月15日 16:36:00]
<mira|AO> you can drop 1160-11FF on Unicode as an enquiry, if you like [17年08月15日 16:36:11]
<mira|AO> 3248-324F: presumably a Unicode bug [17年08月15日 16:36:27]
<mira|AO> 4DC0-4DFF: I would suggest to Unicode that they change it; since 9 (IIRC) they have finally given the emoji width 2 as well; in halfwidth they make no sense rendering-wise, nearly all pixel fonts are too small for that [17年08月15日 16:37:04]
<mira|AO> and by their nature they are Asian characters too [17年08月15日 16:37:16]
<mira|AO> I just happened to notice because I use U+3000 (as a space) and ䷀ in my font editor, since they are exactly square [17年08月15日 16:37:32]
<mfabian> Right, the emoji really make no sense at width 1. [17年08月15日 16:37:33]
<mira|AO> yep, and besides, e.g. KDE konsole rendered them in 2 anyway, which led to great effects when mixing ASCII/emoji on one line… [17年08月15日 16:38:07]
<mira|AO> so, I would change all four in glibc as “wcwidth, historical reasons”, and drop the lower three on Unicode [17年08月15日 16:38:32]
<mira|AO> back then I went to the trouble of working out what mgk had changed relative to _his_ Unicode base version, and of documenting that (in the code) [17年08月15日 16:39:03]
<mira|AO> I also asked him once whether he wants to update it, but he said he saw no point in that, because then it would no longer be uniform [17年08月15日 16:39:44]
<mfabian> I see ䷀ U+4DC0 here in IRC right now as single width ("DejaVu Sans Mono" font) [17年08月15日 16:39:53]
<mira|AO> but then the rendering breaks, and that's why I looked into it [17年08月15日 16:39:56]
<mira|AO> I have Fixed Misc 9x18 / 18x18, hence [17年08月15日 16:40:05]
<mira|AO> but take e.g. the xterm default font 6x13 [17年08月15日 16:40:22]
<mira|AO> there you simply don't have enough pixels for some of the hexagrams [17年08月15日 16:40:35]
<mfabian> I don't know whether the old pixel fonts are still relevant today ... [17年08月15日 16:40:52]
<mira|AO> of course they are [17年08月15日 16:41:03]
<mira|AO> wcwidth is designed for them, after all [17年08月15日 16:41:09]
<mira|AO> and in terminals they are authoritativ [17年08月15日 16:41:16]
<mira|AO> e [17年08月15日 16:41:17]
<mfabian> Well, I have been using only scalable fonts for years, in terminals too. [17年08月15日 16:41:37]
<mira|AO> only with them do you get stuff like line-drawing characters ┌─┐ and so on right [17年08月15日 16:41:38]
<mira|AO> hm, but many people don't [17年08月15日 16:41:47]
<mira|AO> I do a lot with UTF-8 text files [17年08月15日 16:41:53]
<mira|AO> for a long time only xterm could display them correctly, lately others have partly caught up, but xterm is the yardstick there [17年08月15日 16:42:14]
<mira|AO> FixedMisc has >40k characters [17年08月15日 16:42:24]
<mira|AO> that helps [17年08月15日 16:42:26]
<mfabian> And for DejaVu Sans Mono, double width for the hexagrams would then be rather wrong. But I also don't know how relevant this font is ... [17年08月15日 16:42:29]
<mira|AO> plus you can use it e.g. in grub, syslinux, … [17年08月15日 16:42:34]
<mira|AO> I believe they also work in fullwidth (not double width) in dejavu (that's Bitstream, right?) [17年08月15日 16:42:57]
<mira|AO> just quickly started an xterm with it, it would work visually [17年08月15日 16:43:48]
<mira|AO> in FixedMisc, too, they are slightly less tall than a capital letter, but square [17年08月15日 16:44:01]
<mira|AO> as one expects of Asian characters (for easier legibility; many Asians have real trouble with proportional type, I was told) [17年08月15日 16:44:26]
<mira|AO> well, and last but not least there remains the compatibility argument… [17年08月15日 16:44:35]
<mfabian> http://imgur.com/a/VhDKb [17年08月15日 16:45:10]
<mira|AO> ugh, that looks broken [17年08月15日 16:45:48]
<mira|AO> one moment [17年08月15日 16:45:56]
<mfabian> Obviously single width, though. [17年08月15日 16:45:58]
<mfabian> Here in the symbola font it is also rather not double width: http://imgur.com/a/RLV7A [17年08月15日 16:46:14]
<mfabian> Double width characters should really be square. [17年08月15日 16:46:45]
<mira|AO> ugh, I hate websites [17年08月15日 16:48:04]
<mira|AO> with most TTFs it doesn't matter anyway, they are all somewhere in between… [17年08月15日 16:48:27]
<mfabian> Well, most characters that are definitely double width, such as Kanji, really are square. [17年08月15日 16:49:05]
<mira|AO> http://i.imgur.com/JvrNiVN.png [17年08月15日 16:49:10]
<mfabian> Yes, in your font the hexagram really is square. [17年08月15日 16:49:35]
<mira|AO> on https://en.wikipedia.org/wiki/List_of_hexagrams_of_the_I_Ching they are too [17年08月15日 16:49:59]
<mira|AO> and presumably traditionally as well [17年08月15日 16:50:07]
<mfabian> But since the hexagrams are rather on the wide side even in the fonts where they are not square, double width probably still works better in terminals. [17年08月15日 16:50:23]
<mira|AO> exactly [17年08月15日 16:50:29]
<mira|AO> plus, compatibility… [17年08月15日 16:50:35]
<mfabian> Well, on the Wikipedia page the hexagrams are not square for me, probably because I have different fonts. [17年08月15日 16:51:10]
<mfabian> So I think I will simply push your patch as it is, and then ask Unicode what they think about it. [17年08月15日 16:51:38]
<mira|AO> no, in the pictures on the right [17年08月15日 16:52:01]
<mfabian> Yes, OK, the pictures on the right are almost square. [17年08月15日 16:52:36]
<mira|AO> the width on the page follows directly from the width in the (proportional) fonts your graphical browser renders them with, or from wcwidth in the underlying terminal for lynx and the like [17年08月15日 16:52:44]
<mira|AO> if need be I would advise the font designers to make them square; in proportional fonts they may then still be somewhat narrower than exactly twice the width of e.g. a W, but that is in the nature of things [17年08月15日 16:53:25]
<mira|AO> with non-proportional vector fonts that sort of thing is always exciting anyway, but the font designers manage it well enough [17年08月15日 16:53:41]
<mira|AO> in http://unicode.org/charts/PDF/U4DC0.pdf they are marginally narrower than square, but clearly wider than halfwidth [17年08月15日 16:54:40]
<mfabian> I'll push this to glibc master, but then it will only arrive in the 2.27 release. [17年08月15日 16:54:44]
<mira|AO> okay [17年08月15日 16:54:48]
<mira|AO> that already helps, at least [17年08月15日 16:54:52]
<mira|AO> what one could still do afterwards: [17年08月15日 16:55:00]
<mira|AO> go through the width data (I scripted that locally in shell) and merge consecutive equal widths into ranges [17年08月15日 16:55:23]
<mfabian> I could build that into the generator script as well ... [17年08月15日 16:55:57]
<mira|AO> but I didn't want to bloat the patch, and afterwards I still have to go another round through the FSF copyright papers (I already have them for other FSF software, but you know how they are…), and with the small changes in the patch they are still rather trivial [17年08月15日 16:55:58]
<mira|AO> or that, yes [17年08月15日 16:56:02]
* mira|AO just knows Korn Shell better than Python [17年08月15日 16:56:15]
<mfabian> I'll push your patch as is for now, and if I then still have a bit of time I'll do the merging of the ranges as well. [17年08月15日 16:56:42]
<mira|AO> for comparison: [17年08月15日 16:57:00]
<mfabian> In any case, many thanks for the patches! [17年08月15日 16:57:09]
<mira|AO> http://www.mirbsd.org/~tg/Debs/dists/jessie/wtf/Pkgs/mirabilos-support/mirabilos-support_40_all.tar.gz contains, in the tarball under mirabilos-support-40/examples/UTF-8.gz, what came out at my end [17年08月15日 16:57:13]
<mira|AO> you're welcome, and thanks for being accommodating [17年08月15日 16:57:28]
<mira|AO> it's not always the case that outsiders get listened to ;-) [17年08月15日 16:57:40]
<mira|AO> hm, what we could still do is compare the result of wcwidth() across all characters, but AFAICT they should afterwards be the same on MirBSD-current and glibc (for all characters that are defined; you return nothing for undefined ones because of the test suite) [17年08月15日 16:58:36]
<mira|AO> we currently still have the EAW ranges predefined as width 2, but with the planned code change no longer (then I'll switch from binary search to indexed table lookups) [17年08月15日 16:59:18]
<mira|AO> in case you want to test locally with roughly the same font as the one I have: [17年08月15日 17:01:02]
<mira|AO> uxterm -fn -misc-fixed-medium-r-normal--18-120-100-100-c-90-iso10646-1 -fw -misc-fixed-medium-r-normal-ko-18-120-100-100-c-180-iso10646-1 [17年08月15日 17:01:41]
<mira|AO> from xfonts-base [17年08月15日 17:01:50]
<mfabian> Thanks! [17年08月15日 17:02:54]
<mira|AO> Thank you too! [17年08月15日 17:03:01]
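The follow-up idea from the chat, merging consecutive codepoints of equal width into ranges, might look roughly like this. It is a sketch only: merge_widths is a hypothetical name, the input is assumed to be a plain mapping from codepoint to width, and gaps (such as unassigned codepoints, which glibc leaves out of the charmap) are deliberately not bridged.

```python
def merge_widths(widths):
    """Collapse {codepoint: width} into sorted (first, last, width) ranges."""
    ranges = []
    for cp in sorted(widths):
        width = widths[cp]
        if ranges and ranges[-1][1] == cp - 1 and ranges[-1][2] == width:
            # Directly adjacent and same width: extend the previous range.
            first, _, _ = ranges[-1]
            ranges[-1] = (first, cp, width)
        else:
            # Gap or different width: start a new range.
            ranges.append((cp, cp, width))
    return ranges
```

Whether the generator should also join ranges across unassigned codepoints is a separate decision that this sketch leaves open.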
Summary of the chatlog in the last comment
https://sourceware.org/bugzilla/show_bug.cgi?id=21750#c4
in English:

Thorsten and I agree on setting the width of U+3248..U+324F to 2,
because the glyphs for these characters are square in most fonts.
(I have now also asked on the Unicode mailing list whether this could
be a bug in the Unicode data:
http://www.unicode.org/mail-arch/unicode-ml/y2017-m08/0007.html
But even if it is not a bug, setting these to 2 seems to be
much better for users of terminals, and that is what wcwidth
in glibc is mostly used for after all.)

We also agree on setting the width of the hexagrams U+4DC0..U+4DFF to 2,
because they are considerably wider than single width in most fonts.
In some classic Xorg fonts they are fully double width. In most scalable
fonts they are somewhat narrower than double width but considerably
wider than single width. So marking them as width 1 would cause
problems in terminals; even if they are not fully double width,
it makes sense to mark them as width 2 because they certainly
won’t fit in a single character cell in a terminal.

We also agree that the Hangul Jamo U+1160‥U+11FF are sort of
"combining characters", although they are not marked as such
in the Unicode data: they are fragments of Hangul characters
which combine. So it seems correct to mark them as width 0.

And we also agree that setting the width of the soft hyphen U+00AD
to 0 as in Unicode does not seem helpful for terminal applications.
As wcwidth is mostly important for terminal applications,
it makes sense to keep the width of U+00AD at 1, as it
"historically" always was in wcwidth.
(In reply to Mike FABIAN from comment #5)
> [...] setting these to 2 seems to be
> much better for users of terminals and that is what wcwidth
> in glibc is mostly used for after all).

Guys,

With a huge thanks and great respect towards you working on addressing these
issues, please allow me to firmly oppose deviating from the Unicode database.

The width is probably indeed primarily used by terminal emulators and apps
running inside. They, however, use all kinds of various sources for this data,
not just glibc's wcwidth().

For example, VTE-based emulators (such as GNOME Terminal) rely on glib's
g_unichar_iswide(), see [1]. Alas I don't have any usage metrics, but the poll
at [2] suggests that VTE's usage share amongst terminal emulators on Linux
might be somewhere in the ballpark of 50%.

As for apps, if my memory is correct, I believe Vim uses its own built-in
database rather than wcwidth(). So does the Joe text editor [3] (okay, it's a
really marginal one), and presumably many more apps.

Not to mention all the other non-glibc based systems with their own wcwidth()
implementation that one might ssh to/from.

For apps inside terminal emulators to work correctly, it's crucial that all the
relevant components agree on the width. This caused quite a headache when
Unicode 9.0 changed the width of plenty of codepoints, see e.g. the bugreport
with animgif at [4] (and tons of duplicates in other bugzillas and
stackoverflow forums), but this is going to fade away as eventually everyone
upgrades their Unicode version.

You cannot, however, reasonably assume that other folks out there, i.e.
terminal emulators as well as applications that don't rely on wcwidth() but
some other data source, or those other data sources such as glib and probably
a whole lot more, are all going to apply your modifications. And then again we
haven't talked about ssh'ing to/from non-glibc systems.

If a certain glyph does not fit in its designated character cell, most
terminal emulators will overflow it to the next cell. A slight overflow happens
at way more codepoints than the ones debated now, e.g. in case of VTE and a not
too large font, even the antialiasing of English letters such as 'W', 'm'
overflows to the next cell. Of course I understand that the overflow in case of
U+3248 "㉈" and friends is way more prominent, potentially causing the given and
the subsequent glyph not to be readable at all, which is indeed bad. But
causing the entire canvas's contents to fall apart is even worse. And that's
what typically happens when players of the game disagree on the width, as seen
e.g. again at [4].

If you'd really like to see these particular codepoints becoming double wide
(which I'm also in favor of), I firmly believe this change should be made in
the Unicode database first, so that eventually everyone implementing a
wcwidth()-like method gets that update; rather than just glibc, resulting in a
long-term disagreement between parties and in turn inevitable corruption of the
entire terminal window in quite a few terminal emulators and apps.

[1] https://bugzilla.gnome.org/show_bug.cgi?id=772890
[2] https://opensource.com/life/15/11/top-open-source-terminal-emulators
[3] https://sourceforge.net/p/joe-editor/bugs/363/
[4] https://github.com/powerline/powerline/issues/1652
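The "canvas falls apart" effect described in this comment can be illustrated with a small sketch. The two width functions below are hypothetical stand-ins: one treats U+3248..U+324F as one column (as derived from the East Asian Width data), the other as two columns (as in the patch). When an application computes cursor positions with one table and the terminal advances with the other, every later character on the line lands one column off per such codepoint.

```python
def start_columns(text, width_of):
    """Return the start column of each character and the final column."""
    column, starts = 0, []
    for ch in text:
        starts.append(column)
        column += width_of(ch)
    return starts, column


def unicode_like(ch):
    # Illustrative table: everything in the sample is one column wide.
    return 1


def patched_like(ch):
    # Illustrative table: U+3248..U+324F forced to two columns.
    return 2 if 0x3248 <= ord(ch) <= 0x324F else 1


line = "a\u3248b|"
app_view = start_columns(line, unicode_like)
terminal_view = start_columns(line, patched_like)
# Characters after U+3248 start one column later in the second view.
drift = terminal_view[1] - app_view[1]
```

A full-screen application that redraws the "|" at the column it computed itself would overwrite or misplace it as soon as the terminal uses the other table.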
Hi Egmont,

only a short response because we have FrOSCon/FrogLabs preparations
and workshop until Monday:

We’re not strictly speaking deviating from UCD because UCD does *not* define
wcwidth.

Terminal emulators use wcwidth, especially xterm uses ONLY it *and* defines it.

Applications such as editors in the terminal (cf. jupp) use wcwidth or carry
their own data which is prepared the same way as wcwidth (often they use a copy
of xterm's code).

You speak of compatibility and breaking. Strictly speaking, the switch glibc
recently (two or three major releases ago, I think) made to regenerated data
*did* break applications, and this bugreport is 100% returning the glibc data
to the way it was before in the places the previous change introduced bugs,
while still keeping it up-to-date with recent Unicode.

So, therefore, with this patch applied, fewer things will break than without.

Outliers like libglib (used by only one of the multitude of terminal
emulators) can then import the data (and the mechanism used to generate it)
from here. Other systems use the old wcwidth code from xterm, with which this
one (with my patches applied) is compatible for all chars that did not get
changed in or added to Unicode, which is the maximum compatibility, and an
easily achievable goal that should be achieved.
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  bb6274ee1293a6bc76d9d7c889783303de181295 (commit)
       via  c14b84baae83bfb73f7cd00ba7c24964ad1c712c (commit)
       via  7a79e321c6f85b204036c33d85f6b2aa794e7c76 (commit)
       via  267ee5d7ab57591a6b1bc2d2a010c88188427063 (commit)
       via  41b6f0ce85d98c62739b04863e8c38a1f4154e80 (commit)
       via  580be3035d2e0f479c4ac955bf719b0bf936f5cf (commit)
      from  038d1cafafb3094a9fbebd35f4aa8d0ebae0e55b (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=bb6274ee1293a6bc76d9d7c889783303de181295

commit bb6274ee1293a6bc76d9d7c889783303de181295
Author: Akhilesh Kumar <akhilesh.k@samsung.com>
Date:   Wed Aug 16 15:33:58 2017 +0530

    Fix abmon for bem_ZM

    Until now the abbreviated month names were in English.

        [BZ #21960]
        * locales/bem_ZM (LC_TIME): Fix abmon, make it agree with CLDR.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c14b84baae83bfb73f7cd00ba7c24964ad1c712c

commit c14b84baae83bfb73f7cd00ba7c24964ad1c712c
Author: Akhilesh Kumar <akhilesh.k@samsung.com>
Date:   Wed Aug 16 18:01:53 2017 +0530

    Fix country name for xh_ZA

        [BZ #21959]
        * locales/xh_ZA (LC_ADDRESS): Fix country name.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7a79e321c6f85b204036c33d85f6b2aa794e7c76

commit 7a79e321c6f85b204036c33d85f6b2aa794e7c76
Author: Thorsten Glaser <tg@mirbsd.de>
Date:   Fri Jul 14 14:02:50 2017 +0200

    Refresh generated charmap data and ChangeLog

        [BZ #21750]
        * charmaps/UTF-8: Refresh.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=267ee5d7ab57591a6b1bc2d2a010c88188427063

commit 267ee5d7ab57591a6b1bc2d2a010c88188427063
Author: Thorsten Glaser <tg@mirbsd.de>
Date:   Fri Jul 14 14:02:46 2017 +0200

    Resolve some historically special cases of ambiguous width

        [BZ #21750]
        * unicode-gen/utf8_gen.py (U+00AD): Set width to 1.
        * unicode-gen/utf8_gen.py (U+1160..U+11FF): Set width to 0.
        * unicode-gen/utf8_gen.py (U+3248..U+324F): Set width to 2.
        * unicode-gen/utf8_gen.py (U+4DC0..U+4DFF): Likewise.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=41b6f0ce85d98c62739b04863e8c38a1f4154e80

commit 41b6f0ce85d98c62739b04863e8c38a1f4154e80
Author: Thorsten Glaser <tg@mirbsd.de>
Date:   Fri Jul 14 14:02:44 2017 +0200

    Handle more cases of combining characters

        [BZ #21750]
        * unicode-gen/utf8_gen.py: Treat category Me and Mn as combining.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=580be3035d2e0f479c4ac955bf719b0bf936f5cf

commit 580be3035d2e0f479c4ac955bf719b0bf936f5cf
Author: Thorsten Glaser <tg@mirbsd.de>
Date:   Fri Jul 14 14:02:37 2017 +0200

    UnicodeData has precedence over EastAsianWidth

        [BZ #19852]
        [BZ #21750]
        * unicode-gen/utf8_gen.py: Process EastAsianWidth lines before
        UnicodeData lines so the latter have precedence; remove hack
        to group output by EastAsianWidth ranges.

-----------------------------------------------------------------------

Summary of changes:
 localedata/ChangeLog               |    24 +
 localedata/charmaps/UTF-8          |111468 +++++++++++++++++++++++++++++++++++-
 localedata/locales/bem_ZM          |    25 +-
 localedata/locales/xh_ZA           |     5 +-
 localedata/unicode-gen/utf8_gen.py |    38 +-
 5 files changed, 111400 insertions(+), 160 deletions(-)
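The last commit message above describes an ordering rule: process EastAsianWidth lines first, then UnicodeData lines, so that the latter have precedence. A rough sketch of why the order matters follows. It is not the code from utf8_gen.py; build_widths is a hypothetical helper, and the two sample strings are abbreviated stand-ins for EastAsianWidth.txt and UnicodeData.txt (only the first five UnicodeData fields are shown).

```python
EAW_SAMPLE = """\
3041..3096;W
3099..309A;W
"""

UCD_SAMPLE = """\
3042;HIRAGANA LETTER A;Lo;0;L
3099;COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK;Mn;8;NSM
309A;COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK;Mn;8;NSM
"""


def build_widths(eaw_text, ucd_text):
    """Return {codepoint: width}; entries from the second pass win."""
    widths = {}
    # Pass 1: East Asian Width. Wide and Fullwidth ranges get width 2.
    for line in eaw_text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue
        span, prop = (field.strip() for field in data.split(";")[:2])
        first, _, last = span.partition("..")
        if prop in ("W", "F"):
            for cp in range(int(first, 16), int(last or first, 16) + 1):
                widths[cp] = 2
    # Pass 2: UnicodeData. Because it runs second, width 0 for combining
    # characters overwrites any width 2 stored by pass 1.
    for line in ucd_text.splitlines():
        fields = line.split(";")
        if len(fields) < 5:
            continue
        if fields[2] in ("Me", "Mn", "Cf") or fields[4] == "NSM":
            widths[int(fields[0], 16)] = 0
    return widths
```

With the passes swapped, the wide range 3099..309A would overwrite the combining entries again, which is the behaviour reported in BZ #19852.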
FIXED.
Thorsten Glaser does not have a copyright assignment for glibc on file; we cannot accept his contributions until this is sorted out.
(In reply to Thorsten Glaser from comment #8)
> We’re not strictly speaking deviating from UCD because UCD does *not* define
> wcwidth.

Well, it defines the East_Asian_Width property from which you derive wcwidth using a couple of generic rules plus a few exceptions to them. You've just (re?)added 3248..324F and a few other ranges to these exceptions, which in my eyes means that yes, you are deviating from Unicode.

> Terminal emulators use wcwidth, especially xterm uses ONLY it *and* defines
> it.
>
> Applications such as editors in the terminal (cf. jupp) use wcwidth or carry
> their own data which is prepared the same way as wcwidth (often they use a
> copy of xterm's code).

To be more precise, xterm and a few others copy Markus Kuhn's implementation. I don't think anyone copies from xterm.

That implementation defines the 3248..324F range as ambiguous (I've checked the most recent xterm-330 and a randomly chosen ~4 year old xterm-300 – a randomly picked even older xterm-260 is different, which suggests that xterm caught up with the changes long ago), which, by default, means it is 1 cell wide in xterm (unless -cjk_width is specified, in which case these, like all other ambiguous ones, are turned into double width)...

> You speak of compatibility and breaking. Strictly speaking, the switch glibc
> recently (two or three majors, I think) did to regenerated data *did* break
> applications, and this bugreport is 100% returning the glibc data to the way
> it was before in the places the previous change introduced bugs, while still
> keeping it up-to-date with recent Unicode.
>
> So, therefore, with this patch applied, less things will break than without.

... so I absolutely don't get why fewer things would be broken now. As far as I can see, with this patch you have just further broken the handling of these codepoints by deviating from Unicode and from xterm.
> Outlyers like libglib (used by only one of the multitude of terminal
> emulators) can then import the data (and mechanism used to generate) from
> here.

You don't seriously expect that two glibc maintainers decide over a chat to add a few exceptions to the generic rules, and that "outlyers" (like glib, maybe Qt, maybe Java, maybe some other "giant" pieces of (perhaps commercial) software, maybe the libc implementations of other Unices (like Mac), maybe a whole lot more) will follow; do you??

(And on a side note... IMHO submitting a change right after someone brings up some concerns, without even giving time for a reasonable discussion, isn't really a polite thing to do... Especially since it recently took me about 2 years and about 10-15 unanswered pings to get a well unit-tested locale change through, I can't understand why there is such a hurry now.)
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, master has been updated
       via  486afa6d27156665959e59b86e7aad18c3832cbe (commit)
      from  a3fe6a20bf81ef6a97a761dac9050517e7fd7a1f (commit)

Those revisions listed above that are new to this repository have not appeared on any other notification email; so we list those revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=486afa6d27156665959e59b86e7aad18c3832cbe

commit 486afa6d27156665959e59b86e7aad18c3832cbe
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Fri Aug 18 13:41:34 2017 +0200

    Use the range notation in charmaps/UTF-8 for all ranges of neighbouring characters with the same width

        [BZ #21750]
        * charmaps/UTF-8: Use the range notation for all ranges of
        neighbouring characters with the same width.

-----------------------------------------------------------------------

Summary of changes:
 localedata/ChangeLog      |     6 +
 localedata/charmaps/UTF-8 |113545 +--------------------------------------------
 2 files changed, 300 insertions(+), 113251 deletions(-)
this bug report has a lot of things in it. i think each request in the original post should be split out into separate reports. other than the overall discussion about keeping things in sync, it's impossible to follow the discussion about specific codepoints.

wrt U+00AD: https://www.cs.tut.fi/~jkorpela/shy.html
i've forked soft hyphen (U+00AD) into bug 22073 and Hangul Jamo into bug 22074. feel free to take follow ups for those topics to those respective bugs so the discussion can stay focused and not get cluttered up. i haven't looked into the other codepoints raised in the original comment, so if they aren't resolved, feel free to fork them out too.
(In reply to Mike Frysinger from comment #15)
> i've forked soft hyphen (U+00AD) into bug 22073 and Hangul Jamo into bug
> 22074. feel free to take follow ups for those topics to those respective
> bugs so the discussion can stay focused and not get cluttered up.
>
> i haven't looked into the other codepoints raised in the original comment,
> so if they aren't resolved, feel free to fork them out too.

For the code points

3248..324F;A     # No     [8] CIRCLED NUMBER TEN ON BLACK SQUARE..CIRCLED NUMBER EIGHTY ON BLACK SQUARE

I asked on the unicode mailing list:

http://www.unicode.org/mail-arch/unicode-ml/y2017-m08/0007.html

And the response makes me think that we are free to use wcwidth 2 for these in glibc if that fits our “context” best:

http://www.unicode.org/mail-arch/unicode-ml/y2017-m08/0023.html

> "A" means, you get to decide whether to treat these as "W" or "N" based on context.
>
> There's really not strong need to change an "A" towards "W", because
> "A" doesn't get in your way if you decided that "W" works better for
> you.
>
> Remember that all the EAW properties ares supposed to be "resolved"
> down to W or N. For some, like Na that resolution is deterministic,
> for A it is context/application dependent, but when you finally
> process your data, only W(ide) or N(arrow) remain after resolution.
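To make the quoted "resolution" idea concrete, here is a minimal sketch of how an East_Asian_Width value might be resolved to a cell count. The function name and the ambiguous_is_wide flag are hypothetical; the flag only stands in for whatever "context" an implementation chooses (comparable in spirit to xterm's -cjk_width mentioned earlier), and combining characters are deliberately left out.

```python
# Hypothetical helper, not part of glibc or xterm.

def resolve_eaw(eaw, ambiguous_is_wide=False):
    """Resolve an East_Asian_Width value to a column count of 1 or 2.

    eaw is one of 'W', 'F', 'A', 'N', 'Na', 'H' as found in
    EastAsianWidth.txt. Zero-width handling (combining characters)
    is a separate step and is not covered here.
    """
    if eaw in ('W', 'F'):
        # Wide and Fullwidth always resolve to wide.
        return 2
    if eaw == 'A':
        # Ambiguous: the application or context decides.
        return 2 if ambiguous_is_wide else 1
    # N, Na and H resolve to narrow.
    return 1
```

Under this sketch, forcing U+3248..U+324F to width 2 amounts to resolving that particular "A" range as wide while leaving the other ambiguous characters narrow.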
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, master has been updated
       via  2ae5be041d9ea89cdd0f37734d72051e8f773947 (commit)
       via  af83ed5c4647bda196fc1a7efebbe8019aa83f4a (commit)
      from  4f3647e46e3f645c6516faa299efc6e89d520d7b (commit)

Those revisions listed above that are new to this repository have not appeared on any other notification email; so we list those revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2ae5be041d9ea89cdd0f37734d72051e8f773947

commit 2ae5be041d9ea89cdd0f37734d72051e8f773947
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Wed Sep 6 11:19:33 2017 +0200

    Improve utf8_gen.py to set the width for characters with Prepended_Concatenation_Mark property to 1

        [BZ #22070]
        * localedata/unicode-gen/utf8_gen.py: Set the width for characters
        with Prepended_Concatenation_Mark property to 1
        * localedata/charmaps/UTF-8: Updated using the improved script.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=af83ed5c4647bda196fc1a7efebbe8019aa83f4a

commit af83ed5c4647bda196fc1a7efebbe8019aa83f4a
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Fri Aug 18 10:12:29 2017 +0200

    Write all ranges of neighbouring characters with the same width using the range notation in charmaps/UTF-8

    Writing ranges of neighbouring characters with the same with like this

    <U000E0100>...<U000E01EF> 0

    in charmaps/UTF-8 is more efficient than writing many single
    character lines like:

    <U000E0100> 0
    <U000E0101> 0
    ...

        [BZ #21750]
        * unicode-gen/utf8_gen.py: Write all ranges of neighbouring
        characters with the same width using the range notation in
        charmaps/UTF-8.

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                           |   14 +
 localedata/charmaps/UTF-8           |   10 +-
 localedata/unicode-gen/Makefile     |    4 +-
 localedata/unicode-gen/PropList.txt | 1618 +++++++++++++++++++++++++++++++++++
 localedata/unicode-gen/utf8_gen.py  |   84 ++-
 5 files changed, 1704 insertions(+), 26 deletions(-)
 create mode 100644 localedata/unicode-gen/PropList.txt
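The range notation from commit af83ed5c could be produced along these lines. This is only a sketch and not the code in utf8_gen.py: the function names are hypothetical, the `<Uxxxx>` symbol format is inferred from the examples in the commit message (four hex digits are assumed for the BMP, eight above it), and a single space is used as the separator as in the quoted example.

```python
# Sketch only; names are hypothetical and the separator is an assumption.

def ucs_symbol(cp):
    """Format a codepoint the way the examples in charmaps/UTF-8 look."""
    if cp < 0x10000:
        return '<U{:04X}>'.format(cp)
    return '<U{:08X}>'.format(cp)


def format_run(first, last, width):
    """Format one run of neighbouring codepoints sharing a width."""
    if first == last:
        return '{} {}'.format(ucs_symbol(first), width)
    return '{}...{} {}'.format(ucs_symbol(first), ucs_symbol(last), width)


def format_width_lines(widths):
    """Collapse {codepoint: width} into WIDTH lines using range notation."""
    lines = []
    run_first = run_last = run_width = None
    for cp in sorted(widths):
        width = widths[cp]
        # Extend the current run if cp is adjacent and has the same width.
        if run_first is not None and cp == run_last + 1 and width == run_width:
            run_last = cp
            continue
        # Otherwise flush the finished run and start a new one.
        if run_first is not None:
            lines.append(format_run(run_first, run_last, run_width))
        run_first = run_last = cp
        run_width = width
    if run_first is not None:
        lines.append(format_run(run_first, run_last, run_width))
    return lines
```

For example, passing the 240 codepoints U+E0100..U+E01EF, all with width 0, yields the single line `<U000E0100>...<U000E01EF> 0` instead of 240 separate lines.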
(In reply to Mike Frysinger from comment #15)
> i've forked soft hyphen (U+00AD) into bug 22073 and Hangul Jamo into bug
> 22074. feel free to take follow ups for those topics to those respective
> bugs so the discussion can stay focused and not get cluttered up.
>
> i haven't looked into the other codepoints raised in the original comment,
> so if they aren't resolved, feel free to fork them out too.

I think there is nothing more to do in this bug, so I am closing it as FIXED.

(Copyright assignment by Thorsten Glaser is underway.)
I submitted it on Wed, 6 Sep 2017 15:15:38 +0000 (UTC)