This is the mail archive of the
libc-locales@sourceware.org
mailing list for the GNU libc locales project.
[Bug localedata/21750] column width of characters incompatible with classical wcwidth
- From: "tjk at tksoft dot com" <sourceware-bugzilla at sourceware dot org>
- To: libc-locales at sourceware dot org
- Date: Wed, 12 Jul 2017 11:01:43 +0000
- Subject: [Bug localedata/21750] column width of characters incompatible with classical wcwidth
- Auto-submitted: auto-generated
- References: <bug-21750-716@http.sourceware.org/bugzilla/>
https://sourceware.org/bugzilla/show_bug.cgi?id=21750
--- Comment #1 from Troy Korjuslommi <tjk at tksoft dot com> ---
Excuse my ignorance, but isn't U+00AD (soft hyphen) usually invisible,
i.e. zero columns? If an app breaks up words at end-of-line, it can use
the soft hyphens as helpers to detect the correct locations. The app can
then add a visible hyphen to the end of the line. (If the app also reads
from the terminal, it can, for example, ignore visible hyphens that are
preceded by a soft hyphen, or use some other mechanism to mark the
character as being for terminal display only.)
I am not suggesting a change if xterm and the multitude of other apps
already handle soft hyphens in some other manner; I am just wondering.
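
The line-breaking idea described above could look roughly like the following Python sketch. The helper name split_at_soft_hyphen is hypothetical, and the sketch assumes every character other than U+00AD occupies exactly one column, which is not true for wide or combining characters.

```python
from typing import Tuple

SOFT_HYPHEN = "\u00ad"


def split_at_soft_hyphen(word: str, room: int) -> Tuple[str, str]:
    """Split `word` at the last soft hyphen that still fits in `room` columns.

    Soft hyphens count as zero columns here; when a split happens, a
    visible "-" is appended to the head (and counted as one column).
    Returns (head, tail); head is "" if no break opportunity fits.
    Assumption: every other character is exactly one column wide.
    """
    parts = word.split(SOFT_HYPHEN)
    split_index = 0
    for i in range(1, len(parts)):
        visible = "".join(parts[:i])
        if len(visible) + 1 <= room:  # +1 for the visible hyphen
            split_index = i
    if split_index == 0:
        return "", word
    head = "".join(parts[:split_index]) + "-"
    # Keep the remaining soft hyphens so later lines can break again.
    tail = SOFT_HYPHEN.join(parts[split_index:])
    return head, tail
```

A caller would emit head on the current line and carry tail over to the next one; the remaining soft hyphens in tail stay invisible unless another break is needed.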
Troy
On Tue, 2017-07-11 at 14:18 +0000, tg at mirbsd dot de wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=21750
>
> Bug ID: 21750
> Summary: column width of characters incompatible with classical
> wcwidth
> Product: glibc
> Version: 2.26
> Status: UNCONFIRMED
> Severity: normal
> Priority: P2
> Component: localedata
> Assignee: unassigned at sourceware dot org
> Reporter: tg at mirbsd dot de
> CC: libc-locales at sourceware dot org
> Target Milestone: ---
>
> I’ve compared the new autogenerated column width from
> localedata/unicode-gen/utf8_gen.py with the results of the classical wcwidth()
> implementation from xterm (adjusted to Unicode 10.0.0) and found a few
> divergences. I also found bugs on my side (MirBSD, which uses something based
> on xterm’s data system-wide), which I have fixed.
>
> 1. U+00AD is forced to width 1 in xterm, autodetected as combining in glibc
>
> Rationale for forcing it to 1 is likely that U+0000‥U+00FF are latin1, which,
> when displayed as 8bit on terminals, had no combining characters at all.
>
> Change Request to glibc: force U+00AD to width 1.
>
> 2. The UCD has three codepoints that are Me/Mn category but not NSM bidi class:
> U+0CBF U+0CC6 U+11C3F
>
> This is likely a bug in UCD but can be fixed by glibc treating Me/Mn the same
> as Cf/NSM, which I do.
>
> Change Request to glibc: handle Me/Mn category the same as NSM bidi class.
>
> 3. Hangul Jamo medial vowels and final consonants are set to 0 by xterm so they
> combine on top of the preceding initial ones: U+1160‥U+11FF
>
> Change Request to glibc: force U+1160‥U+11FF to width 0.
>
> 4. During parsing, EastAsianWidth data overrides UCD data, more specifically
> the NSM property.
>
> This leads to U+302A‥U+302D and – see also
> https://sourceware.org/bugzilla/show_bug.cgi?id=19852 – U+3099 and U+309A being
> treated as width 2.
>
> Change Request to glibc: read EAW before UCD so the NSM overrides EAW here.
>
> 5. Ambiguous circled numbers and neutral hexagrams changed width
>
> xterm used to set those to width 2, likely because they are ideographs and not
> unlike zodiac signs and emoji (which, I notice, are set to width 2 in the UCD
> nowadays).
>
> Change Request to glibc: force U+3248‥U+324F and U+4DC0‥U+4DFF to width 2.
>
>
> Note: I initially reported the surprising change to Debian as
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=826256 but redid the
> research today against Unicode 10, using 2.24 in Debian and git master commit
> 2a91300176a5991d9825eba085e502196a3f47cd in glibc. I double-checked *all*
> differences against the MirBSD code and fixed a few bugs there after making it
> possible to compare the results (considering that glibc only puts actually
> assigned codepoints into the localedata/charmaps/UTF-8 file).
>
> The rationale for requesting the change in glibc is that all systems I have
> access to would then use the same width data. This prevents display artifacts
> and glitches, which can go as far as making an editor somewhat unusable with
> Unicode-heavy text (I have test files containing the entire Unicode range).
> Thank you for listening.
>
> If necessary, I will provide patches (to utf8_gen.py most likely) when asked.
>
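
The five change requests in the report above can be read as an ordering of checks. The following Python sketch illustrates that reading only; it is not the code in utf8_gen.py. The function name column_width and its parameters are hypothetical: the caller is assumed to supply the General_Category, bidi class, and East_Asian_Width values already parsed from the UCD files. Treating Cf and NSM as width 0 is an interpretation of the phrase "the same as Cf/NSM", and control characters and unassigned codepoints are not handled.

```python
def column_width(cp: int, category: str, bidi: str, east_asian_width: str) -> int:
    """Return a column width for codepoint `cp` (hypothetical helper).

    category:          General_Category, e.g. "Mn", "Cf", "Lo"
    bidi:              bidi class, e.g. "NSM", "L"
    east_asian_width:  East_Asian_Width, e.g. "W", "F", "A", "N"
    """
    # Request 1: soft hyphen is forced to width 1, as in xterm.
    if cp == 0x00AD:
        return 1
    # Request 3: Hangul Jamo medial vowels and final consonants combine
    # with the preceding initial consonant.
    if 0x1160 <= cp <= 0x11FF:
        return 0
    # Request 5: circled numbers and hexagrams are forced to width 2.
    if 0x3248 <= cp <= 0x324F or 0x4DC0 <= cp <= 0x4DFF:
        return 2
    # Requests 2 and 4: non-spacing status is decided before looking at
    # East_Asian_Width, and Me/Mn count even when the bidi class is not NSM.
    if category in ("Me", "Mn", "Cf") or bidi == "NSM":
        return 0
    # Only now does East_Asian_Width apply: wide and fullwidth take 2 columns.
    if east_asian_width in ("W", "F"):
        return 2
    return 1
```

Because the non-spacing check runs before the East_Asian_Width check, a codepoint such as U+3099 (category Mn, East_Asian_Width W) comes out as width 0 rather than 2, which is the outcome request 4 asks for.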