+++ This bug was initially created as a clone of Bug #21750 +++

I’ve compared the new autogenerated column width from localedata/unicode-gen/utf8_gen.py with the results of the classical wcwidth() implementation from xterm (adjusted to Unicode 10.0.0) and found a few divergences (and bugs on my (MirBSD, which uses something based on xterm’s data system-wide) side, which I fixed).

Hangul Jamo medial vowels and final consonants are set to 0 by xterm so they combine on top of the preceding initial ones: U+1160‥U+11FF

Change Request to glibc: force U+1160‥U+11FF to width 0.

Markus Kuhn's implementation of wcwidth does this explicitly:
https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
/* generated by "uniset +cat=Me +cat=Mn +cat=Cf -00AD +1160-11FF +200B c" */

Unicode 10.0.0 chapter 18 section 6 page 713 [1] states:

The Hangul Jamo block contains the most frequently used conjoining jamo. These include all of the jamo used in modern Hangul syllable blocks, as well as many of the jamo for Old Korean. The Hangul jamo are divided into three classes: choseong (leading consonants, or syllable-initial characters), jungseong (vowels, or syllable-peak characters), and jongseong (trailing consonants, or syllable-final characters). Each class may, in turn, consist of one to three subunits. For example, a choseong syllable-initial character may either represent a single consonant sound, or a consonant cluster consisting of two or three consonant sounds. Likewise, a jungseong syllable-peak character may represent a simple vowel sound, or a complex diphthong or triphthong with onglide or offglide sounds. Each of these complex sequences of two or three sounds is encoded as a single conjoining jamo character. Therefore, a complete Hangul syllable can always be conceived of as a single choseong followed by a single jungseong and (optionally) a single jongseong.

This block also contains two invisible filler characters which act as placeholders for a missing choseong or jungseong in an incomplete syllable. These filler characters are U+115F hangul choseong filler and U+1160 hangul jungseong filler.

[1] http://www.unicode.org/versions/Unicode10.0.0/ch18.pdf

a concrete example, these three codepoints (one from each class):
ᅔ U+1154 Hangul Choseong Chitueumchieuch
ᅦ U+1166 Hangul Jungseong E
ᇯ U+11ef Hangul Jongseong Ieung-Khieukh

will form a single grapheme that should be wcwidth of 1:
ᅔᅦᇯ

the tricky part is that Jungseong & Jongseong conjoin only when they follow a Choseong. so if you had a space U+0020 between each of those codepoints, you'd end up with wcwidth of 5.
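To make the three jamo classes concrete, here is a minimal Python sketch that classifies code points in the Hangul Jamo block by range and checks whether a string has the choseong + jungseong (+ optional jongseong) shape quoted above. The names `jamo_class` and `is_single_syllable` are hypothetical helpers, not part of glibc or xterm, and the sketch only covers U+1100‥U+11FF (not the Hangul Jamo Extended blocks).

```python
# Code point ranges of the three conjoining-jamo classes in the Hangul Jamo block.
CHOSEONG = range(0x1100, 0x1160)   # leading consonants, incl. the U+115F filler
JUNGSEONG = range(0x1160, 0x11A8)  # vowels, incl. the U+1160 filler
JONGSEONG = range(0x11A8, 0x1200)  # trailing consonants


def jamo_class(ch):
    """Return 'L', 'V' or 'T' for a conjoining jamo, or None for anything else."""
    cp = ord(ch)
    if cp in CHOSEONG:
        return "L"
    if cp in JUNGSEONG:
        return "V"
    if cp in JONGSEONG:
        return "T"
    return None


def is_single_syllable(text):
    """True if text is exactly choseong + jungseong + optional jongseong."""
    classes = [jamo_class(ch) for ch in text]
    return classes in (["L", "V"], ["L", "V", "T"])


example = "\u1154\u1166\u11ef"
print([jamo_class(ch) for ch in example])          # ['L', 'V', 'T']
print(is_single_syllable(example))                 # True
print(is_single_syllable("\u1154 \u1166 \u11ef"))  # False: spaces break the sequence
```

The last call is the "tricky part" from the report: once anything separates the jamo, they no longer form one syllable, but a per-codepoint wcwidth() cannot see that.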
(In reply to Mike Frysinger from comment #0)
> the tricky part is that Jungseong & Jongseong conjoin only when they follow
> a Choseong. so if you had a space U+0020 between each of those codepoints,
> you'd end up with wcwidth of 5.

But we cannot really do anything about the context, so we have to decide for one width to use.
Aren't Korean chars usually full width? I.e. wcwidth 2.

Troy

On Sun, 2017-09-03 at 21:01 +0000, vapier at gentoo dot org wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=22074
>
> Bug ID: 22074
> Summary: charmaps/UTF-8: wcwidth for U+1160-U+11FF (Hangul Jungseong and Jongseong) should be 0
> Product: glibc
> Version: 2.26
> Status: NEW
> Severity: normal
> Priority: P2
> Component: localedata
> Assignee: unassigned at sourceware dot org
> Reporter: vapier at gentoo dot org
> CC: egmont at gmail dot com, libc-locales at sourceware dot org, maiku.fabian at gmail dot com, tg at mirbsd dot de
> Depends on: 21750
> Target Milestone: ---
>
> Referenced Bugs:
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=21750
> [Bug 21750] column width of characters incompatible with classical wcwidth
(In reply to Troy Korjuslommi from comment #2)
> Aren't Korean chars usually full width? I.e. wcwidth 2.
>
> Troy

Yes, but the characters we are discussing here are *parts* of Korean characters, not *whole* Korean characters.

See Mike Frysinger’s example in comment #0.
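The parts-versus-whole distinction can be illustrated with a short Python sketch. It composes three modern conjoining jamo into one precomposed syllable using the arithmetic from the Unicode Standard's Hangul composition algorithm and cross-checks the result against NFC normalization. `compose_syllable` is a hypothetical helper that does no range validation, so it is only meaningful for modern jamo (choseong U+1100‥U+1112, jungseong U+1161‥U+1175, jongseong U+11A8‥U+11C2).

```python
import unicodedata

# Constants from the Unicode Hangul syllable composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28


def compose_syllable(lead, vowel, trail=None):
    """Arithmetically compose modern jamo into one precomposed syllable.

    No range checks: callers must pass modern conjoining jamo only.
    """
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(trail) - T_BASE if trail else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


parts = "\u1112\u1161\u11ab"   # HIEUH + A + NIEUN: three parts
whole = compose_syllable(*parts)
print(f"U+{ord(whole):04X}")   # U+D55C: one whole syllable
print(whole == unicodedata.normalize("NFC", parts))  # True

# The Old Korean example from comment #0 has no precomposed form,
# so NFC leaves it as three separate code points.
old = "\u1154\u1166\u11ef"
print(len(unicodedata.normalize("NFC", old)))  # 3
```

For sequences like the Old Korean one, only the conjoining jamo exist, which is why their individual widths matter.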
(In reply to Mike FABIAN from comment #1)
> But we cannot really do anything about the context, so we have to decide
> for one width to use.

today that is true. long term, i think we'll need to figure out something better similar to UAX29. it wouldn't help with wcwidth, but it would for wcswidth.
http://www.unicode.org/reports/tr29/

so i think for today, supporting the common case and not the degenerate/invalid cases makes sense. which is to say, marking them as wcwidth of 0.

(In reply to Troy Korjuslommi from comment #2)
> Aren't Korean chars usually full width? I.e. wcwidth 2.

the precomposed hangul forms are wcwidth of 2:
https://en.wikipedia.org/wiki/Hangul_Syllables

so i guess not only would we want to change the two sets to 0, we'd want to change the first set to 2. that way we'd line up with the precomposed forms better.

i think that should get us closer, but it'd still be an approximation. maybe we should just focus on wcswidth here?
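A rough Python sketch of the rule proposed here, assuming width 0 for U+1160‥U+11FF and width 2 for anything whose East Asian Width property is W or F (which, in the Unicode data shipped with Python, includes the choseong and the precomposed syllables). `approx_wcwidth` and `approx_wcswidth` are hypothetical names; this is not glibc's implementation, and it ignores combining marks and control characters entirely.

```python
import unicodedata


def approx_wcwidth(ch):
    """Per-code-point column width under the rule discussed in this thread."""
    cp = ord(ch)
    if 0x1160 <= cp <= 0x11FF:
        return 0  # jungseong/jongseong: assumed to conjoin with a preceding choseong
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide/fullwidth: choseong, precomposed syllables, ...
    return 1      # everything else (combining marks and controls not handled)


def approx_wcswidth(text):
    """Context-free sum of per-code-point widths."""
    return sum(approx_wcwidth(ch) for ch in text)


print(approx_wcswidth("\u1112\u1161\u11ab"))    # 2: conjoining sequence
print(approx_wcswidth("\ud55c"))                # 2: same syllable, precomposed
print(approx_wcswidth("\u1154 \u1166 \u11ef"))  # 4: degenerate case with spaces
```

The first two results line up with each other, which is the common case this rule targets. The third shows the approximation: the isolated jungseong and jongseong still count as 0 because the sum never looks at context, which is the kind of thing a UAX29-aware wcswidth could address.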
(In reply to Mike Frysinger from comment #4)
> (In reply to Mike FABIAN from comment #1)
> (In reply to Troy Korjuslommi from comment #2)
> > Aren't Korean chars usually full width? I.e. wcwidth 2.
>
> the precomposed hangul forms are wcwidth of 2:
> https://en.wikipedia.org/wiki/Hangul_Syllables
>
> so i guess not only would we want to change the two sets to 0, we'd want to
> change the first set to 2. that way we'd line up with the precomposed forms
> better.

I think we already set the precomposed hangul to width 2

because of this line

AC00..D7A3;W # Lo [11172] HANGUL SYLLABLE GA..HANGUL SYLLABLE HIH

in EastAsianWidth.txt we have

<UAC00>...<UD7A3> 2

in charmaps/UTF-8 in the WIDTH section.
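The mapping from an EastAsianWidth.txt line to a WIDTH entry can be sketched in Python as follows. This is an illustration only, not the code in utf8_gen.py: `width_entry` and `ucs_symbol` are hypothetical helpers here, the handling is limited to W and F entries, and the 8-digit `<U........>` form for code points above U+FFFF is an assumption about the charmap notation.

```python
import re

# Matches "XXXX;W" or "XXXX..YYYY;W", tolerating spaces around the semicolon.
EAW_LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def ucs_symbol(cp):
    """Charmap-style symbol; the 8-digit form above U+FFFF is an assumption."""
    return f"<U{cp:04X}>" if cp <= 0xFFFF else f"<U{cp:08X}>"


def width_entry(line):
    """Return a WIDTH-section entry for a wide/fullwidth line, else None."""
    m = EAW_LINE.match(line)
    if not m or m.group(3) not in ("W", "F"):
        return None
    first = int(m.group(1), 16)
    last = int(m.group(2) or m.group(1), 16)
    if first == last:
        return f"{ucs_symbol(first)} 2"
    return f"{ucs_symbol(first)}...{ucs_symbol(last)} 2"


line = "AC00..D7A3;W # Lo [11172] HANGUL SYLLABLE GA..HANGUL SYLLABLE HIH"
print(width_entry(line))            # <UAC00>...<UD7A3> 2
print(width_entry("1160..11FF;N"))  # None: not wide, so no width-2 entry
```

The second call shows why the jungseong and jongseong need separate treatment: EastAsianWidth.txt alone does not make them wide, and it says nothing about width 0.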
So this bug is fixed, isn’t it? Because in glibc master in charmaps/UTF-8, we have the precomposed hangul with width 2 and the hangul jamo with width 0.
I was referring to the "should be wcwidth of 1" comment, which doesn't seem to be correct. My point was that the wcwidth should be either 0 or 2.

Troy

On Mon, 2017-09-04 at 08:54 +0000, maiku.fabian at gmail dot com wrote:
> --- Comment #3 from Mike FABIAN <maiku.fabian at gmail dot com> ---
> Yes, but the characters we are discussing here are *parts*
> of Korean characters, not *whole* Korean characters.
>
> See Mike Frysinger’s example in comment#0.
I think this is fixed: https://sourceware.org/bugzilla/show_bug.cgi?id=22074#c6