[Bug localedata/22074] New: charmaps/UTF-8: wcwidth for U+1160-U+11FF (Hangul Jungseong and Jongseong) should be 0
Troy Korjuslommi
tjk@tksoft.com
Mon Sep 4 08:10:00 GMT 2017
Aren't Korean chars usually full width? I.e. wcwidth 2.
Troy
On Sun, 2017-09-03 at 21:01 +0000, vapier at gentoo dot org wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=22074
>
> Bug ID: 22074
> Summary: charmaps/UTF-8: wcwidth for U+1160-U+11FF (Hangul
> Jungseong and Jongseong) should be 0
> Product: glibc
> Version: 2.26
> Status: NEW
> Severity: normal
> Priority: P2
> Component: localedata
> Assignee: unassigned at sourceware dot org
> Reporter: vapier at gentoo dot org
> CC: egmont at gmail dot com, libc-locales at sourceware dot org,
> maiku.fabian at gmail dot com, tg at mirbsd dot de
> Depends on: 21750
> Target Milestone: ---
>
> +++ This bug was initially created as a clone of Bug #21750 +++
>
> I've compared the new autogenerated column width from
> localedata/unicode-gen/utf8_gen.py with the results of the classical wcwidth()
> implementation from xterm (adjusted to Unicode 10.0.0) and found a few
> divergences (and bugs on my (MirBSD, which uses something based on xterm's data
> system-wide) side, which I fixed).
>
> Hangul Jamo medial vowels and final consonants are set to 0 by xterm so they
> combine on top of the preceding initial ones: U+1160..U+11FF
>
> Change Request to glibc: force U+1160..U+11FF to width 0.
>
> Markus Kuhn's implementation of wcwidth does this explicitly:
> https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
> /* generated by "uniset +cat=Me +cat=Mn +cat=Cf -00AD +1160-11FF +200B c" */
>
> Unicode 10.0.0 chapter 18 section 6 page 713 [1] states:
> The Hangul Jamo block contains the most frequently used conjoining jamo. These
> include all of the jamo used in modern Hangul syllable blocks, as well as many
> of the jamo for Old Korean. The Hangul jamo are divided into three classes:
> choseong (leading consonants, or syllable-initial characters), jungseong
> (vowels, or syllable-peak characters), and jongseong (trailing consonants, or
> syllable-final characters). Each class may, in turn, consist of one to three
> subunits. For example, a choseong syllable-initial character may either
> represent a single consonant sound, or a consonant cluster consisting of two or
> three consonant sounds. Likewise, a jungseong syllable-peak character may
> represent a simple vowel sound, or a complex diphthong or triphthong with
> onglide or offglide sounds. Each of these complex sequences of two or three
> sounds is encoded as a single conjoining jamo character. Therefore, a complete
> Hangul syllable can always be conceived of as a single choseong followed by a
> single jungseong and (optionally) a single jongseong. This block also contains
> two invisible filler characters which act as placeholders for a missing
> choseong or jungseong in an incomplete syllable. These filler characters are
> U+115F hangul choseong filler and U+1160 hangul jungseong filler.
>
> [1] http://www.unicode.org/versions/Unicode10.0.0/ch18.pdf
>
> a concrete example, these three codepoints (one from each class):
> ᅔ U+1154 Hangul Choseong Chitueumchieuch
> ᅦ U+1166 Hangul Jungseong E
> ᇯ U+11ef Hangul Jongseong Ieung-Khieukh
>
> will form a single grapheme that should be wcwidth of 1:
> ᅔᅦᇯ
>
> the tricky part is that Jungseong & Jongseong conjoin only when they follow a
> Choseong. so if you had a space U+0020 between each of those codepoints, you'd
> end up with wcwidth of 5.
>
>
> Referenced Bugs:
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=21750
> [Bug 21750] column width of characters incompatible with classical wcwidth
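
For illustration only, here is a minimal Python sketch of the rule requested in the quoted report: code points U+1160..U+11FF are forced to width 0, and everything else falls back to general category and East Asian Width. The names sketch_wcwidth and sketch_wcswidth are hypothetical and are not part of glibc, xterm, or Markus Kuhn's wcwidth.c. The fallback relies on Python's unicodedata module, so results for other characters depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def sketch_wcwidth(ch: str) -> int:
    """Hypothetical per-code-point column width (not the glibc implementation)."""
    cp = ord(ch)
    if cp == 0:
        return 0
    # C0/C1 control characters have no defined width.
    if cp < 0x20 or 0x7F <= cp < 0xA0:
        return -1
    # Rule requested in the bug report: conjoining medial vowels (jungseong)
    # and final consonants (jongseong) take no column of their own.
    if 0x1160 <= cp <= 0x11FF:
        return 0
    # Mirrors the uniset comment quoted above: Me, Mn, Cf minus U+00AD, plus U+200B.
    if cp == 0x200B:
        return 0
    if cp != 0x00AD and unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # Assumption: Wide and Fullwidth characters occupy two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def sketch_wcswidth(text: str) -> int:
    """Sum of per-code-point widths, or -1 if any code point has no width."""
    total = 0
    for ch in text:
        width = sketch_wcwidth(ch)
        if width < 0:
            return -1
        total += width
    return total


if __name__ == "__main__":
    syllable = "\u1154\u1166\u11ef"  # choseong + jungseong + jongseong
    print(sketch_wcswidth(syllable))
    print(sketch_wcswidth(" ".join(syllable)))
```

Under these assumptions the medial vowel and final consonant contribute nothing, so the width of the conjoined sequence is determined by the leading consonant alone. In the East Asian Width data shipped with Python, U+1154 is classified as Wide, so this sketch reports the sequence as two columns.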