[Bug localedata/22074] New: charmaps/UTF-8: wcwidth for U+1160-U+11FF (Hangul Jungseong and Jongseong) should be 0

Troy Korjuslommi tjk@tksoft.com
Mon Sep 4 08:10:00 GMT 2017


Aren't Korean characters usually full width, i.e. wcwidth 2?

Troy



On Sun, 2017-09-03 at 21:01 +0000, vapier at gentoo dot org wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=22074
> 
>             Bug ID: 22074
>            Summary: charmaps/UTF-8: wcwidth for U+1160-U+11FF (Hangul
>                     Jungseong and Jongseong) should be 0
>            Product: glibc
>            Version: 2.26
>             Status: NEW
>           Severity: normal
>           Priority: P2
>          Component: localedata
>           Assignee: unassigned at sourceware dot org
>           Reporter: vapier at gentoo dot org
>                 CC: egmont at gmail dot com, libc-locales at sourceware dot org,
>                     maiku.fabian at gmail dot com, tg at mirbsd dot de
>         Depends on: 21750
>   Target Milestone: ---
> 
> +++ This bug was initially created as a clone of Bug #21750 +++
> 
> I’ve compared the new autogenerated column widths from
> localedata/unicode-gen/utf8_gen.py with the results of the classical wcwidth()
> implementation from xterm (adjusted to Unicode 10.0.0) and found a few
> divergences. I also found bugs on my side (MirBSD, which uses something based
> on xterm’s data system-wide), which I fixed.
> 
> Hangul Jamo medial vowels and final consonants are set to 0 by xterm so they
> combine on top of the preceding initial ones: U+1160‥U+11FF
> 
> Change Request to glibc: force U+1160‥U+11FF to width 0.
> 
> Markus Kuhn's implementation of wcwidth does this explicitly:
>   https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
>   /* generated by "uniset +cat=Me +cat=Mn +cat=Cf -00AD +1160-11FF +200B c" */
> 
> Unicode 10.0.0 chapter 18 section 6 page 713 [1] states:
> The Hangul Jamo block contains the most frequently used conjoining jamo. These
> include all of the jamo used in modern Hangul syllable blocks, as well as many
> of the jamo for Old Korean. The Hangul jamo are divided into three classes:
> choseong (leading consonants, or syllable-initial characters), jungseong
> (vowels, or syllable-peak characters), and jongseong (trailing consonants, or
> syllable-final characters). Each class may, in turn, consist of one to three
> subunits. For example, a choseong syllable-initial character may either
> represent a single consonant sound, or a consonant cluster consisting of two or
> three consonant sounds. Likewise, a jungseong syllable-peak character may
> represent a simple vowel sound, or a complex diphthong or triphthong with
> onglide or offglide sounds. Each of these complex sequences of two or three
> sounds is encoded as a single conjoining jamo character. Therefore, a complete
> Hangul syllable can always be conceived of as a single choseong followed by a
> single jungseong and (optionally) a single jongseong. This block also contains
> two invisible filler characters which act as placeholders for a missing
> choseong or jungseong in an incomplete syllable. These filler characters are
> U+115F hangul choseong filler and U+1160 hangul jungseong filler.
> 
> [1] http://www.unicode.org/versions/Unicode10.0.0/ch18.pdf
> 
> a concrete example, these three codepoints (one from each class):
>   ᅔ U+1154 Hangul Choseong Chitueumchieuch
>   ᅦ U+1166 Hangul Jungseong E
>   ᇯ U+11ef Hangul Jongseong Ieung-Khieukh
> 
> will form a single grapheme that should be wcwidth of 1:
>   ᅔᅦᇯ
> 
> the tricky part is that Jungseong & Jongseong conjoin only when they follow a
> Choseong.  so if you had a space U+0020 between each of those codepoints, you'd
> end up with wcwidth of 5.
> 
> 
> Referenced Bugs:
> 
> https://sourceware.org/bugzilla/show_bug.cgi?id=21750
> [Bug 21750] column width of characters incompatible with classical wcwidth
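
To make the width question concrete, here is a rough Python sketch of a
Kuhn-style width rule with the requested U+1160-U+11FF override. It is not
glibc's or xterm's code. The names sketch_wcwidth, sketch_wcswidth and the
jamo_zero switch are hypothetical, the rule set is only what the uniset
comment quoted above lists (Me, Mn, Cf, minus U+00AD, plus U+1160-U+11FF),
and the wide/narrow decision is assumed to come from the East Asian Width
property in Python's bundled Unicode data. Control characters are not
handled.

```python
import unicodedata


def sketch_wcwidth(ch, jamo_zero=True):
    """Approximate column width of one character (illustrative only)."""
    cp = ord(ch)
    # Requested change: medial vowels and final consonants take no column.
    if jamo_zero and 0x1160 <= cp <= 0x11FF:
        return 0
    # The quoted uniset comment excludes SOFT HYPHEN from the zero-width set.
    if cp == 0x00AD:
        return 1
    # Enclosing marks, nonspacing marks and format characters: width 0.
    if unicodedata.category(ch) in ("Me", "Mn", "Cf"):
        return 0
    # Assumption: East Asian Wide/Fullwidth characters take two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def sketch_wcswidth(text, jamo_zero=True):
    """Sum of per-character widths; no grapheme-cluster logic at all."""
    return sum(sketch_wcwidth(ch, jamo_zero) for ch in text)


# The three code points from the report: choseong, jungseong, jongseong.
syllable = "\u1154\u1166\u11ef"

for ch in syllable:
    print(f"U+{ord(ch):04X}", unicodedata.east_asian_width(ch),
          sketch_wcwidth(ch), sketch_wcwidth(ch, jamo_zero=False))

print("with override:   ", sketch_wcswidth(syllable))
print("without override:", sketch_wcswidth(syllable, jamo_zero=False))
print("space-separated: ", sketch_wcswidth(" ".join(syllable)))
```

With Python's bundled data, U+1154 is classified East Asian Wide, so under
these assumptions the leading consonant alone already accounts for two
columns, which bears on the full-width question at the top of this message.
Because the sketch is a plain per-character sum, it cannot tell whether a
jungseong or jongseong actually follows a choseong, which is the tricky case
the report describes.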




