This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.


[Bug localedata/22074] New: charmaps/UTF-8: wcwidth for U+1160-U+11FF (Hangul Jungseong and Jongseong) should be 0


https://sourceware.org/bugzilla/show_bug.cgi?id=22074

            Bug ID: 22074
           Summary: charmaps/UTF-8: wcwidth for U+1160-U+11FF (Hangul
                    Jungseong and Jongseong) should be 0
           Product: glibc
           Version: 2.26
            Status: NEW
          Severity: normal
          Priority: P2
         Component: localedata
          Assignee: unassigned at sourceware dot org
          Reporter: vapier at gentoo dot org
                CC: egmont at gmail dot com, libc-locales at sourceware dot org,
                    maiku.fabian at gmail dot com, tg at mirbsd dot de
        Depends on: 21750
  Target Milestone: ---

+++ This bug was initially created as a clone of Bug #21750 +++

I’ve compared the new autogenerated column widths from
localedata/unicode-gen/utf8_gen.py with the results of the classical wcwidth()
implementation from xterm (adjusted to Unicode 10.0.0) and found a few
divergences. I also found bugs on my side, which I fixed; MirBSD uses
something based on xterm’s data system-wide.

xterm assigns width 0 to the Hangul Jamo medial vowels and final consonants
(U+1160‥U+11FF) so that they combine with the preceding initial consonant.

Change Request to glibc: force U+1160‥U+11FF to width 0.
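
A minimal sketch of what such an override could look like, written in Python
because the generator is a Python script. The function name
force_jamo_zero_width and the dict-based width table are assumptions made for
illustration; they are not taken from utf8_gen.py.

```python
# Hypothetical post-processing step; these names are illustrative and do
# not come from localedata/unicode-gen/utf8_gen.py.
JAMO_ZERO_WIDTH_FIRST = 0x1160  # HANGUL JUNGSEONG FILLER
JAMO_ZERO_WIDTH_LAST = 0x11FF   # last code point of the Hangul Jamo block


def force_jamo_zero_width(widths):
    """Return a copy of a {code point: width} table with U+1160..U+11FF set to 0."""
    result = dict(widths)
    for code_point in range(JAMO_ZERO_WIDTH_FIRST, JAMO_ZERO_WIDTH_LAST + 1):
        result[code_point] = 0
    return result
```

Entries outside the range are copied unchanged, so the override can run after
whatever logic produced the original table.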

Markus Kuhn's implementation of wcwidth does this explicitly:
  https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
  /* generated by "uniset +cat=Me +cat=Mn +cat=Cf -00AD +1160-11FF +200B c" */
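
The uniset expression above can be approximated in Python as follows. This is
only a sketch: it uses the Unicode data bundled with the running Python
interpreter rather than the data Markus Kuhn used, so results for individual
code points may differ between versions. The name is_zero_width is an
assumption made for illustration.

```python
import unicodedata

# General categories named in the uniset expression: +cat=Me +cat=Mn +cat=Cf
ZERO_WIDTH_CATEGORIES = {"Me", "Mn", "Cf"}


def is_zero_width(code_point):
    """Approximate membership in the set described by the uniset expression."""
    if code_point == 0x00AD:            # -00AD: SOFT HYPHEN is excluded
        return False
    if 0x1160 <= code_point <= 0x11FF:  # +1160-11FF: Hangul jungseong/jongseong
        return True
    if code_point == 0x200B:            # +200B: ZERO WIDTH SPACE
        return True
    return unicodedata.category(chr(code_point)) in ZERO_WIDTH_CATEGORIES
```

The explicit range check matters because the jamo in U+1160-U+11FF are letters
(general category Lo), so the category test alone would not select them.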

Unicode 10.0.0 chapter 18 section 6 page 713 [1] states:
The Hangul Jamo block contains the most frequently used conjoining jamo. These
include all of the jamo used in modern Hangul syllable blocks, as well as many
of the jamo for Old Korean. The Hangul jamo are divided into three classes:
choseong (leading consonants, or syllable-initial characters), jungseong
(vowels, or syllable-peak characters), and jongseong (trailing consonants, or
syllable-final characters). Each class may, in turn, consist of one to three
subunits. For example, a choseong syllable-initial character may either
represent a single consonant sound, or a consonant cluster consisting of two or
three consonant sounds. Likewise, a jungseong syllable-peak character may
represent a simple vowel sound, or a complex diphthong or triphthong with
onglide or offglide sounds. Each of these complex sequences of two or three
sounds is encoded as a single conjoining jamo character. Therefore, a complete
Hangul syllable can always be conceived of as a single choseong followed by a
single jungseong and (optionally) a single jongseong. This block also contains
two invisible filler characters which act as placeholders for a missing
choseong or jungseong in an incomplete syllable. These filler characters are
U+115F hangul choseong filler and U+1160 hangul jungseong filler.

[1] http://www.unicode.org/versions/Unicode10.0.0/ch18.pdf
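
The three classes described in the quoted passage map onto contiguous ranges
of the Hangul Jamo block (U+1100-U+11FF). A small sketch of that mapping,
where the function name jamo_class is an assumption made for illustration:

```python
def jamo_class(code_point):
    """Classify a code point of the Hangul Jamo block (U+1100..U+11FF).

    Returns "choseong", "jungseong", "jongseong", or None for code points
    outside the block.
    """
    if 0x1100 <= code_point <= 0x115F:
        return "choseong"   # leading consonants, ends at the choseong filler
    if 0x1160 <= code_point <= 0x11A7:
        return "jungseong"  # vowels, starts at the jungseong filler
    if 0x11A8 <= code_point <= 0x11FF:
        return "jongseong"  # trailing consonants
    return None
```

The two fillers sit at the boundary: jamo_class(0x115F) is "choseong" and
jamo_class(0x1160) is "jungseong", which is why the requested zero-width
range starts at U+1160.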

As a concrete example, these three code points (one from each class):
  ᅔ U+1154 Hangul Choseong Chitueumchieuch
  ᅦ U+1166 Hangul Jungseong E
  ᇯ U+11EF Hangul Jongseong Ieung-Khieukh

form a single grapheme that should have a wcwidth of 1:
  ᅔᅦᇯ

The tricky part is that jungseong and jongseong conjoin only when they follow
a choseong. So if you put a space (U+0020) between each of those code points,
you'd end up with a wcwidth of 5.
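
The context dependence can be illustrated with a deliberately simplified
sketch that counts display clusters rather than columns. The rule used here
(a jungseong or jongseong joins the current cluster only if that cluster began
with a choseong) is an assumption made for illustration and is not the full
Unicode conjoining behavior; the name count_clusters is likewise hypothetical.

```python
def count_clusters(text):
    """Count clusters, letting jungseong/jongseong attach to a leading choseong.

    Simplified rule (an assumption, not the full Unicode algorithm): a code
    point in U+1160..U+11FF joins the current cluster only when that cluster
    was started by a choseong (U+1100..U+115F).
    """
    clusters = 0
    in_syllable = False  # True while the current cluster began with a choseong
    for ch in text:
        code_point = ord(ch)
        if in_syllable and 0x1160 <= code_point <= 0x11FF:
            continue  # conjoins with the preceding choseong
        clusters += 1
        in_syllable = 0x1100 <= code_point <= 0x115F
    return clusters


print(count_clusters("\u1154\u1166\u11ef"))    # conjoined sequence
print(count_clusters("\u1154 \u1166 \u11ef"))  # same jamo separated by spaces
```

A per-code-point function such as wcwidth() cannot see the preceding
character, so it has to pick one width for each jamo regardless of context.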


Referenced Bugs:

https://sourceware.org/bugzilla/show_bug.cgi?id=21750
[Bug 21750] column width of characters incompatible with classical wcwidth
