This is the mail archive of the
glibc-bugs@sourceware.org
mailing list for the glibc project.
[Bug localedata/22074] New: charmaps/UTF-8: wcwidth for U+1160-U+11FF (Hangul Jungseong and Jongseong) should be 0
- From: "vapier at gentoo dot org" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sourceware dot org
- Date: Sun, 03 Sep 2017 21:01:07 +0000
- Subject: [Bug localedata/22074] New: charmaps/UTF-8: wcwidth for U+1160-U+11FF (Hangul Jungseong and Jongseong) should be 0
- Auto-submitted: auto-generated
https://sourceware.org/bugzilla/show_bug.cgi?id=22074
Bug ID: 22074
Summary: charmaps/UTF-8: wcwidth for U+1160-U+11FF (Hangul
Jungseong and Jongseong) should be 0
Product: glibc
Version: 2.26
Status: NEW
Severity: normal
Priority: P2
Component: localedata
Assignee: unassigned at sourceware dot org
Reporter: vapier at gentoo dot org
CC: egmont at gmail dot com, libc-locales at sourceware dot org,
maiku.fabian at gmail dot com, tg at mirbsd dot de
Depends on: 21750
Target Milestone: ---
+++ This bug was initially created as a clone of Bug #21750 +++
I’ve compared the new autogenerated column widths from
localedata/unicode-gen/utf8_gen.py with the results of the classical wcwidth()
implementation from xterm (adjusted to Unicode 10.0.0) and found a few
divergences. I also found bugs on my side (MirBSD, which uses something based
on xterm’s data system-wide), which I have fixed.
Hangul Jamo medial vowels and final consonants are given width 0 by xterm so
that they combine with the preceding initial consonants: U+1160‥U+11FF
Change Request to glibc: force U+1160‥U+11FF to width 0.
Markus Kuhn's implementation of wcwidth does this explicitly:
https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
/* generated by "uniset +cat=Me +cat=Mn +cat=Cf -00AD +1160-11FF +200B c" */
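The uniset rule quoted above can be sketched roughly as follows. This is a minimal illustration, not Markus Kuhn's code and not what utf8_gen.py does: the function name sketch_wcwidth is hypothetical, and the wide-character test uses Python's unicodedata.east_asian_width as a stand-in for the explicit ranges in wcwidth.c. Results for some code points depend on the Unicode version shipped with the Python interpreter.

```python
import unicodedata

# Categories treated as zero width by the quoted uniset rule.
ZERO_WIDTH_CATEGORIES = {"Me", "Mn", "Cf"}

# Explicit override from the rule: Hangul Jungseong and Jongseong.
JAMO_ZERO_WIDTH = range(0x1160, 0x1200)  # U+1160..U+11FF inclusive


def sketch_wcwidth(cp: int) -> int:
    """Hypothetical per-code-point width, loosely following the uniset rule."""
    if cp == 0:
        return 0
    if cp < 0x20 or 0x7F <= cp < 0xA0:
        return -1  # control characters have no defined width
    if cp == 0x00AD:
        return 1  # "-00AD": SOFT HYPHEN is excluded from the zero-width set
    if cp in JAMO_ZERO_WIDTH or cp == 0x200B:
        return 0  # "+1160-11FF +200B"
    ch = chr(cp)
    if unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES:
        return 0  # "+cat=Me +cat=Mn +cat=Cf"
    # Assumption: East Asian Wide/Fullwidth stands in for Kuhn's wide ranges.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


if __name__ == "__main__":
    for cp in (0x0041, 0x00AD, 0x0301, 0x1160, 0x11FF, 0x200B):
        print(f"U+{cp:04X} -> {sketch_wcwidth(cp)}")
```

The point of the sketch is the explicit range check: U+1160‥U+11FF have general category Lo, so a rule based on Me/Mn/Cf alone would not make them zero width.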
Unicode 10.0.0 chapter 18 section 6 page 713 [1] states:
The Hangul Jamo block contains the most frequently used conjoining jamo. These
include all of the jamo used in modern Hangul syllable blocks, as well as many
of the jamo for Old Korean. The Hangul jamo are divided into three classes:
choseong (leading consonants, or syllable-initial characters), jungseong
(vowels, or syllable-peak characters), and jongseong (trailing consonants, or
syllable-final characters). Each class may, in turn, consist of one to three
subunits. For example, a choseong syllable-initial character may either
represent a single consonant sound, or a consonant cluster consisting of two or
three consonant sounds. Likewise, a jungseong syllable-peak character may
represent a simple vowel sound, or a complex diphthong or triphthong with
onglide or offglide sounds. Each of these complex sequences of two or three
sounds is encoded as a single conjoining jamo character. Therefore, a complete
Hangul syllable can always be conceived of as a single choseong followed by a
single jungseong and (optionally) a single jongseong. This block also contains
two invisible filler characters which act as placeholders for a missing
choseong or jungseong in an incomplete syllable. These filler characters are
U+115F hangul choseong filler and U+1160 hangul jungseong filler.
[1] http://www.unicode.org/versions/Unicode10.0.0/ch18.pdf
As a concrete example, these three code points (one from each class):
ᅔ U+1154 Hangul Choseong Chitueumchieuch
ᅦ U+1166 Hangul Jungseong E
ᇯ U+11ef Hangul Jongseong Ieung-Khieukh
will form a single grapheme that should have a wcwidth of 1:
ᅔᅦᇯ
The tricky part is that Jungseong and Jongseong conjoin only when they follow a
Choseong. So if you had a space (U+0020) between each of those code points,
you'd end up with a wcwidth of 5.
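The context dependence described here can be illustrated with a small sketch. It is a simplification made for this example: the function name conjoining_flags is hypothetical, only the Hangul Jamo block U+1100‥U+11FF is considered (not the extended jamo blocks or precomposed syllables), and it only reports whether each code point attaches to the one before it. It does not compute widths, since wcwidth() sees one code point at a time and cannot take this context into account.

```python
# Class ranges within the Hangul Jamo block (U+1100..U+11FF).
CHOSEONG = range(0x1100, 0x1160)   # leading consonants, incl. U+115F filler
JUNGSEONG = range(0x1160, 0x11A8)  # vowels, incl. U+1160 filler
JONGSEONG = range(0x11A8, 0x1200)  # trailing consonants


def conjoining_flags(text: str) -> list[bool]:
    """For each code point, report whether it conjoins with what precedes it.

    Simplified model: choseong, then jungseong, then optional jongseong.
    """
    flags = []
    state = None  # last class seen inside the current syllable: "L", "V", "T"
    for ch in text:
        cp = ord(ch)
        if cp in CHOSEONG:
            joined = False       # a choseong starts a new syllable
            state = "L"
        elif cp in JUNGSEONG:
            joined = state in ("L", "V")
            state = "V" if joined else None
        elif cp in JONGSEONG:
            joined = state in ("V", "T")
            state = "T" if joined else None
        else:
            joined = False       # anything else (e.g. a space) breaks the syllable
            state = None
        flags.append(joined)
    return flags


if __name__ == "__main__":
    print(conjoining_flags("\u1154\u1166\u11ef"))    # jamo adjacent
    print(conjoining_flags("\u1154 \u1166 \u11ef"))  # jamo separated by spaces
```

With the three jamo adjacent, the second and third are flagged as conjoining; with spaces in between, none are, which is the case a fixed per-code-point width of 0 cannot distinguish.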
Referenced Bugs:
https://sourceware.org/bugzilla/show_bug.cgi?id=21750
[Bug 21750] column width of characters incompatible with classical wcwidth