22074 – charmaps/UTF-8: wcwidth for U+1160-U+11FF (Hangul Jungseong and Jongseong) should be 0


Status:	RESOLVED FIXED

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	localedata
Version:	2.26

Importance:	P2 normal
Target Milestone:	2.27
Assignee:	Mike FABIAN

URL:
Keywords:

Depends on:	21750
Blocks:

Reported:	2017-09-03 21:01 UTC by Mike Frysinger
Modified:	2017-09-11 20:09 UTC
CC List:	4 users

See Also:
Host:
Target:
Build:
Last reconfirmed:

Flags:	fweimer: security-


Description Mike Frysinger 2017-09-03 21:01:07 UTC

+++ This bug was initially created as a clone of Bug #21750 +++

I’ve compared the new autogenerated column widths from localedata/unicode-gen/utf8_gen.py with the results of the classical wcwidth() implementation from xterm (adjusted to Unicode 10.0.0) and found a few divergences. I also found bugs on my side (MirBSD, which uses something based on xterm’s data system-wide), which I fixed.

Hangul Jamo medial vowels and final consonants are set to 0 by xterm so they combine on top of the preceding initial ones: U+1160‥U+11FF

Change Request to glibc: force U+1160‥U+11FF to width 0.

Markus Kuhn's implementation of wcwidth does this explicitly:
  https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
  /* generated by "uniset +cat=Me +cat=Mn +cat=Cf -00AD +1160-11FF +200B c" */
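
As a rough illustration of what that uniset expression selects, here is a minimal Python sketch of the membership test. The function name in_kuhn_zero_width_set is made up for this example, and the general categories come from whatever Unicode version Python's unicodedata module bundles, not necessarily 10.0.0, so this is an approximation rather than a reproduction of Kuhn's table.

```python
import unicodedata


def in_kuhn_zero_width_set(cp: int) -> bool:
    """Approximate "+cat=Me +cat=Mn +cat=Cf -00AD +1160-11FF +200B".

    Hypothetical helper for illustration only; not part of glibc or of
    Markus Kuhn's wcwidth.c.
    """
    if cp == 0x00AD:
        # SOFT HYPHEN is category Cf but is explicitly removed from the set.
        return False
    if 0x1160 <= cp <= 0x11FF:
        # Hangul Jungseong and Jongseong, added explicitly by range.
        return True
    if cp == 0x200B:
        # ZERO WIDTH SPACE, added explicitly.
        return True
    # Enclosing marks, nonspacing marks and format characters.
    return unicodedata.category(chr(cp)) in ("Me", "Mn", "Cf")


if __name__ == "__main__":
    for cp in (0x0041, 0x00AD, 0x0301, 0x115F, 0x1160, 0x11FF, 0x200B):
        print("U+{:04X}".format(cp), in_kuhn_zero_width_set(cp))
```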

Unicode 10.0.0 chapter 18 section 6 page 713 [1] states:
The Hangul Jamo block contains the most frequently used conjoining jamo. These include all of the jamo used in modern Hangul syllable blocks, as well as many of the jamo for Old Korean. The Hangul jamo are divided into three classes: choseong (leading consonants, or syllable-initial characters), jungseong (vowels, or syllable-peak characters), and jongseong (trailing consonants, or syllable-final characters). Each class may, in turn, consist of one to three subunits. For example, a choseong syllable-initial character may either represent a single consonant sound, or a consonant cluster consisting of two or three consonant sounds. Likewise, a jungseong syllable-peak character may represent a simple vowel sound, or a complex diphthong or triphthong with onglide or offglide sounds. Each of these complex sequences of two or three sounds is encoded as a single conjoining jamo character. Therefore, a complete Hangul syllable can always be conceived of as a single choseong followed by a single jungseong and (optionally) a single jongseong. This block also contains two invisible filler characters which act as placeholders for a missing choseong or jungseong in an incomplete syllable. These filler characters are U+115F hangul choseong filler and U+1160 hangul jungseong filler.

[1] http://www.unicode.org/versions/Unicode10.0.0/ch18.pdf

a concrete example, these three codepoints (one from each class):
  ᅔ U+1154 Hangul Choseong Chitueumchieuch
  ᅦ U+1166 Hangul Jungseong E
  ᇯ U+11ef Hangul Jongseong Ieung-Khieukh

will form a single grapheme that should be wcwidth of 1:
  ᅔᅦᇯ

the tricky part is that Jungseong & Jongseong conjoin only when they follow a Choseong.  so if you had a space U+0020 between each of those codepoints, you'd end up with wcwidth of 5.
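
The context problem can be sketched with a per-codepoint sum. This is only an illustration: sum_widths, width_all_one and width_requested are hypothetical helpers, not glibc functions, and the choseong width of 1 is an assumption taken from the example above.

```python
def sum_widths(text, width_of):
    """Add up per-codepoint widths, the way a context-free table must."""
    return sum(width_of(ord(ch)) for ch in text)


def width_all_one(cp):
    """Hypothetical table where every jamo occupies one column."""
    return 1


def width_requested(cp):
    """Hypothetical table with U+1160..U+11FF forced to width 0."""
    return 0 if 0x1160 <= cp <= 0x11FF else 1


syllable = "\u1154\u1166\u11ef"   # choseong + jungseong + jongseong
spaced = " ".join(syllable)       # same codepoints with U+0020 in between

print(sum_widths(syllable, width_all_one))    # 3
print(sum_widths(syllable, width_requested))  # 1
print(sum_widths(spaced, width_all_one))      # 5
print(sum_widths(spaced, width_requested))    # 3
```

With the requested widths the conjoined sequence sums to 1, but the space-separated sequence sums to 3 rather than 5, because a per-codepoint table cannot tell whether a jungseong or jongseong actually follows a choseong.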

Comment 1 Mike FABIAN 2017-09-04 08:10:02 UTC

(In reply to Mike Frysinger from comment #0)
> 
> the tricky part is that Jungseong & Jongseong conjoin only when they follow
> a Choseong.  so if you had a space U+0020 between each of those codepoints,
> you'd end up with wcwidth of 5.

But we cannot really do anything about the context, so we have to decide
for one width to use.

Comment 2 Troy Korjuslommi 2017-09-04 08:10:34 UTC

Aren't Korean chars usually full width? I.e. wcwidth 2.

Troy




Comment 3 Mike FABIAN 2017-09-04 08:54:04 UTC

(In reply to Troy Korjuslommi from comment #2)
> Aren't Korean chars usually full width? I.e. wcwidth 2.
> 
> Troy

Yes, but the characters we are discussing here are *parts*
of Korean characters, not *whole* Korean characters.

See Mike Frysinger’s example in comment#0.

Comment 4 Mike Frysinger 2017-09-05 14:54:41 UTC

(In reply to Mike FABIAN from comment #1)
> But we cannot really do anything about the context, so we have to decide
> for one width to use.

today that is true.  long term, i think we'll need to figure out something better, along the lines of UAX29.  it wouldn't help with wcwidth, but it would for wcswidth.
  http://www.unicode.org/reports/tr29/

so i think for today, supporting the common case and not the degenerate/invalid cases makes sense.  which is to say, marking them as wcwidth of 0.

(In reply to Troy Korjuslommi from comment #2)
> Aren't Korean chars usually full width? I.e. wcwidth 2.

the precomposed hangul forms are wcwidth of 2:
  https://en.wikipedia.org/wiki/Hangul_Syllables

so i guess not only would we want to change the two sets to 0, we'd want to change the first set to 2.  that way we'd line up with the precomposed forms better.

i think that should get us closer, but it'd still be an approximation.  maybe we should just focus on wcswidth here ?
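
a rough sketch of that approximation, with assumptions spelled out: "the first set" is read here as the choseong range U+1100..U+115F, the helper names proposed_width and total_width are made up for this example, and everything outside the listed ranges is simply treated as width 1.

```python
import unicodedata


def proposed_width(cp):
    """Hypothetical width table for the proposal; not glibc's actual data."""
    if 0x1160 <= cp <= 0x11FF:
        return 0          # jungseong and jongseong
    if 0x1100 <= cp <= 0x115F:
        return 2          # choseong (assumed meaning of "the first set")
    if 0xAC00 <= cp <= 0xD7A3:
        return 2          # precomposed Hangul syllables
    return 1              # simplification: everything else is one column


def total_width(text):
    """Sum the per-codepoint widths of a string."""
    return sum(proposed_width(ord(ch)) for ch in text)


decomposed = "\u1100\u1161\u11a8"                       # conjoining jamo
precomposed = unicodedata.normalize("NFC", decomposed)  # single syllable

print(len(decomposed), total_width(decomposed))
print(len(precomposed), total_width(precomposed))
```

under these assumptions the conjoining sequence and its precomposed form sum to the same width.  sequences where the jamo don't follow a choseong would still be off.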

Comment 5 Mike FABIAN 2017-09-06 06:43:59 UTC

(In reply to Mike Frysinger from comment #4)
> (In reply to Mike FABIAN from comment #1)

> (In reply to Troy Korjuslommi from comment #2)
> > Aren't Korean chars usually full width? I.e. wcwidth 2.
> 
> the precomposed hangul forms are wcwidth of 2:
>   https://en.wikipedia.org/wiki/Hangul_Syllables
> 
> so i guess not only would we want to change the two sets to 0, we'd want to
> change the first set to 2.  that way we'd line up with the precomposed forms
> better.

I think we already set the precomposed hangul to width 2. Because of this line

AC00..D7A3;W     # Lo [11172] HANGUL SYLLABLE GA..HANGUL SYLLABLE HIH

in EastAsianWidth.txt, we have

<UAC00>...<UD7A3>       2

in the WIDTH section of charmaps/UTF-8.
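
To make the relation between the two lines concrete, here is a minimal sketch of how such a data line could be turned into a WIDTH entry. It is not the code in utf8_gen.py: the function names ucs_symbol and eaw_line_to_width_entry, the tab separator, and the 8-digit form for code points above U+FFFF are assumptions made for this illustration.

```python
def ucs_symbol(cp):
    """Format a code point as a charmap symbol (assumed convention)."""
    if cp < 0x10000:
        return "<U{:04X}>".format(cp)
    return "<U{:08X}>".format(cp)


def eaw_line_to_width_entry(line):
    """Map one EastAsianWidth.txt data line to a width-2 entry, or None.

    Only the W (wide) and F (fullwidth) properties yield an entry in this
    sketch; comments and all other properties are skipped.
    """
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    code_range, prop = [field.strip() for field in data.split(";")]
    if prop not in ("W", "F"):
        return None
    start, _, end = code_range.partition("..")
    first = int(start, 16)
    last = int(end, 16) if end else first
    if first == last:
        return "{}\t2".format(ucs_symbol(first))
    return "{}...{}\t2".format(ucs_symbol(first), ucs_symbol(last))


line = "AC00..D7A3;W     # Lo [11172] HANGUL SYLLABLE GA..HANGUL SYLLABLE HIH"
print(eaw_line_to_width_entry(line))
```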

Comment 6 Mike FABIAN 2017-09-06 13:14:53 UTC

So this bug is fixed, isn’t it?

Because in glibc master in charmaps/UTF-8, we have the
precomposed hangul with width 2 and the hangul jamo with width 0.

Comment 7 Troy Korjuslommi 2017-09-11 11:24:39 UTC

I was referring to the "should be wcwidth of 1" comment, which doesn't
seem to be correct. My point was that the wcwidth should be either 0 or
2.

Troy



Comment 8 Mike FABIAN 2017-09-11 20:08:53 UTC

I think this is fixed:

https://sourceware.org/bugzilla/show_bug.cgi?id=22074#c6