26120 – column width of some Korean JUNGSEONG/JONGSEONG characters wrong (should be 0)

Status:	RESOLVED FIXED

Product:	glibc
Classification:	Unclassified
Component:	localedata
Version:	2.31

Importance:	P2 normal
Target Milestone:	2.32
Assignee:	Mike FABIAN

Reported:	2020-06-16 05:43 UTC by Mike FABIAN
Modified:	2021-12-30 00:35 UTC

Flags:	fweimer: security-

Attachments
0001-Set-width-of-JUNGSEONG-JONGSEONG-characters-from-UD7.patch (698 bytes, patch) 2020-06-16 08:26 UTC, Mike FABIAN
0001-Set-width-of-JUNGSEONG-JONGSEONG-characters-from-UD7.patch (976 bytes, patch) 2020-06-21 09:00 UTC, Mike FABIAN
0001-Set-width-of-JUNGSEONG-JONGSEONG-characters-from-UD7.patch (975 bytes, patch) 2020-06-23 09:03 UTC, Mike FABIAN
0001-Set-width-of-JUNGSEONG-JONGSEONG-characters-from-UD7.patch (1.85 KB, patch) 2020-06-25 13:05 UTC, Mike FABIAN

Description Mike FABIAN 2020-06-16 05:43:43 UTC

Robert Ross <rob.ross@ymail.com> writes:

> Thank you for maintaining glibc's "localedata/charmaps/UTF-8".  It is
> good that most "HANGUL JUNGSEONG" characters have zero width due to
> "<U1160>...<U11FF> 0" on line 48775 but strange that the newer "HANGUL
> JUNGSEONG" characters have width 1 since there is no
> "<UD7B0>...<UD7C6> 0".  Similarly most "HANGUL JONGSEONG" characters
> have width 0 due to line 48775 but the newer ones have width 1 since
> there is no "<UD7CB>...<UD7FB> 0".  Please correct this if it's an
> error or explain if it's not.

In https://www.unicode.org/Public/13.0.0/ucd/EastAsianWidth.txt all of these
have width "N".

http://www.unicode.org/reports/tr11/ says:

6.2 Combining Marks

> Combining marks have been classified and are given a property
> assignment based on their typical applicability. For example,
> combining marks typically applied to characters of class N, Na, or W
> are classified as A. Combining marks for purely non-East Asian scripts
> are marked as N, and nonspacing marks used only with wide characters
> are given a W. Even more so than for other characters, the
> East_Asian_Width property for combining marks is not the same as their
> display width.
> 
> In particular, nonspacing marks do not possess actual advance
> width. Therefore, even when displaying combining marks, the
> East_Asian_Width property cannot be related to the advance width of
> these characters. However, it can be useful in determining the
> encoding length in a legacy encoding, or the choice of font for the
> range of characters including that nonspacing mark. The width of the
> glyph image of a nonspacing mark should always be chosen as the
> appropriate one for the width of the base character.

See also: https://sourceware.org/bugzilla/show_bug.cgi?id=21750#c5

> We also agree that the Hangul Jamo U+1160‥U+11FF are sort
> of "combining characters" although they are not marked as such
> in the Unicode data. But they are fragments of Hangul characters
> which combine. So it seems correct to mark them as width 0.
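
For illustration, the properties discussed above can be inspected with Python's standard unicodedata module. This is only a sketch: the module reflects whatever UCD version the Python build ships (not necessarily 13.0.0), and the helper name describe is made up for this example.

```python
import unicodedata

def describe(cp):
    """Return (East_Asian_Width, General_Category, combining class) for a code point."""
    ch = chr(cp)
    return (unicodedata.east_asian_width(ch),
            unicodedata.category(ch),
            unicodedata.combining(ch))

# First code point of each range mentioned in this report.
for cp in (0x1160, 0xD7B0, 0xD7CB):
    print(f"U+{cp:04X}", unicodedata.name(chr(cp)), describe(cp))
```

None of these properties marks the characters as combining, which is why the zero width has to be set explicitly in the charmap rather than derived from the Unicode data.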

Comment 1 Mike FABIAN 2020-06-16 05:47:36 UTC

So I think it is best to set all JUNGSEONG/JONGSEONG characters to width 0.

Comment 2 Mike FABIAN 2020-06-16 05:58:07 UTC

Some information from a chat with Thorsten Glaser (translated from German):

<mfabian> Is everything that has JUNGSEONG or JONGSEONG in its name such a
          combining character? [20年06月15日 21:38:17]
<MirWarm> as far as I understand it, Korean characters are always
          choseong + j{u,o}ngseong [20年06月15日 21:54:15]
<MirWarm>  The Hangul jamo are divided into three classes: choseong (Leading
          consonants), jungseong (Vowels) and jongseong (Trailing consonants)
          which in the rest of this write-up will be referred to as L, V and T.
                                                          [20年06月15日 21:58:54]
<MirWarm> A standard Hangul syllable is composed as (L+V+T*)
                                                          [20年06月15日 21:58:55]
<MirWarm> ah, yes [20年06月15日 21:58:57]
<MirWarm> so the choseong are apparently not required in the Korean script,
          but they are in Unicode; you then have to start with U+115F
                                                          [20年06月15日 21:59:24]
<MirWarm> choseong is initial (C), jungseong is medial (G) and nucleus (V),
          jongseong is coda (K) [20年06月15日 22:00:15]
<MirWarm> and Korean syllable words are (C)(G)V(K) [20年06月15日 22:00:27]
<MirWarm> and in Unicode you use U+115F when C is missing [20年06月15日 22:00:53]
<MirWarm> 115F is 1, the others are 0 [20年06月15日 22:01:06]
<MirWarm> that fits [20年06月15日 22:01:07]
<MirWarm> back in ~5 minutes [20年06月15日 22:01:14]
*** MirWarm (~mird@2001-4dd7-dca-0-21f-3bff-fe0d-cbb1.ipv6dyn.netcologne.de) has
    quit: Quit: using sirc version 2.211-MirDebian-20181124-1+ssfe (RANDOM=2406)
                                                          [20年06月15日 22:01:15]
*** MirWarm (~mird@x61e.mirbsd.org) has joined channel #mirbsd
                                                          [20年06月15日 22:06:44]
<MirWarm> re [20年06月15日 22:07:05]
<MirWarm> I'll set D7B0 .. D7FF to 0 on my side right away as well
                                                          [20年06月15日 22:08:33]
<MirWarm> okay, committed [20年06月15日 22:31:19]
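
The L/V/T structure quoted in the chat can be sketched in Python as follows. The code point ranges are the ones Unicode's Hangul_Syllable_Type property gives for conjoining jamo; the names JAMO_CLASSES, jamo_class and is_standard_syllable are invented for this example and do not come from glibc.

```python
import re

# Code point ranges per class, following Unicode's Hangul_Syllable_Type
# (L = choseong, V = jungseong, T = jongseong).
JAMO_CLASSES = {
    "L": [(0x1100, 0x115F), (0xA960, 0xA97C)],
    "V": [(0x1160, 0x11A7), (0xD7B0, 0xD7C6)],
    "T": [(0x11A8, 0x11FF), (0xD7CB, 0xD7FB)],
}

def jamo_class(ch):
    """Return 'L', 'V' or 'T' for a conjoining jamo, 'X' for anything else."""
    cp = ord(ch)
    for name, ranges in JAMO_CLASSES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return name
    return "X"

def is_standard_syllable(text):
    """Check whether text matches the L+V+T* pattern quoted above."""
    classes = "".join(jamo_class(ch) for ch in text)
    return re.fullmatch(r"L+V+T*", classes) is not None
```

With this, a vowel on its own such as "\u1161" is rejected, while "\u115f\u1161" (choseong filler plus vowel) is accepted, matching the remark that U+115F stands in when the leading consonant is missing.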

Comment 3 Mike FABIAN 2020-06-16 08:26:18 UTC

Created attachment 12623
0001-Set-width-of-JUNGSEONG-JONGSEONG-characters-from-UD7.patch

Comment 4 Florian Weimer 2020-06-16 11:53:23 UTC

Does gnulib need updating as well?

Comment 5 Mike FABIAN 2020-06-16 17:32:39 UTC

(In reply to Florian Weimer from comment #4)
> Does gnulib need updating as well?

I don’t know. Does gnulib have width data?

Comment 6 Florian Weimer 2020-06-16 17:35:54 UTC

Yes, I think it's here:

http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/uniwidth/width.c;h=c760ad33183418a8f103152ff43d57fabbc3949d;hb=HEAD

Comment 7 Thorsten Glaser 2020-06-20 21:19:23 UTC

Erk… glibc is particular about not defining widths of not-defined characters.

Besides D7FC‥D7FF (which gave me an error in the output from my own scripts), D7C7‥D7CA are not yet assigned and so probably need to be excluded in glibc.

Should they ever be defined, we’ll need to adjust here, so it’s probably better to iterate over the entire D7C0‥D7FF range and only change widths for defined codepoints from the current UCD version.

Comment 8 Mike FABIAN 2020-06-21 09:00:55 UTC

Created attachment 12629
0001-Set-width-of-JUNGSEONG-JONGSEONG-characters-from-UD7.patch

Updated patch to omit the unassigned characters.

Comment 9 Mike FABIAN 2020-06-21 09:07:02 UTC

(In reply to Thorsten Glaser from comment #7)
> Erk… glibc is particular about not defining widths of not-defined characters.
> 
> Besides D7FC‥D7FF (which gave me an error in the output from my own
> scripts), D7C7‥D7CA are not yet assigned and so probably need to be excluded
> in glibc.
> 
> Should they ever be defined, we’ll need to adjust here, so it’s probably
> better to iterate over the entire D7C0‥D7FF range and only change widths for
> defined codepoints from the current UCD version.

Thank you for noticing that!

I was aware that glibc has a problem with defining width of unassigned
characters, therefore I used 

 for key in list(range(0xD7B0, 0xD7FC)):

instead of 

 for key in list(range(0xD7B0, 0xD800)):

because D7FC‥D7FF are unassigned and localedef gave me errors
when I included them. Surprisingly, localedef did not give errors for the unassigned D7C7‥D7CA ...

I had checked the range manually and thought all characters
from D7B0 to D7FB were assigned, but apparently I missed D7C7‥D7CA.

I improved the generator script a bit to omit the unassigned characters.
If these get defined in the future, the script will add them.
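
A minimal sketch of that idea, not the actual generator script: it uses Python's unicodedata module as a stand-in for the UCD files, treats General_Category "Cn" as "unassigned", and emits WIDTH lines in the "<UXXXX>...<UYYYY> 0" form quoted in the description. The function name zero_width_lines and its default bounds are assumptions made for this example.

```python
import unicodedata

def zero_width_lines(start=0xD7B0, end=0xD7FF):
    """Yield charmap WIDTH lines for runs of assigned code points in [start, end]."""
    run_start = None
    for cp in range(start, end + 2):  # one step past the end flushes the last run
        assigned = cp <= end and unicodedata.category(chr(cp)) != "Cn"
        if assigned and run_start is None:
            run_start = cp
        elif not assigned and run_start is not None:
            last = cp - 1
            if run_start == last:
                yield f"<U{run_start:04X}> 0"
            else:
                yield f"<U{run_start:04X}>...<U{last:04X}> 0"
            run_start = None

for line in zero_width_lines():
    print(line)
```

Because unassigned code points simply break a run, any characters added to the range in a later Unicode version would be picked up without changing the code.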

Comment 10 Thorsten Glaser 2020-06-21 14:20:22 UTC

Looks okay (but now you can use 0xD800 in the range call), this is similar to what I did in my script http://www.mirbsd.org/cvs.cgi/contrib/code/Snippets/eaw2glibc that postprocesses the width output I normally use (script http://www.mirbsd.org/cvs.cgi/contrib/code/Snippets/eawparse and http://www.mirbsd.org/cvs.cgi/X11/xc/programs/xterm/wcwidth.c?rev=HEAD contains an example of its output) into glibc-compatible format.

The output I get (for UCD 13.0.0) is identical to yours.

Comment 11 Mike FABIAN 2020-06-23 07:08:49 UTC

(In reply to Thorsten Glaser from comment #10)
> Looks okay (but now you can use 0xD800 in the range call), 

Yes, I could. But if 0xD7FE and 0xD7FF ever get assigned, 
would they be characters of the same type? I would have to check 
that manually anyway.

> The output I get (for UCD 13.0.0) is identical to yours.

Great!

Comment 12 Thorsten Glaser 2020-06-23 07:33:35 UTC

According to Blocks.txt, yes. Unicode does assign characters to blocks.

Comment 13 Mike FABIAN 2020-06-23 08:50:24 UTC

(In reply to Thorsten Glaser from comment #12)
> According to Blocks.txt, yes. Unicode does assign characters to blocks.

D7B0..D7FF; Hangul Jamo Extended-B

I think you are right, I’ll change the script to end the range at the end of that block, that seems more likely to be correct if these characters ever get assigned.
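
As a rough illustration, the range bounds could be taken from the Blocks.txt entry instead of being hard-coded. The helper parse_block_line below is hypothetical and handles only a single data line in the format shown above.

```python
def parse_block_line(line):
    """Parse a Blocks.txt entry like 'D7B0..D7FF; Hangul Jamo Extended-B'."""
    code_range, name = line.split(";", 1)
    first, last = code_range.strip().split("..")
    return int(first, 16), int(last, 16), name.strip()

first, last, name = parse_block_line("D7B0..D7FF; Hangul Jamo Extended-B")
keys = range(first, last + 1)  # range() excludes its stop value, hence the + 1
```

Note the + 1: ending the range at the block's last code point means passing 0xD800 as the stop value to range().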

Comment 14 Mike FABIAN 2020-06-23 09:03:18 UTC

Created attachment 12651
0001-Set-width-of-JUNGSEONG-JONGSEONG-characters-from-UD7.patch

End the range at 0xD7FF

Comment 15 Mike FABIAN 2020-06-25 13:05:24 UTC

Created attachment 12661
0001-Set-width-of-JUNGSEONG-JONGSEONG-characters-from-UD7.patch

Use "make install" instead of only changing the UTF-8 file.

Comment 16 Mike FABIAN 2020-06-28 12:50:33 UTC

Fixed in glibc master.

Comment 17 Bruno Haible 2021-12-30 00:35:10 UTC

(In reply to Florian Weimer from comment #6)
> Yes, I think it's here:
> 
> http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/uniwidth/width.c;h=c760ad33183418a8f103152ff43d57fabbc3949d;hb=HEAD

I have applied an equivalent change to the uc_width function in gnulib:
https://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=commitdiff;h=8026587b94e4274f3406a36bc89348a24ea86b6a

Experiments with xterm did not convince me, but experiments with gnome-terminal did, since gnome-terminal is widely used and is ahead in terms of Unicode support.

And the fact that Unicode's EastAsianWidth.txt assigns width 2 to these characters is irrelevant, because https://www.unicode.org/reports/tr11/ makes it clear that its focus is on traditional Japanese rendering engines, and such traditional code cannot handle conjoining Hangul Jamo anyway. Here we need to care about the Unicode-compliant rendering engines (such as the one in gnome-terminal), not the legacy rendering engines.