Robert Ross <rob.ross@ymail.com> writes:

> Thank you for maintaining glibc's "localedata/charmaps/UTF-8". It is
> good that most "HANGUL JUNGSEONG" characters have zero width due to
> "<U1160>...<U11FF> 0" on line 48775 but strange that the newer "HANGUL
> JUNGSEONG" characters have width 1 since there is no
> "<UD7B0>...<UD7C6> 0". Similarly most "HANGUL JONGSEONG" characters
> have width 0 due to line 48775 but the newer ones have width 1 since
> there is no "<UD7CB>...<UD7FB> 0". Please correct this if it's an
> error or explain if it's not.

In https://www.unicode.org/Public/13.0.0/ucd/EastAsianWidth.txt all of these have width "N".

http://www.unicode.org/reports/tr11/ says:

6.2 Combining Marks

> Combining marks have been classified and are given a property
> assignment based on their typical applicability. For example,
> combining marks typically applied to characters of class N, Na, or W
> are classified as A. Combining marks for purely non-East Asian scripts
> are marked as N, and nonspacing marks used only with wide characters
> are given a W. Even more so than for other characters, the
> East_Asian_Width property for combining marks is not the same as their
> display width.
>
> In particular, nonspacing marks do not possess actual advance
> width. Therefore, even when displaying combining marks, the
> East_Asian_Width property cannot be related to the advance width of
> these characters. However, it can be useful in determining the
> encoding length in a legacy encoding, or the choice of font for the
> range of characters including that nonspacing mark. The width of the
> glyph image of a nonspacing mark should always be chosen as the
> appropriate one for the width of the base character.

See also: https://sourceware.org/bugzilla/show_bug.cgi?id=21750#c5

> We also agree that the Hangul Jamo U+1160‥U+11FF are sort
> of "combining characters" although they are not marked as such
> in the Unicode data. But they are fragments of Hangul characters
> which combine. So it seems correct to mark them as width 0.

So I think it is best to set all JUNGSEONG/JONGSEONG characters to width 0.
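
To make the report concrete, here is a minimal Python sketch of how WIDTH entries in the quoted syntax resolve to a column width. The helpers `parse_width_line` and `lookup_width` are hypothetical and only mimic the charmap syntax; the fallback of 1 for unlisted characters reflects what the report describes, not localedef internals.

```python
import re

# Matches "<U1160>...<U11FF> 0" (range) and "<U115F> 2" (single code point).
WIDTH_RE = re.compile(r"<U([0-9A-F]{4,8})>(?:\.\.\.<U([0-9A-F]{4,8})>)?\s+(\d+)")

def parse_width_line(line):
    """Return (first, last, width) for one WIDTH entry."""
    m = WIDTH_RE.fullmatch(line.strip())
    if m is None:
        raise ValueError(f"not a WIDTH entry: {line!r}")
    first = int(m.group(1), 16)
    last = int(m.group(2), 16) if m.group(2) else first
    return first, last, int(m.group(3))

def lookup_width(entries, codepoint, default=1):
    """Width of codepoint; characters not listed fall back to default."""
    for first, last, width in entries:
        if first <= codepoint <= last:
            return width
    return default

# Entry that exists today, and the two entries the report asks for.
before = [parse_width_line("<U1160>...<U11FF> 0")]
after = before + [
    parse_width_line("<UD7B0>...<UD7C6> 0"),
    parse_width_line("<UD7CB>...<UD7FB> 0"),
]

print(lookup_width(before, 0x1160))  # older JUNGSEONG, listed
print(lookup_width(before, 0xD7B0))  # newer JUNGSEONG, falls back to default
print(lookup_width(after, 0xD7B0))   # with the proposed entries
```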
Some information from a chat with Thorsten Glaser (translated from German):

<mfabian> Is everything that has JUNGSEONG or JONGSEONG in its name such a combining character? [20年06月15日 21:38:17]
<MirWarm> as far as I understand it, Korean characters are always choseong + j{u,o}ngseong [20年06月15日 21:54:15]
<MirWarm> The Hangul jamo are divided into three classes: choseong (Leading consonants), jungseong (Vowels) and jongseong (Trailing consonants) which in the rest of this write-up will be referred to as L, V and T. [20年06月15日 21:58:54]
<MirWarm> A standard Hangul syllable is composed as (L+V+T*) [20年06月15日 21:58:55]
<MirWarm> ah, yes [20年06月15日 21:58:57]
<MirWarm> so the choseong are apparently not required in the Korean script, but in Unicode they are; you then have to start with U+115F [20年06月15日 21:59:24]
<MirWarm> choseong is initial (C), jungseong is medial (G) and nucleus (V), jongseong is coda (K) [20年06月15日 22:00:15]
<MirWarm> and Korean syllable words are (C)(G)V(K) [20年06月15日 22:00:27]
<MirWarm> and in Unicode you use U+115F when C is missing [20年06月15日 22:00:53]
<MirWarm> 115F is 1, the others are 0 [20年06月15日 22:01:06]
<MirWarm> fits [20年06月15日 22:01:07]
<MirWarm> I'll be back in ~5 minutes [20年06月15日 22:01:14]
*** MirWarm (~mird@2001-4dd7-dca-0-21f-3bff-fe0d-cbb1.ipv6dyn.netcologne.de) has quit: Quit: using sirc version 2.211-MirDebian-20181124-1+ssfe (RANDOM=2406) [20年06月15日 22:01:15]
*** MirWarm (~mird@x61e.mirbsd.org) has joined channel #mirbsd [20年06月15日 22:06:44]
<MirWarm> re [20年06月15日 22:07:05]
<MirWarm> I'll go ahead and set D7B0 .. D7FF to 0 on my side as well [20年06月15日 22:08:33]
<MirWarm> there, committed [20年06月15日 22:31:19]
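
The L+V+T structure mentioned in the chat can be illustrated with the Hangul syllable composition arithmetic from the Unicode Standard. This is only a sketch: `compose_syllable` is a hypothetical helper, and the formula covers modern jamo only (L in U+1100‥U+1112, V in U+1161‥U+1175, T in U+11A8‥U+11C2). The Hangul Jamo Extended-B characters discussed in this bug have no precomposed syllables and always stay as conjoining sequences.

```python
import unicodedata

# Constants of the Hangul syllable composition algorithm (modern jamo only).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def compose_syllable(l, v, t=None):
    """Compose modern conjoining jamo L + V (+ optional T) into one syllable."""
    l_index = ord(l) - L_BASE
    v_index = ord(v) - V_BASE
    # T_BASE is one below the first trailing consonant, so "no T" is index 0.
    t_index = ord(t) - T_BASE if t is not None else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

jamo = "\u1112\u1161\u11AB"  # CHOSEONG HIEUH + JUNGSEONG A + JONGSEONG NIEUN
syllable = compose_syllable(*jamo)
print(hex(ord(syllable)))
# Cross-check against the normalization implemented in the standard library.
print(syllable == unicodedata.normalize("NFC", jamo))
```

The three jamo occupy a single syllable cell on screen, which is why the vowel and the trailing consonant are treated as adding no width of their own.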
Created attachment 12623 [details] 0001-Set-width-of-JUNGSEONG-JONGSEONG-characters-from-UD7.patch
Does gnulib need updating as well?
(In reply to Florian Weimer from comment #4)
> Does gnulib need updating as well?

I don’t know. Does gnulib have width data?
Yes, I think it's here: http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/uniwidth/width.c;h=c760ad33183418a8f103152ff43d57fabbc3949d;hb=HEAD
Erk… glibc is particular about not defining widths of not-defined characters.

Besides D7FC‥D7FF (which gave me an error in the output from my own scripts), D7C7‥D7CA are not yet assigned and so probably need to be excluded in glibc.

Should they ever be defined, we’ll need to adjust here, so it’s probably better to iterate over the entire D7C0‥D7FF range and only change widths for defined codepoints from the current UCD version.
Created attachment 12629 [details]
0001-Set-width-of-JUNGSEONG-JONGSEONG-characters-from-UD7.patch

Updated patch to omit the unassigned characters.
(In reply to Thorsten Glaser from comment #7)
> Erk… glibc is particular about not defining widths of not-defined characters.
>
> Besides D7FC‥D7FF (which gave me an error in the output from my own
> scripts), D7C7‥D7CA are not yet assigned and so probably need to be excluded
> in glibc.
>
> Should they ever be defined, we’ll need to adjust here, so it’s probably
> better to iterate over the entire D7C0‥D7FF range and only change widths for
> defined codepoints from the current UCD version.

Thank you for noticing that!

I was aware that glibc has a problem with defining widths of unassigned characters, so I used

    for key in list(range(0xD7B0, 0xD7FC)):

instead of

    for key in list(range(0xD7B0, 0xD800)):

because D7FC‥D7FF are undefined and localedef gave me errors when I included them.

Surprisingly, localedef did not give errors for the unassigned D7C7‥D7CA. I had checked the range manually and thought all characters from D7B0 to D7FB were assigned, but apparently I missed D7C7‥D7CA.

I improved the generator script a bit to omit the unassigned characters; if these get defined in the future, the script will add them.
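
The idea of iterating over the whole range and skipping unassigned code points could look roughly like the sketch below. It is not the actual glibc generator script: `is_assigned` and `zero_width_ranges` are hypothetical names, and Python's `unicodedata` module (whose Unicode version depends on the interpreter) stands in for a lookup in the UCD files.

```python
import unicodedata

def is_assigned(codepoint):
    # Stand-in for a UnicodeData.txt lookup: general category "Cn" means
    # "not assigned" in the Unicode version bundled with this Python.
    return unicodedata.category(chr(codepoint)) != "Cn"

def zero_width_ranges(start, stop, assigned=is_assigned):
    """Collapse assigned code points in [start, stop) into (first, last) runs."""
    runs = []
    for key in range(start, stop):
        if not assigned(key):
            continue  # never emit a width for an unassigned code point
        if runs and runs[-1][1] == key - 1:
            runs[-1][1] = key          # extend the current run
        else:
            runs.append([key, key])    # start a new run
    return [tuple(run) for run in runs]

# Emit charmap-style WIDTH lines for the range discussed in this bug.
for first, last in zero_width_ranges(0xD7B0, 0xD800):
    print(f"<U{first:04X}>...<U{last:04X}> 0")
```

With a Unicode version in which D7C7‥D7CA and D7FC‥D7FF are unassigned, this should produce the two ranges named in the original report; if those code points are assigned later, the runs grow without any change to the loop.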
Looks okay (but now you can use 0xD800 in the range call). This is similar to what I did in my script http://www.mirbsd.org/cvs.cgi/contrib/code/Snippets/eaw2glibc, which postprocesses the width output I normally use into glibc-compatible format. That output comes from the script http://www.mirbsd.org/cvs.cgi/contrib/code/Snippets/eawparse, and http://www.mirbsd.org/cvs.cgi/X11/xc/programs/xterm/wcwidth.c?rev=HEAD contains an example of it. The output I get (for UCD 13.0.0) is identical to yours.
(In reply to Thorsten Glaser from comment #10)
> Looks okay (but now you can use 0xD800 in the range call),

Yes, I could. But if 0xD7FE and 0xD7FF ever get assigned, would they be characters of the same type? I would have to check that manually anyway.

> The output I get (for UCD 13.0.0) is identical to yours.

Great!
According to Blocks.txt, yes. Unicode does assign characters to blocks.
(In reply to Thorsten Glaser from comment #12)
> According to Blocks.txt, yes. Unicode does assign characters to blocks.

D7B0..D7FF; Hangul Jamo Extended-B

I think you are right. I’ll change the script to end the range at the end of that block; that seems more likely to be correct if these characters ever get assigned.
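
As a small illustration of taking the range end from the block definition instead of hard-coding it, the Blocks.txt entry quoted above can be parsed like this. `parse_block_line` is a hypothetical helper, not part of the glibc scripts.

```python
def parse_block_line(line):
    """Parse one Blocks.txt entry such as 'D7B0..D7FF; Hangul Jamo Extended-B'."""
    code_range, name = line.split(";", 1)
    first, last = code_range.strip().split("..")
    return int(first, 16), int(last, 16), name.strip()

first, last, name = parse_block_line("D7B0..D7FF; Hangul Jamo Extended-B")

# range() excludes its stop value, so the inclusive block end needs + 1.
block = range(first, last + 1)
print(name, hex(block.start), hex(block.stop))
```

The stop value comes out as 0xD800, matching the `range(0xD7B0, 0xD800)` form mentioned earlier in the thread.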
Created attachment 12651 [details]
0001-Set-width-of-JUNGSEONG-JONGSEONG-characters-from-UD7.patch

End the range at 0xD7FF
Created attachment 12661 [details]
0001-Set-width-of-JUNGSEONG-JONGSEONG-characters-from-UD7.patch

Use "make install" instead of only changing the UTF-8 file.
Fixed in glibc master.
(In reply to Florian Weimer from comment #6)
> Yes, I think it's here:
>
> http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/uniwidth/width.c;h=c760ad33183418a8f103152ff43d57fabbc3949d;hb=HEAD

I have applied an equivalent change to the uc_width function in gnulib:
https://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=commitdiff;h=8026587b94e4274f3406a36bc89348a24ea86b6a

Experiments with xterm did not convince me, but experiments with gnome-terminal did, since gnome-terminal is widely used and is ahead in terms of Unicode support.

And the fact that Unicode's EastAsianWidth.txt assigns "N" (neutral, not zero width) to these characters is irrelevant, because https://www.unicode.org/reports/tr11/ makes it clear that its focus is on traditional East Asian rendering engines, but such traditional code cannot handle conjoining Hangul Jamo anyway. Here we need to care about the Unicode-compliant rendering engines (such as the one in gnome-terminal), not the legacy rendering engines.