4335 – charmaps/UTF-8: EastAsianAmbiguous character width is always 1

Status:	RESOLVED WONTFIX

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	localedata
Version:	unspecified

Importance:	P2 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

Reported:	2007-04-08 13:18 UTC by VDR dai (bugzilla)
Modified:	2018-04-20 13:54 UTC
CC List:	3 users

See Also:	19852

Flags:	fweimer: security-

Description VDR dai (bugzilla) 2007-04-08 13:18:00 UTC

According to /usr/share/i18n/charmaps/UTF-8.gz, the character width is 1 by
default, and 2 for W (Wide) and F (Fullwidth) characters:

% Character width according to Unicode 3.2.
% - Default width is 1.
% - Double-width characters have width 2; generated from
%        "grep '^[^;]*;[WF]' EastAsianWidth.txt"
%   and  "grep '^[^;]*;[^WF]' EastAsianWidth.txt"
% - Non-spacing characters have width 0; generated from PropList.txt or
%   "grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
% - Format control characters have width 0; generated from
%   "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
% - Zero width characters have width 0; generated from
%   "grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt"

The width of A (Ambiguous) characters is expected to be context-sensitive,
but in this charmap it is always 1, regardless of context.

According to http://www.unicode.org/reports/tr11/#Recommendations:

> When mapping Unicode to East Asian legacy character encodings
> 
>     * Wide Unicode characters always map to fullwidth characters.
>     * Narrow (and neutral) Unicode characters always map to halfwidth characters.
>     * Halfwidth Unicode characters always map to halfwidth characters.
>     * Ambiguous Unicode characters always map to fullwidth characters.

I think the width of EastAsianAmbiguous characters should be 2 in CJK UTF-8 locales.
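
The rule requested here can be written down as a small mapping from East Asian Width class to column count. This is only a sketch of the proposal, not how glibc builds its width table: the enum, the function name columns_for_class() and the cjk_context flag are all made up for illustration, and zero-width characters are left out.

```c
#include <stdbool.h>

/* East Asian Width classes as named in UAX #11 (N, Na, H, W, F, A). */
enum eaw_class {
    EAW_NEUTRAL, EAW_NARROW, EAW_HALFWIDTH,
    EAW_WIDE, EAW_FULLWIDTH, EAW_AMBIGUOUS
};

/* Hypothetical helper: columns occupied by a spacing character of the
   given class. Non-spacing and format characters (width 0) are not
   covered by this sketch. */
static int columns_for_class(enum eaw_class cls, bool cjk_context)
{
    switch (cls) {
    case EAW_WIDE:
    case EAW_FULLWIDTH:
        return 2;                      /* always double width */
    case EAW_AMBIGUOUS:
        return cjk_context ? 2 : 1;    /* the behavior requested in this bug */
    default:
        return 1;                      /* neutral, narrow, halfwidth */
    }
}
```

With the current charmap the result for EAW_AMBIGUOUS is effectively always 1, which corresponds to calling this helper with cjk_context set to false.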

Comment 1 Bruno Haible 2007-06-02 23:43:08 UTC

The "character width" is mostly useful when dealing with cell-based
terminal emulators.

IMO it makes no sense to make such a change in glibc (i.e. to create an
alternative charmap UTF-8-CJK and to build locales like ja_JP.UTF-8 against
it) in isolation. What needs to be considered is the majority of the terminal
emulators; see for example the list at
  http://packages.debian.org/stable/virtual/x-terminal-emulator
If you change the most important among these terminal emulators to choose
their font configuration according to the locale, in such a way that in CJK
locales the Ambiguous Width characters have width 2, and in other locales they
have width 1, _then_ IMO the change also makes sense in glibc.
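
As a rough sketch of what "according to the locale" could mean inside a terminal emulator, the check below looks only at the language part of the locale name. The function name locale_name_is_cjk() is hypothetical, and treating exactly ja, ko and zh as the CJK set is an assumption made for illustration; no terminal emulator is claimed to work this way.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if a locale name such as "ja_JP.UTF-8"
   starts with one of the language codes assumed here to be CJK. */
static bool locale_name_is_cjk(const char *name)
{
    static const char *const langs[] = { "ja", "ko", "zh" };

    if (name == NULL)
        return false;

    for (size_t i = 0; i < sizeof langs / sizeof langs[0]; i++) {
        size_t n = strlen(langs[i]);
        if (strncmp(name, langs[i], n) == 0) {
            /* The language code must be followed by the end of the
               name or by a territory/codeset/modifier separator. */
            char next = name[n];
            if (next == '\0' || next == '_' || next == '.' || next == '@')
                return true;
        }
    }
    return false;
}
```

A caller could pass the string returned by setlocale(LC_CTYPE, NULL), which is the name of the current LC_CTYPE locale.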

Comment 2 VDR dai (bugzilla) 2007-06-10 13:05:06 UTC

I created UTF-8-CJK (EastAsianAmbiguous character width 2) and built ja_JP.UTF-8
against it.
Then I tested the terminal emulators in Debian's x-terminal-emulator list.
The terminal emulators that can handle UTF-8 work well and choose fonts
correctly.
(I left terminal emulators that cannot handle UTF-8 out of consideration.)

works well:

gnome-terminal
konsole
mlterm (mlterm-tiny)
rxvt (rxvt-ml)
rxvt-beta
rxvt-unicode (rxvt-unicode-ml, rxvt-unicode-lite)
tilda
xfce4-terminal
xterm

does not handle UTF-8:

aterm (aterm-ml)
eterm
kterm
mrxvt (mrxvt-cjk, mrxvt-mini)
multi-gnome-terminal
wterm (wterm-ml)

does not handle ja_JP.eucJP:

hanterm-xf
powershell
pterm
terminal.app
xvt

Comment 3 VDR dai (bugzilla) 2007-11-27 16:04:21 UTC

Any progress?
It is still present in glibc 2.7 (Debian).

% /lib/libc.so.6
GNU C Library stable release version 2.7, by Roland McGrath et al.
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.2.3 20071123 (prerelease) (Debian 4.2.2-4).
Compiled on a Linux >>2.6.22.12<< system on 2007-11-26.
Available extensions:
	crypt add-on version 2.1 by Michael Glad and others
	GNU Libidn by Simon Josefsson
	Native POSIX Threads Library by Ulrich Drepper et al
	BIND-8.2.3-T5B
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.

% cat test.c
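#define _XOPEN_SOURCE 700   /* must precede all includes to declare wcwidth() */
#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main( void ) {
  wchar_t i;
  int euc, utf8;   /* wcwidth() returns int */

  for( i = 0x00; i < 0x100; i++ ) {
    /* Ask each locale for the column width of the same code point. */
    setlocale( LC_CTYPE, "ja_JP.eucJP" );
    euc = wcwidth( i );
    setlocale( LC_CTYPE, "ja_JP.UTF-8" );
    utf8 = wcwidth( i );

    /* Report characters that eucJP knows but whose width differs. */
    if( euc > 0 && euc != utf8 ) {
      /* %lc converts the character using the current (UTF-8) locale. */
      fprintf( stdout, "%02x : %d : %d : [%lc]\n",
               (unsigned int)i, euc, utf8, (wint_t)i );
    }
  }

  return 0;
}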

Using default UTF-8 locale:

% ./a.out
a1 : 2 : 1 : [¡]
a2 : 2 : 1 : [¢]
a3 : 2 : 1 : [£]
a4 : 2 : 1 : [¤]
a6 : 2 : 1 : [¦]
a7 : 2 : 1 : [§]
a8 : 2 : 1 : [¨]
a9 : 2 : 1 : [©]
aa : 2 : 1 : [ª]
ac : 2 : 1 : [¬]
ae : 2 : 1 : [®]
af : 2 : 1 : [¯]
b0 : 2 : 1 : [°]
b1 : 2 : 1 : [±]
b4 : 2 : 1 : [´]
b6 : 2 : 1 : [¶]
b8 : 2 : 1 : [¸]
ba : 2 : 1 : [º]
bf : 2 : 1 : [¿]
c0 : 2 : 1 : [À]
c1 : 2 : 1 : [Á]
c2 : 2 : 1 : [Â]
c3 : 2 : 1 : [Ã]
c4 : 2 : 1 : [Ä]
c5 : 2 : 1 : [Å]
c6 : 2 : 1 : [Æ]
c7 : 2 : 1 : [Ç]
c8 : 2 : 1 : [È]
c9 : 2 : 1 : [É]
ca : 2 : 1 : [Ê]
cb : 2 : 1 : [Ë]
cc : 2 : 1 : [Ì]
cd : 2 : 1 : [Í]
ce : 2 : 1 : [Î]
cf : 2 : 1 : [Ï]
d1 : 2 : 1 : [Ñ]
d2 : 2 : 1 : [Ò]
d3 : 2 : 1 : [Ó]
d4 : 2 : 1 : [Ô]
d5 : 2 : 1 : [Õ]
d6 : 2 : 1 : [Ö]
d7 : 2 : 1 : [×]
d8 : 2 : 1 : [Ø]
d9 : 2 : 1 : [Ù]
da : 2 : 1 : [Ú]
db : 2 : 1 : [Û]
dc : 2 : 1 : [Ü]
dd : 2 : 1 : [Ý]
de : 2 : 1 : [Þ]
df : 2 : 1 : [ß]
e0 : 2 : 1 : [à]
e1 : 2 : 1 : [á]
e2 : 2 : 1 : [â]
e3 : 2 : 1 : [ã]
e4 : 2 : 1 : [ä]
e5 : 2 : 1 : [å]
e6 : 2 : 1 : [æ]
e7 : 2 : 1 : [ç]
e8 : 2 : 1 : [è]
e9 : 2 : 1 : [é]
ea : 2 : 1 : [ê]
eb : 2 : 1 : [ë]
ec : 2 : 1 : [ì]
ed : 2 : 1 : [í]
ee : 2 : 1 : [î]
ef : 2 : 1 : [ï]
f0 : 2 : 1 : [ð]
f1 : 2 : 1 : [ñ]
f2 : 2 : 1 : [ò]
f3 : 2 : 1 : [ó]
f4 : 2 : 1 : [ô]
f5 : 2 : 1 : [õ]
f6 : 2 : 1 : [ö]
f7 : 2 : 1 : [÷]
f8 : 2 : 1 : [ø]
f9 : 2 : 1 : [ù]
fa : 2 : 1 : [ú]
fb : 2 : 1 : [û]
fc : 2 : 1 : [ü]
fd : 2 : 1 : [ý]
fe : 2 : 1 : [þ]
ff : 2 : 1 : [ÿ]

Using modified (EastAsianAmbiguous character width == 2,
according to EastAsianWidth-5.0.0.txt) UTF-8 locale:

% ./a.out
a2 : 2 : 1 : [¢]
a3 : 2 : 1 : [£]
a6 : 2 : 1 : [¦]
a9 : 2 : 1 : [©]
ac : 2 : 1 : [¬]
af : 2 : 1 : [¯]
c0 : 2 : 1 : [À]
c1 : 2 : 1 : [Á]
c2 : 2 : 1 : [Â]
c3 : 2 : 1 : [Ã]
c4 : 2 : 1 : [Ä]
c5 : 2 : 1 : [Å]
c7 : 2 : 1 : [Ç]
c8 : 2 : 1 : [È]
c9 : 2 : 1 : [É]
ca : 2 : 1 : [Ê]
cb : 2 : 1 : [Ë]
cc : 2 : 1 : [Ì]
cd : 2 : 1 : [Í]
ce : 2 : 1 : [Î]
cf : 2 : 1 : [Ï]
d1 : 2 : 1 : [Ñ]
d2 : 2 : 1 : [Ò]
d3 : 2 : 1 : [Ó]
d4 : 2 : 1 : [Ô]
d5 : 2 : 1 : [Õ]
d6 : 2 : 1 : [Ö]
d9 : 2 : 1 : [Ù]
da : 2 : 1 : [Ú]
db : 2 : 1 : [Û]
dc : 2 : 1 : [Ü]
dd : 2 : 1 : [Ý]
e2 : 2 : 1 : [â]
e3 : 2 : 1 : [ã]
e4 : 2 : 1 : [ä]
e5 : 2 : 1 : [å]
e7 : 2 : 1 : [ç]
eb : 2 : 1 : [ë]
ee : 2 : 1 : [î]
ef : 2 : 1 : [ï]
f1 : 2 : 1 : [ñ]
f4 : 2 : 1 : [ô]
f5 : 2 : 1 : [õ]
f6 : 2 : 1 : [ö]
fb : 2 : 1 : [û]
fd : 2 : 1 : [ý]
ff : 2 : 1 : [ÿ]

% diff -u utf8-cjk-default utf8-cjk-modified
--- utf8-cjk-default	2007-11-28 01:03:07.000000000 +0900
+++ utf8-cjk-modified	2007-11-28 01:02:55.000000000 +0900
@@ -1,29 +1,15 @@
-a1 : 2 : 1 : [¡]
 a2 : 2 : 1 : [¢]
 a3 : 2 : 1 : [£]
-a4 : 2 : 1 : [¤]
 a6 : 2 : 1 : [¦]
-a7 : 2 : 1 : [§]
-a8 : 2 : 1 : [¨]
 a9 : 2 : 1 : [©]
-aa : 2 : 1 : [ª]
 ac : 2 : 1 : [¬]
-ae : 2 : 1 : [®]
 af : 2 : 1 : [¯]
-b0 : 2 : 1 : [°]
-b1 : 2 : 1 : [±]
-b4 : 2 : 1 : [´]
-b6 : 2 : 1 : [¶]
-b8 : 2 : 1 : [¸]
-ba : 2 : 1 : [º]
-bf : 2 : 1 : [¿]
 c0 : 2 : 1 : [À]
 c1 : 2 : 1 : [Á]
 c2 : 2 : 1 : [Â]
 c3 : 2 : 1 : [Ã]
 c4 : 2 : 1 : [Ä]
 c5 : 2 : 1 : [Å]
-c6 : 2 : 1 : [Æ]
 c7 : 2 : 1 : [Ç]
 c8 : 2 : 1 : [È]
 c9 : 2 : 1 : [É]
@@ -39,44 +25,23 @@
 d4 : 2 : 1 : [Ô]
 d5 : 2 : 1 : [Õ]
 d6 : 2 : 1 : [Ö]
-d7 : 2 : 1 : [×]
-d8 : 2 : 1 : [Ø]
 d9 : 2 : 1 : [Ù]
 da : 2 : 1 : [Ú]
 db : 2 : 1 : [Û]
 dc : 2 : 1 : [Ü]
 dd : 2 : 1 : [Ý]
-de : 2 : 1 : [Þ]
-df : 2 : 1 : [ß]
-e0 : 2 : 1 : [à]
-e1 : 2 : 1 : [á]
 e2 : 2 : 1 : [â]
 e3 : 2 : 1 : [ã]
 e4 : 2 : 1 : [ä]
 e5 : 2 : 1 : [å]
-e6 : 2 : 1 : [æ]
 e7 : 2 : 1 : [ç]
-e8 : 2 : 1 : [è]
-e9 : 2 : 1 : [é]
-ea : 2 : 1 : [ê]
 eb : 2 : 1 : [ë]
-ec : 2 : 1 : [ì]
-ed : 2 : 1 : [í]
 ee : 2 : 1 : [î]
 ef : 2 : 1 : [ï]
-f0 : 2 : 1 : [ð]
 f1 : 2 : 1 : [ñ]
-f2 : 2 : 1 : [ò]
-f3 : 2 : 1 : [ó]
 f4 : 2 : 1 : [ô]
 f5 : 2 : 1 : [õ]
 f6 : 2 : 1 : [ö]
-f7 : 2 : 1 : [÷]
-f8 : 2 : 1 : [ø]
-f9 : 2 : 1 : [ù]
-fa : 2 : 1 : [ú]
 fb : 2 : 1 : [û]
-fc : 2 : 1 : [ü]
 fd : 2 : 1 : [ý]
-fe : 2 : 1 : [þ]
 ff : 2 : 1 : [ÿ]

Comment 4 VDR dai (bugzilla) 2008-11-25 17:27:41 UTC

Here is the rxvt-unicode author's opinion:

http://lists.schmorp.de/pipermail/rxvt-unicode/2007q1/000402.html

> > > > ja_JP.eucJP locale is fixed by src/rxvt.h r1.265.
> > > > But ja_JP.UTF-8 locale is still weird.
> > >
> > > No, it's correct, that's what the locale specified.
> >
> > Do you mean that ja_JP.UTF-8 locale specifies
> > "0xd7" (EastAsianAmbiguous) is HALFWIDTH and
> > rxvt-unicode simply respects it?
> 
> Basically, yes. At least that is how it *should* be: urxvt always respects
> your locale, as should all other programs do too. If your locale says
> something and urxvt doesn't follow that, that is considered a bug in
> urxvt.
> 
> > > > Do you plan to merge doc/solaris9.patch?
> > >
> > > No, that's an ugly hack around Solaris being broken.
> >
> > Uh, I mean mk_wcwidth() that is a part of doc/solaris9.patch.
> > mk_wcwidth() variant with configurable option is imported into vim,
> > xterm and so on.
> 
> Yes, they are all buggy as long as they use that.
> 
> > Yes, rxvt-unicode respects what the locale says.
> > But vim, xterm, etc. have options that give EastAsianAmbiguous
> > special treatment, so that the EastAsianAmbiguous character width is 2:
> > vim has the ambiwidth=double option, xterm has the -cjk_width option.
> 
> Yes, I know. But it's stupid to add such hacks to each and every program
> and force the user to enable them. The right way is to use or modify the
> locale, then suddenly all well-written programs with or without such hacks
> just magically work.
> 
> Ignoring the locale is just wrong. It leads to interoperability
> problems between programs that simply wouldn't exist if everybody just
> respected the locale instead of relying on their own hacks.
> 
> The only justification for adding hacks is for systems that do not support
> required locales (such as one providing utf-8), but those systems either
> die or get upgraded, so the time is much better spent improving the locale
> system on those rare systems rather than adding hacks to each and every
> program.
> 
> > Do you mean locale is wrong/broken then programs do not need to
> 
> If the locale specifies a character width that you do not want, then the
> locale is pretty much broken from your perspective, isn't it? At least it's
> not the locale you want.
> 
> > Do I need to ask not rxvt-unicode but glibc?
> 
> I think glibc (or any software distribution either using it or something
> else) should provide the means to configure it regarding such details such
> as character width, at least for commonly wanted cases such as east asian
> widths.
> 
> I am open to reasoning against my arguments, but to change my mind one
> would have to overcome the arguments above. It just plain makes no
> sense to hack each and every program in the world to work around locale
> limitations: there are far more editors and terminals around than libcs.

Comment 5 VDR dai (bugzilla) 2009-02-28 07:38:34 UTC

At present, each application has to implement its own approach
to the EastAsianAmbiguous character width issue,
for example its own function or Markus Kuhn's wcwidth()
(http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c).

If glibc's current wcwidth() implementation and locale definitions
cannot be extended, could glibc at least offer a common method
for this issue?
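
To make the per-application workaround concrete, here is a rough sketch in the same spirit as the mk_wcwidth_cjk() variant in Markus Kuhn's file: take the width the locale reports and widen it when the character is in an "ambiguous" table. The names width_with_ambiguous(), in_ranges() and ambiguous_subset are invented for this example, and the table is only the set of Latin-1 code points whose width changed in the diff in comment 3, not the full Ambiguous set from EastAsianWidth.txt.

```c
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>

struct range { wchar_t first, last; };

/* Illustrative subset only: code points below U+0100 that the diff in
   comment 3 shows changing width. Must stay sorted for the search. */
static const struct range ambiguous_subset[] = {
    {0x00A1, 0x00A1}, {0x00A4, 0x00A4}, {0x00A7, 0x00A8}, {0x00AA, 0x00AA},
    {0x00AE, 0x00AE}, {0x00B0, 0x00B1}, {0x00B4, 0x00B4}, {0x00B6, 0x00B6},
    {0x00B8, 0x00B8}, {0x00BA, 0x00BA}, {0x00BF, 0x00BF}, {0x00C6, 0x00C6},
    {0x00D7, 0x00D8}, {0x00DE, 0x00E1}, {0x00E6, 0x00E6}, {0x00E8, 0x00EA},
    {0x00EC, 0x00ED}, {0x00F0, 0x00F0}, {0x00F2, 0x00F3}, {0x00F7, 0x00FA},
    {0x00FC, 0x00FC}, {0x00FE, 0x00FE},
};

/* Binary search for wc in a sorted table of inclusive ranges. */
static bool in_ranges(wchar_t wc, const struct range *tab, size_t n)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (wc < tab[mid].first)
            hi = mid;
        else if (wc > tab[mid].last)
            lo = mid + 1;
        else
            return true;
    }
    return false;
}

/* Hypothetical wrapper: locale_width is whatever wcwidth() returned.
   Only a width of 1 is widened; 0, 2 and -1 pass through unchanged. */
static int width_with_ambiguous(wchar_t wc, int locale_width,
                                bool ambiguous_is_wide)
{
    size_t n = sizeof ambiguous_subset / sizeof ambiguous_subset[0];
    if (locale_width == 1 && ambiguous_is_wide &&
        in_ranges(wc, ambiguous_subset, n))
        return 2;
    return locale_width;
}
```

A caller would use it as width_with_ambiguous(wc, wcwidth(wc), flag), where the flag plays the role of options like vim's ambiwidth=double. Every program carrying its own table and flag like this is exactly the duplication this bug asks glibc to remove.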

Comment 7 Joseph Myers 2017-08-28 16:39:58 UTC

Restoring changes lost in system crash and restore from backup.

https://sourceware.org/ml/glibc-bugs/2017-08/msg00369.html