[Fwd: [1.7] wcwidth failing configure tests]

Thomas Wolff towo@towo.net
Fri Jun 12 18:56:00 GMT 2009


IWAMURO Motonori wrote to me by private mail:
> I oppose your proposal because I think that it is useless for us.
> 
> 2009/6/6 Thomas Wolff <towo@towo.net>:
>> the intention is that the "codepage" information should be the same
>> for all locales having the "UTF-8" (or any other) charmap.  So you
>> cannot freely change width information among locales with the same
>> charmap.
> 
> I don't think that there is such a restriction.
> The standard of the character doesn't provide for the width of the
> character as a standard.
I'm not sure which "standard" you are referring to.
I have checked source data files in /usr/share/i18n/charmaps on my Linux system, e.g. "UTF-8.gz".
These files are used when creating a new locale with the "localedef" command.
They contain not only the mapping but also (at the end of the file) a 
list of combining and double-width characters. So this supports my point 
even more strongly than I had argued: each such "charmap" defines its 
own set of character widths, which implies that character widths are 
the same for all locales with the same "charmap".
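
To illustrate what such a width list looks like to a program, here is a
minimal Python sketch that reads a WIDTH section in the style of those
charmap files. The sample entries are made up for illustration and not
copied from a real "UTF-8.gz"; the function names are hypothetical, and
the default width of 1 for unlisted characters is my assumption.

```python
import re

# Illustrative excerpt in the style of a charmap WIDTH section.
# These entries are examples only, not copied from an actual file.
SAMPLE = """\
WIDTH
<U0300>...<U036F> 0
<U1100>...<U115F> 2
<U3000> 2
END WIDTH
"""

# One entry: a single code point or a range, followed by the width.
ENTRY = re.compile(
    r"<U([0-9A-Fa-f]+)>(?:\.{2,3}<U([0-9A-Fa-f]+)>)?\s+(\d+)"
)

def parse_width_section(text):
    """Return a list of (first, last, width) tuples from a WIDTH section."""
    ranges = []
    inside = False
    for line in text.splitlines():
        line = line.strip()
        if line == "WIDTH":
            inside = True
        elif line == "END WIDTH":
            inside = False
        elif inside:
            m = ENTRY.match(line)
            if m:
                first = int(m.group(1), 16)
                last = int(m.group(2), 16) if m.group(2) else first
                ranges.append((first, last, int(m.group(3))))
    return ranges

def charmap_width(ranges, codepoint, default=1):
    """Look up a code point; unlisted characters get the assumed default."""
    for first, last, width in ranges:
        if first <= codepoint <= last:
            return width
    return default
```

Since the table is attached to the charmap and not to the language part
of the locale name, every locale built on the same charmap would see
the same lookup result.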

>> Also, if ja_JP.UTF-8 would mean "CJK width", how would you specify a
>> working locale setting for a terminal that does not run a CJK width
>> font but should yet use other Japanese settings? E.g. with rxvt
>> which does not support CJK width.
> 
> Oh, we ALWAYS have a hard time in this problem VERY VERY VERY much.
> 
> case1: We use only the application that treats the width of the
> character without locale.
No problem.
> case2: We make the patch that solves the character width problem, and
> throw it out up-stream.
Yes, you should go ahead "up-stream", whatever that means in the case of locales.
> case3: We make the patch, and apply it locally.
No, bad idea. All locale-dogmatic people (I'm not one, just warning) 
will bash you for this. What is the situation after remote login? The 
remote system will assume that its own locale setting (e.g. 
"ja_JP.UTF-8") properly describes the actual behaviour of its 
environment, which is no longer true once a solution has been applied 
only locally.
> case4: We tearfully give up the correct display of the screen.
> case5: We tearfully give up using the application.
> I selected case5 for rxvt.
> 
No reason to give up.
The approach I've taken in mined is quite successful. The other 
approach, via locale names, will also have limited success provided it 
is taken "up-stream".

>> Thus you could define e.g.
>>        ja_JP.UTF-8@cjk
>> or
>>        ja_JP.UTF-8@cjkwidth
>> to indicate CJK width properties. I guess this is the most compliant way to go.
> 
> I don't think that it is the good idea because:
> 
> - It is "a cygwin-specific solution (or workaround)".
Apparently we agree that the solution should not be cygwin-specific 
but should be established "up-stream". The question is thus which of the 
discussed mechanisms has a better chance of being accepted up-stream:
- ja_JP.UTF-8 meaning different width data than en_US.UTF-8
or
- ja_JP.UTF-8@cjkwidth meaning different width data than ja_JP.UTF-8

My assumption is that the second proposal (that I made) has a better 
chance, given the existing paradigms of the locale community. But that's 
speculative. If you think you can get your proposal passed "up-stream", 
go ahead and try it, please! If you succeed, everything is fine.

> - In NetBSD, the change to which wcwidth of East Asian Ambiguous Characters returns 2 by CJK locale is planned.
So the same issue (of compliance and portability, especially in the 
remote case) should be discussed in the NetBSD community.
(Is there a suitable forum or mailing list to check?)

> - and, I don't think that I need make special cases give priority more
> than general cases.

> >> - I heard that there is an existing implementation that behave like my
> >> proposal. (Sorry, I didn't hear the system name.)
> > Even if so, I think the way I described is more compatible with the locale
> > mechanism as used elsewhere.

> I think that ALL locale implementations should treat East Asian
> Ambiguous Character Width as 2 for CJK locale.
Again, I agree that IF you manage to get ALL implementations to follow 
this approach, the solution is fine. Please go ahead.


> >> It is no problem because we -- most Japanese language users -- need
> >> not change the settings of mintty and locale after first setup.
> >> We set LANG=ja_JP.UTF-8 and select a Japanese font for mintty.
> > In any case, mined running in mintty will detect CJK width itself,
> > regardless of locale setting, with coming versions of both programs
> > even when it gets changed on-the-fly :)
> Sorry, I can't understand above because I am not good at English.
Well, even if your proposal were finally implemented, MinTTY would 
still be able to choose different fonts and, depending on which font is 
selected, run in locale-width-compliant or width-breaking mode.
* My solution could be tweaked to handle this.
* Auto-detection (of mined) can handle it already.
* Your solution could probably not handle it.



> I don't think so. I think that we should consider the following issues
> if a new mechanism is introduced.

> The existing locale / terminal API don't support:
> - Unicode BiDi.
> - Unicode control characters.
> - Unicode combining characters.
> - Multilingualization. (*)
> - Detect font/fontset information selected with terminal emulator.
> (including, need to consider the case of no-tty)
Not sure what you intend to say with these remarks. Locale and 
terminal APIs are actually two different things. And the locale API 
can, for example, handle combining characters (wcwidth returns 0 for them).
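
As a rough illustration of the width classes involved, here is a small
Python sketch using the standard unicodedata module. It is only an
approximation under my own assumptions, not the system wcwidth: it
treats characters with a non-zero combining class as width 0 (which
misses some other zero-width characters), and the "cjk" switch stands
for the disputed choice of how to treat East Asian Ambiguous
characters. The function name is hypothetical.

```python
import unicodedata

def cell_width(ch, cjk=False):
    """Approximate the number of terminal cells for one character.

    This is a simplified stand-in for wcwidth, for illustration only.
    """
    # Combining marks occupy no cell of their own.
    if unicodedata.combining(ch) != 0:
        return 0
    eaw = unicodedata.east_asian_width(ch)
    # Wide and Fullwidth characters always take two cells.
    if eaw in ("W", "F"):
        return 2
    # Ambiguous characters take two cells only under "CJK width".
    if eaw == "A" and cjk:
        return 2
    return 1

# U+00B1 PLUS-MINUS SIGN is East Asian Ambiguous, so its width
# depends on the switch; U+3042 HIRAGANA LETTER A is always wide.
for ch in ("A", "\u00b1", "\u3042", "\u0301"):
    print(hex(ord(ch)), cell_width(ch), cell_width(ch, cjk=True))
```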



> * Now, we can't use Japanese, Chinese, and Korean at the same time
> even if we use Unicode.
>   Because many font glyphs are quite different even if the code point
> is the same in each language.
This is a completely different issue and it should be easy to solve it 
by simply choosing an appropriate font.



> > With my proposal, an application that wishes to auto-adjust on width
> > properties (maybe even when changing) and which (unlike mined) uses
> > the system wcwidth functions could proceed as follows:
> > * Detect CJK width by using a simple test string width detection.
> > * (Optional) When receiving a SIGWINCH signal (future version of MinTTY),
> >  repeat this detection.
> > * If e.g. LC_CTYPE starts with "ja_JP.UTF-8", call setlocale with
> >  either "ja_JP.UTF-8@cjkwidth" or "ja_JP.UTF-8".
> How to detect it? The application using wcwidth is not necessarily
> executed with terminal emulator. (e.g. text formatter)
OK, my arguments refer to an interactive application that wants to 
control the precise representation of text on the screen.
If for example a text formatter formats for paper printing, it would 
need to apply completely different assumptions anyway. The dreadful 
single/double width issue of cell-based terminals isn't relevant at 
all in that case.
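
For the interactive case, the steps quoted above could be sketched
roughly as follows in Python. Note the assumptions: the "@cjkwidth"
modifier is only the proposal from this thread and does not exist as a
real locale, so the sketch only computes the locale name and leaves the
actual setlocale call out; the detection is assumed to work by writing
one East Asian Ambiguous character and comparing the cursor column
before and after, using the terminal's cursor position report
(ESC [ row ; col R). All function names are hypothetical.

```python
import re

# Cursor position report as sent by the terminal in reply to ESC [ 6 n.
CPR = re.compile(r"\x1b\[(\d+);(\d+)R")

def parse_cursor_report(reply):
    """Return (row, col) from a cursor position report, or None."""
    m = CPR.fullmatch(reply)
    if not m:
        return None
    return int(m.group(1)), int(m.group(2))

def ambiguous_is_wide(col_before, col_after):
    """True if writing one ambiguous character advanced the cursor by 2."""
    return col_after - col_before == 2

def pick_ctype_locale(current, wide, modifier="cjkwidth"):
    """Choose the locale name to pass to setlocale for LC_CTYPE.

    Only names starting with "ja_JP.UTF-8" are adjusted, as in the
    example from this thread; anything else is returned unchanged.
    """
    base = current.split("@", 1)[0]
    if not base.startswith("ja_JP.UTF-8"):
        return current
    return f"{base}@{modifier}" if wide else base
```

An application would repeat the detection when it receives SIGWINCH and
then call setlocale with the resulting name, provided such a locale is
actually installed.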



> >> > I'm not happy with the idea of a cygwin-specific solution (or workaround).
> >> I think that it is not cygwin-specific solution.
> > As I tried to suggest above, using "UTF-8" for different width data on one
> > system would be quite specific, using the "@" modifier syntax would not.
> "UTF-8" is only an encoding scheme. It does not specify the character width.
OK, we had this argument above, and neither of us was quite right before.
The essence is that whatever you get established up-stream may turn out 
to be a working solution, so I would appreciate it if you went ahead and 
persuaded some "up-stream" people...


Best regards,
Thomas
