[Fwd: [1.7] wcwidth failing configure tests]

Wed May 20 16:52:00 GMT 2009

Corinna Vinschen wrote:

> On May 12 17:56, Andy Koppe wrote:
> > > And here's another question.  The utf8*.h files claim they have been
> > > generated from the unicode.txt file of the Unicode 3.2 standard.  Do we
> > > have the script which generated the utf8*.h files?  Can we regenerate
> > > the files to match the current Unicode 5.1 standard?
I've already updated my editor mined to Unicode 5.1 data. I can provide 
a corresponding wcwidth function if that's desired. I also have scripts 
for semi-automatic generation of this information; they are only "semi" 
automatic, though, and still need to be improved.

> > There's Markus Kuhn's wcwidth implementation, which says it's based on
> > Unicode 5.0:
> > 
> > http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
> 
> This looks nice.
I'm sure Markus will update to 5.1 one day too...

> > Trouble is, there's the thorny issue of the "CJK Ambiguous Width"
> > category of characters, which consists of things like Greek and
> > Cyrillic letters as well as line drawing symbols. Those have a width
> > of 1 in Western use, yet with CJK fonts they have a width of 2. That's
> > why Markus Kuhn's code includes the mk_wcswidth_cjk() variant.
> 
> We should use the standard variation alone, imho.
> 
> And we need some workaround for UTF-16 systems like Cygwin.
> Unfortunately, surrogate pairs only work well as part of a string, not
> as standalone chars.  So wcwidth would return -1 for each single char,
> but wcswidth could be tweaked to handle them gracefully.
This brings me to the related question of how to output non-BMP characters.
Currently, the cygwin console displays each of them as two square boxes, 
using two screen columns. This suggests that the two surrogates are 
probably being output as separate characters.
Could proper non-BMP character display be achieved by simply combining 
the surrogates and outputting them to Windows as a true Unicode character?
(The Windows function would need to accept 32-bit characters, and I don't 
know whether one exists; the string elements could stay as they are.)
Just an idea which might lead to a simple solution.
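
To make the "combining the surrogates" step concrete, here is a minimal 
sketch in C. The function names are hypothetical (they are not newlib, 
Cygwin or Windows functions), and it only shows the arithmetic and how a 
string-level function such as wcswidth could walk a UTF-16 string; how the 
resulting code point would be handed to Windows is left open.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only. */

int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combine a valid surrogate pair into one code point
   in the range U+10000..U+10FFFF. */
uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    return 0x10000u
         + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}

/* Return the next code point of the UTF-16 string s (length n),
   starting at index *i, and advance *i past it.  An unpaired
   surrogate is returned unchanged so that the caller can treat
   it as invalid (e.g. width -1). */
uint32_t next_code_point(const uint16_t *s, size_t n, size_t *i)
{
    uint16_t u = s[(*i)++];

    if (is_high_surrogate(u) && *i < n && is_low_surrogate(s[*i]))
        return combine_surrogates(u, s[(*i)++]);
    return u;
}
```

A wcswidth-like function could loop over next_code_point() and look up the 
width of each full code point, while wcwidth on a single surrogate would 
still have to return -1.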

> On May 15 00:58, IWAMURO Motonori wrote:
> > 2009/5/13 Corinna Vinschen <vinschen@redhat.com>:
> > >> Trouble is, there's the thorny issue of the "CJK Ambiguous Width"
> > >> ... (see above)
> > > We should use the standard variation alone, imho.
> > I don't think so.
> > 
> > 1) It is very very inconvenient for me :-)
> > 
> > 2) Unicode Standard Annex #11
> > http://www.unicode.org/unicode/reports/tr11/ recommends:
> > > 5 Recommendations
> > (snip)
> > > When processing or displaying data
> > (snip)
> > > Ambiguous characters behave like wide or narrow characters depending
> > > on the context (language tag, script identification, associated
> > > font, source of data, or explicit markup; all can provide the
> > > context). If the context cannot be established reliably, they should
> > > be treated as narrow characters by default.
> > 
> > The recommendation is independent of legacy encoding.
> > 
> > I think that a new locale category that specifies the "context" is necessary.
> > Because the "context" influences only the display or text layout.
> > 
> > However, there is no such standard now.
> > 
> > Therefore, I propose to use *_cjk() when the language part of LC_CTYPE
> > is 'ja', 'ko', 'vi' or 'zh'.
The problems with this are:
1. As you say, there is no standard.
2. If you want to handle character widths in compliance with the terminal 
   your application is running in, there is no guarantee that your 
   assumption of CJK width (or the actual locale setting, if that model 
   were implemented) reflects the terminal's width properties.
3. In mintty, you can dynamically change width properties by selecting 
   a different font; mintty changes its CJK width behaviour according to 
   certain font properties. A "static" configuration in your shell using 
   a locale variable would not reflect this change.
   I see two ways to handle this:
   a) Ask Andy (author of mintty) not to do this switching; however, 
      I don't know what display consequences that might have. On the 
      other hand, other terminals don't switch either. Or maybe mintty 
      could at least issue a warning on CJK width switching, or 
      maintain two separate font lists, or...
   b) Determine the actual CJK width behaviour dynamically. That's what 
      mined does (in addition to its general detection of other width 
      properties). That's why it can handle either alternative quite 
      seamlessly.
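
One possible way to do such dynamic detection, sketched here in C under 
assumptions of my own (this is not a copy of mined's code): ask the 
terminal for a cursor position report (send ESC [ 6 n, receive 
ESC [ row ; col R) before and after printing one ambiguous-width 
character, and compare the columns. The function names are hypothetical, 
and the terminal I/O itself (raw mode, writing the probe, reading the 
replies) is omitted; only the evaluation of the two replies is shown.

```c
#include <stdio.h>

/* Parse a cursor position report of the form "ESC [ row ; col R".
   Returns 1 on success, 0 if the reply is malformed. */
int parse_cpr(const char *reply, int *row, int *col)
{
    int end = 0;

    if (sscanf(reply, "\033[%d;%d%n", row, col, &end) != 2)
        return 0;
    return reply[end] == 'R';
}

/* Number of columns the terminal advanced between two reports,
   i.e. the width it used for whatever was printed in between.
   Returns -1 if a reply is malformed or the cursor changed line
   or moved backwards, in which case the probe is unusable. */
int measured_width(const char *before, const char *after)
{
    int r0, c0, r1, c1;

    if (!parse_cpr(before, &r0, &c0) || !parse_cpr(after, &r1, &c1))
        return -1;
    if (r0 != r1 || c1 < c0)
        return -1;
    return c1 - c0;
}
```

A result of 2 for an ambiguous-width probe character would indicate that 
the terminal is currently in CJK width mode, a result of 1 that it is not.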

> That would be fine with me, but tests for the actual language are not
> used anywhere in newlib, so that's something very new.
So I would suggest not introducing it before the concept has been 
sufficiently discussed.
And I'm not happy with the idea of a cygwin-specific solution (or workaround).
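
For the sake of that discussion, the language test proposed above would 
amount to roughly the following. This is a sketch only; the function name 
is hypothetical and nothing like it exists in newlib today.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if the language part of a locale
   name such as "ja_JP.UTF-8" is one of ja, ko, vi or zh, else 0. */
int locale_lang_is_cjk(const char *locale)
{
    static const char *const langs[] = { "ja", "ko", "vi", "zh" };
    size_t i;

    if (locale == NULL)
        return 0;
    for (i = 0; i < sizeof langs / sizeof langs[0]; i++) {
        if (strncmp(locale, langs[i], 2) == 0) {
            /* The language code must end here: "zh", "zh_CN",
               "zh.UTF-8" and "zh@mod" match, "zhx" does not. */
            char c = locale[2];
            if (c == '\0' || c == '_' || c == '.' || c == '@')
                return 1;
        }
    }
    return 0;
}
```

An application would pass the result of setlocale(LC_CTYPE, NULL) and then 
choose between the plain and the *_cjk() width functions, with all the 
caveats listed above.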

Kind regards,
Thomas
