This is the mail archive of the cygwin mailing list for the Cygwin project.
[Fwd: [1.7] wcwidth failing configure tests]
- From: Corinna Vinschen <corinna-cygwin at cygwin dot com>
- To: newlib at sourceware dot org
- Cc: cygwin at cygwin dot com
- Date: Tue, 12 May 2009 18:54:04 +0200
- Subject: [Fwd: [1.7] wcwidth failing configure tests]
- Reply-to: newlib at sourceware dot org
Forwarded to newlib.
----- Forwarded message from Eric Blake -----
> Date: Tue, 12 May 2009 16:02:04 +0000 (UTC)
> From: Eric Blake
> Subject: [1.7] wcwidth failing configure tests
> To: cygwin AT cygwin DOT com
>
> I noticed this failure in various configure scripts (findutils, coreutils, ...):
>
> checking whether wcwidth works reasonably in UTF-8 locales... no
>
> I've reduced it to a STC:
>
> #include <locale.h>
> #include <wchar.h>
> int main ()
> {
>   int i = 0;
>   if (setlocale (LC_ALL, "fr_FR.UTF-8") != NULL)
>     {
>       if (wcwidth (0x0301) > 0)
>         i |= 1;
>       if (wcwidth (0x200B) > 0)
>         i |= 2;
>     }
>   return i;
> }
>
> The return value should be 0 but is coming back as 3; 0x0301 is a combining
> mark which should occupy no space on its own, and 0x200b is a 0-width space,
> according to Unicode 5.1 (and earlier, to some extent). And that probably
> means that other places within wcwidth() are broken.
----- End forwarded message -----
wcwidth returns 1 if iswprint returns true. A quick debugging session
showed that the entire range 0x0300..0x034f is marked as printable in
the u3 array in libc/ctype/utf8print.h. The characters in that range
are all combining characters, which are printable but have zero width.
Likewise, 0x200b..0x200d are all three zero-width characters, and all
three are also printable.
Scanning the Unicode 5.1 standard, I see several ranges of such
characters, which are printable but have zero width:
0300..036f
0483..0489
200b..200f
20d0..20ea
3099..309a
fe20..fe23 (not sure about these; each of them is one half of a full
combined char and doesn't make sense on its own, afaics)
feff
and a couple of musical symbols in the 0x1d1xx range
How can we fix this problem? Should we hardcode a check for the above
character values in wcwidth?
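To make the hardcoding idea concrete, here is a minimal sketch of what such a check could look like. It is not the newlib implementation: the names zw_range, zero_width, is_zero_width and sketch_wcwidth are made up for illustration, the table contains only the ranges listed above (the musical symbols are left out because no exact bounds are given), and double-width characters are ignored entirely.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical table of the zero-width ranges listed in this mail.
   It must stay sorted and non-overlapping for the binary search. */
struct zw_range { unsigned long first, last; };

static const struct zw_range zero_width[] = {
  { 0x0300, 0x036f },
  { 0x0483, 0x0489 },
  { 0x200b, 0x200f },
  { 0x20d0, 0x20ea },
  { 0x3099, 0x309a },
  { 0xfe20, 0xfe23 },
  { 0xfeff, 0xfeff },
};

/* Return 1 if c falls into one of the ranges above, else 0. */
static int
is_zero_width (unsigned long c)
{
  size_t lo = 0;
  size_t hi = sizeof zero_width / sizeof zero_width[0];

  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;
      if (c < zero_width[mid].first)
        hi = mid;
      else if (c > zero_width[mid].last)
        lo = mid + 1;
      else
        return 1;
    }
  return 0;
}

/* Illustrative stand-in for wcwidth: consult the zero-width table
   first, then fall back to the "1 if iswprint" rule described above. */
int
sketch_wcwidth (wchar_t wc)
{
  if (wc == L'\0')
    return 0;
  if (is_zero_width ((unsigned long) wc))
    return 0;
  return iswprint ((wint_t) wc) ? 1 : -1;
}
```

With a check like this in place, both 0x0301 and 0x200b from Eric's test case would report a width of 0 without touching the iswprint tables.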
And here's another question. The utf8*.h files claim they have been
generated from the unicode.txt file of the Unicode 3.2 standard. Do we
have the script which generated the utf8*.h files? Can we regenerate
the files to match the current Unicode 5.1 standard?
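If the original generator script cannot be found, a replacement would mostly have to parse the Unicode data file. The following is only a rough sketch under two assumptions that are mine, not taken from the utf8*.h files: that the input uses the semicolon-separated layout of UnicodeData.txt (code point, name, general category, ...), and that the general categories Mn, Me and Cf are a reasonable first approximation of "zero width". The function names parse_ucd_line and category_is_zero_width are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Parse one line in UnicodeData.txt layout, e.g.
     0301;COMBINING ACUTE ACCENT;Mn;230;NSM;...
   Stores the code point in *cp and the two-letter general category
   in cat.  Returns 1 on success, 0 if the line is malformed. */
int
parse_ucd_line (const char *line, unsigned long *cp, char cat[3])
{
  char *end;
  const char *p;

  *cp = strtoul (line, &end, 16);      /* field 0: code point in hex */
  if (end == line || *end != ';')
    return 0;
  p = strchr (end + 1, ';');           /* skip field 1: character name */
  if (p == NULL || strlen (p + 1) < 2)
    return 0;
  cat[0] = p[1];                       /* field 2: general category */
  cat[1] = p[2];
  cat[2] = '\0';
  return 1;
}

/* Assumption: nonspacing marks (Mn), enclosing marks (Me) and format
   characters (Cf) are treated as zero width. */
int
category_is_zero_width (const char *cat)
{
  return strcmp (cat, "Mn") == 0
         || strcmp (cat, "Me") == 0
         || strcmp (cat, "Cf") == 0;
}
```

A generator built on these two helpers could walk the data file line by line and emit the zero-width ranges as a table, so the next Unicode update would only require rerunning it.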
Corinna
--
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat