This is the mail archive of the cygwin mailing list for the Cygwin project.
Re: 16-bit wchar_t on Windows and Cygwin
- From: Corinna Vinschen <corinna-cygwin at cygwin dot com>
- To: cygwin at cygwin dot com, bug-gnulib at gnu dot org, bug-coreutils at gnu dot org
- Date: Wed, 2 Feb 2011 13:14:42 +0100
- Subject: Re: 16-bit wchar_t on Windows and Cygwin
- References: <201101310304.42975.bruno@clisp.org> <4D46EA2B.1010307@redhat.com> <201102021229.04623.bruno@clisp.org>
- Reply-to: cygwin at cygwin dot com, bug-gnulib at gnu dot org, bug-coreutils at gnu dot org
On Feb 2 12:29, Bruno Haible wrote:
> Hello Eric,
>
> > ... POSIX requires that 1 wchar_t corresponds to 1 character
> > ...
> > > What consequences does this have?
> > >
> > > 1) All code that uses the functions from <wctype.h> (wide character
> > > classification and mapping) or wcwidth() malfunctions on strings that
> > > contain Unicode characters outside the BMP, i.e. outside the range
> > > U+0000..U+FFFF.
> >
> > Not necessarily. Such code falls outside of POSIX, but it may still be
> > a well-behaved extension if given sane behavior for how to deal with
> > surrogates.
>
> No. Code that uses <wctype.h> and wcwidth() is written precisely according
> to POSIX. The problem is that this code cannot work correctly when wchar_t[]
> is in UTF-16 encoding. There simply is no way to define these functions
> in a reasonable way for surrogates.
>
> For example:
> U+1031E = 0xD800 0xDF1E is a letter (iswalpha should be true)
> U+10320 = 0xD800 0xDF20 is not a letter (iswalpha should be false)
> U+1D31E = 0xD834 0xDF1E is not a letter (iswalpha should be false)
> U+1D320 = 0xD834 0xDF20 is not a letter (iswalpha should be false)
> U+1D71E = 0xD835 0xDF1E is a letter (iswalpha should be true)
> U+1D720 = 0xD835 0xDF20 is a letter (iswalpha should be true)
> There is no way that a system can provide this information through a
> function 'iswalpha' that takes a single wchar_t argument.
iswalpha takes wint_t, not wchar_t. Since sizeof (wint_t) is 4 bytes,
the function can return the correct value, provided that the application
combines the UTF-16 surrogate pair into a UTF-32 value before calling iswalpha.
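As a rough sketch of that conversion step (not code from Cygwin or newlib):
the helper names is_surrogate_pair, utf16_pair_to_utf32 and pair_is_alpha
are made up for illustration, and the sketch assumes a platform where
wint_t is wide enough to hold a full code point. Whether iswalpha then
gives the expected answer still depends on the active locale and on the
C library's classification tables.

```c
#include <assert.h>
#include <stdint.h>
#include <wctype.h>

/* Hypothetical helper: nonzero if (hi, lo) is a valid UTF-16 surrogate pair. */
static int is_surrogate_pair(uint16_t hi, uint16_t lo)
{
    return hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF;
}

/* Hypothetical helper: combine a surrogate pair into one UTF-32 code point.
   The caller must pass a valid pair; this is checked with assert. */
static uint32_t utf16_pair_to_utf32(uint16_t hi, uint16_t lo)
{
    assert(is_surrogate_pair(hi, lo));
    return 0x10000u
           + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}

/* Classify the character encoded by a surrogate pair.
   Assumption: wint_t can represent values up to 0x10FFFF (true where
   sizeof (wint_t) is 4). The result is locale and libc dependent. */
static int pair_is_alpha(uint16_t hi, uint16_t lo)
{
    return iswalpha((wint_t)utf16_pair_to_utf32(hi, lo)) != 0;
}
```

With the pairs quoted above, utf16_pair_to_utf32(0xD800, 0xDF1E) yields
0x1031E and utf16_pair_to_utf32(0xD834, 0xDF20) yields 0x1D320, so each
character reaches iswalpha as a single, distinct wint_t value.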
> We agree that it is a bug. And it is caused by
> - the fact that Cygwin's wchar_t[] encoding is UTF-16, and
> - there is no way to define the <wctype.h> POSIX functions sanely in this
> setting, and
See above.
Corinna
--
Corinna Vinschen Please, send mails regarding Cygwin to
Cygwin Project Co-Leader cygwin AT cygwin DOT com
Red Hat