This is the mail archive of the cygwin mailing list for the Cygwin project.

Re: 16-bit wchar_t on Windows and Cygwin

From: Corinna Vinschen <corinna-cygwin at cygwin dot com>
To: cygwin at cygwin dot com, bug-gnulib at gnu dot org, bug-coreutils at gnu dot org
Date: Wed, 2 Feb 2011 13:21:02 +0100
Subject: Re: 16-bit wchar_t on Windows and Cygwin
References: <201101310304.42975.bruno@clisp.org> <4D46EA2B.1010307@redhat.com> <201102021229.04623.bruno@clisp.org> <20110202121442.GC2675@calimero.vinschen.de>
Reply-to: cygwin at cygwin dot com, bug-gnulib at gnu dot org, bug-coreutils at gnu dot org

On Feb  2 13:14, Corinna Vinschen wrote:
> On Feb  2 12:29, Bruno Haible wrote:
> > Hello Eric,
> > 
> > > ... POSIX requires that 1 wchar_t corresponds to 1 character
> > > ...
> > > > What consequences does this have?
> > > > 
> > > >   1) All code that uses the functions from <wctype.h> (wide character
> > > >      classification and mapping) or wcwidth() malfunctions on strings that
> > > >      contain Unicode characters outside the BMP, i.e. outside the range
> > > >      U+0000..U+FFFF.
> > > 
> > > Not necessarily.  Such code falls outside of POSIX, but it may still be
> > > a well-behaved extension if given sane behavior for how to deal with
> > > surrogates.
> > 
> > No. Code that uses <wctype.h> and wcwidth() is written precisely according
> > to POSIX. The problem is that this code cannot work correctly when wchar_t[]
> > is in UTF-16 encoding. There simply is no way to define these functions
> > in a reasonable way for surrogates.
> > 
> > For example:
> >   U+1031E = 0xD800 0xDF1E   is a letter (iswalpha should be true)
> >   U+10320 = 0xD800 0xDF20   is not a letter (iswalpha should be false)
> >   U+1D31E = 0xD834 0xDF1E   is not a letter (iswalpha should be false)
> >   U+1D320 = 0xD834 0xDF20   is not a letter (iswalpha should be false)
> >   U+1D71E = 0xD835 0xDF1E   is a letter (iswalpha should be true)
> >   U+1D720 = 0xD835 0xDF20   is a letter (iswalpha should be true)
> > There is no way that a system can provide this information through a
> > function 'iswalpha' that takes a single wchar_t argument.
> 
> iswalpha takes wint_t, not wchar_t.  Since sizeof (wint_t) is 4 bytes,
> the function can return the correct value, provided that the application
> converts the UTF-16 surrogate pair to UTF-32 before calling iswalpha.

And, please note the wording in SUSv4, for instance in
http://calimero.vinschen.de/susv4/functions/iswalpha.html

  The wc argument is a wint_t, the value of which the application shall
                       ^^^^^^                         ^^^^^^^^^^^
  ensure is a wide-character code corresponding to a valid character in
  the current locale, or equal to the value of the macro WEOF. If the
  argument has any other value, the behavior is undefined.

I don't see anything in that wording that would prohibit converting a
UTF-16 surrogate pair in wchar_t to a UTF-32 wint_t value before calling
one of the wctype functions.  It is the same kind of care you need to
avoid calling the ctype functions with a negative signed char value.
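
To make the conversion concrete, here is a minimal sketch of what an
application could do before calling iswalpha.  The names utf16_decode
and is_alpha_utf16 are made up for this illustration and are not part
of Cygwin, newlib, or gnulib.  The sketch uses uint16_t for the UTF-16
units so that it does not depend on the platform's wchar_t size, and it
assumes wint_t is wide enough to hold a UTF-32 value, as it is on Cygwin.

```c
#include <stddef.h>
#include <stdint.h>
#include <wctype.h>

/* Hypothetical helper: decode one code point from the UTF-16 units
   s[0..n).  Stores the code point in *cp and returns the number of
   units consumed (1 or 2).  Returns 0 if n is 0 or if the input
   starts with an unpaired surrogate. */
static size_t
utf16_decode (const uint16_t *s, size_t n, uint32_t *cp)
{
  if (n == 0)
    return 0;

  uint16_t hi = s[0];

  /* Not a surrogate at all: the unit is the code point. */
  if (hi < 0xD800 || hi > 0xDFFF)
    {
      *cp = hi;
      return 1;
    }

  /* A low surrogate first, or a high surrogate with nothing after it. */
  if (hi > 0xDBFF || n < 2)
    return 0;

  uint16_t lo = s[1];
  if (lo < 0xDC00 || lo > 0xDFFF)
    return 0;

  /* Combine the two 10-bit halves and add the 0x10000 offset. */
  *cp = 0x10000 + (((uint32_t) (hi - 0xD800)) << 10) + (uint32_t) (lo - 0xDC00);
  return 2;
}

/* Hypothetical wrapper: classify the character at the start of s.
   Assumes wint_t can hold a UTF-32 value.  The result still depends
   on the current locale and on the C library's character tables. */
static int
is_alpha_utf16 (const uint16_t *s, size_t n)
{
  uint32_t cp;

  if (utf16_decode (s, n, &cp) == 0)
    return 0;   /* invalid input: not passed to iswalpha */
  return iswalpha ((wint_t) cp) != 0;
}
```

With this, the pair 0xD800 0xDF1E from the example above reaches
iswalpha as the single value 0x1031E, and 0xD835 0xDF20 as 0x1D720.
Unpaired surrogates are rejected before the call, so the function is
never handed a value that is not a valid character.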


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
