This is the mail archive of the cygwin mailing list for the Cygwin project.
Re: Bug in libiconv?

From: Bruno Haible <bruno at clisp dot org>
To: cygwin at cygwin dot com
Date: Wed, 2 Feb 2011 19:58:12 +0100
Subject: Re: Bug in libiconv?
[resent to the cygwin list; please add bug-gnu-libiconv to your replies]

Hi Corinna,

Thanks for your reply <http://cygwin.com/ml/cygwin/2011-01/msg00410.html>

> > Please CC the bug-gnu-libiconv mailing list when discussing possible
> > bugs in GNU libiconv.
>
> Ok

Thanks for giving it a try. Although you CCed bug-gnu-libiconv, your message
did not reach the list (Charles' and Eric's did). I guess this is because the
cygwin.com mail server refuses to deliver to corinna-cygwin, so the spam
detection at gnu.org classified your sending address as a spammer's. This
makes it hard for me to notice that you replied to me, since I don't read the
cygwin mailing list regularly.

> > I don't think defining __STDC_ISO_10646__
> > is compliant with ISO C 99 in this situation.
> > ...
> I don't read that from your above quote.  The core is that the *type*
> wchar_t is a *coded* *representation* of the characters defined in
> 10646.

OK.

> > What is the Cygwin wchar_t[] encoding? Is it UTF-16, like on Win32?
> Yes.
> ...
> yes, for the foreseeable future, Cygwin will define wchar_t == UTF-16.

Thanks for confirming it. I've started thinking about how gnulib can
cope with it, now.

> I've put a lot of effort in 2009 and early 2010 to make the wchar_t
> representation in Cygwin and newlib as much Unicode 5.2 compatible as
> possible.  Even the wcrtomb and mbrtowc functions in newlib are capable
> of dealing with UTF-16 surrogates.

I appreciate your effort on internationalization of Cygwin. You went as
far as you could get with the given choice of wchar_t. It's just a fact
that the <wctype.h> functions and wcwidth() cannot work right when wchar_t[]
is UTF-16. And these functions are the only reason why gnulib and coreutils
code uses wide character strings at all.
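
To illustrate why a per-wchar_t API cannot work here: a character outside the
Basic Multilingual Plane occupies two UTF-16 units, so a function such as
iswalpha() or wcwidth(), which receives one unit at a time, never sees the
whole character. The following is only a sketch, not code from gnulib or
newlib; the name utf16_decode is hypothetical, and uint16_t is used instead of
wchar_t so that it behaves the same on platforms with a 32-bit wchar_t.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from the UTF-16 units s[0..n-1].
   Returns the number of units consumed (1 or 2) and stores the code
   point in *cp, or returns 0 if the input is empty, truncated, or
   contains an unpaired surrogate.  (Hypothetical helper.)  */
size_t utf16_decode(const uint16_t *s, size_t n, uint32_t *cp)
{
  if (n == 0)
    return 0;
  uint16_t u = s[0];
  if (u >= 0xD800 && u <= 0xDBFF) {
    /* High surrogate: a low surrogate must follow.  */
    if (n < 2)
      return 0;
    uint16_t lo = s[1];
    if (lo < 0xDC00 || lo > 0xDFFF)
      return 0;
    *cp = 0x10000 + ((uint32_t)(u - 0xD800) << 10) + (uint32_t)(lo - 0xDC00);
    return 2;
  }
  if (u >= 0xDC00 && u <= 0xDFFF)
    return 0;                   /* Low surrogate without a high one.  */
  *cp = u;
  return 1;
}
```

For U+1D11E, encoded as 0xD834 0xDD1E, the function consumes two units; given
only the first unit it has to give up, which is exactly the position the
<wctype.h> functions are in.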

I'm not criticizing the Cygwin choice. Even if Cygwin had chosen to define
'wchar_t' to a 32-bit type, the same problem would have remained for mingw
programs running in UTF-8 or GB18030 locales. (I understand that such
locales exist in Windows 7.)

> I don't quite grok the code at this point:
> 
>   #if __STDC_ISO_10646__ || defined _WIN32 || defined __WIN32__
>       if (sizeof(wchar_t) == 4) {
>         index = ei_ucs4internal;
>         break;
>       }
>       if (sizeof(wchar_t) == 2) {
>         index = ei_ucs2internal;
>         break;
>       }
>       if (sizeof(wchar_t) == 1) {
>         index = ei_iso8859_1;
>         break;
>       }
>   #endif
> ...
> I *don't* understand that you do the same for Win32.  Old
> Windows versions are using the basic UCS-2 character plane, but newer
> versions, at least since Windows XP are using UTF-16.

Thank you for this remark. I have corrected this in libiconv, and also
added support for Cygwin >= 1.7 at the same place.
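
The idea of the correction, as a rough sketch only (this is not the actual
libiconv change; the enum, the function name, and the utf16_platform
parameter are all hypothetical): a 2-byte wchar_t is treated as UTF-16 on
platforms known to use it, and as plain UCS-2 otherwise.

```c
#include <stddef.h>

/* Hypothetical labels for the internal encoding chosen for WCHAR_T.  */
enum wchar_encoding {
  WENC_UNKNOWN,
  WENC_UCS4,      /* 4-byte wchar_t holding one code point each */
  WENC_UTF16,     /* 2-byte wchar_t, surrogate pairs allowed */
  WENC_UCS2,      /* 2-byte wchar_t, Basic Multilingual Plane only */
  WENC_LATIN1     /* 1-byte wchar_t */
};

/* Pick the encoding from the size of wchar_t.  utf16_platform is an
   assumed flag, nonzero on platforms whose 2-byte wchar_t is UTF-16
   (per this thread: Windows since XP, and Cygwin >= 1.7).  */
enum wchar_encoding wchar_encoding_for(size_t wchar_size, int utf16_platform)
{
  switch (wchar_size) {
  case 4:
    return WENC_UCS4;
  case 2:
    return utf16_platform ? WENC_UTF16 : WENC_UCS2;
  case 1:
    return WENC_LATIN1;
  default:
    return WENC_UNKNOWN;
  }
}
```

A caller would pass sizeof(wchar_t) and a flag derived from the platform
macros.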

> > > the application tests to convert a UTF-8 to WCHAR_T string in four
> > >   combinations of the current locale, in this order:
> > > 
> > >   - iconv_open "C",       iconv "C"
> > >   - iconv_open "C",       iconv "C.UTF-8"
> > >   - iconv_open "C.UTF-8", iconv "C"
> > >   - iconv_open "C.UTF-8", iconv "C.UTF-8"
> ...
> My testcase is a result of trying
> to build a real-life application, gencat from glibc.  For some reason
> gencat thinks it has to set the locale back to "C" in a hardcoded manner.
> 
> This works fine for glibc systems, but the invisible and, IMHO,
> intransparent behaviour of libiconv on other systems makes it pretty
> hard to understand the behaviour of an application when porting it.

I don't see this as a particular "intransparent behaviour of libiconv".
When taking code that was tested only in a single environment (glibc in this
case), you always have to make some effort to make it portable.

> > Is cygwin_conv_to_posix_path deprecated? Does it introduce limitations of
> > some kind?
>
> Like the underlying Windows functions, Cygwin 1.7 now supports paths of
> up to 32K chars.  The old cygwin_conv_to_posix_path function and its
> friends are written with the Windows ANSI API in mind, so they only
> support paths of up to MAX_PATH == 260 chars.

Thanks for explaining. I'll try to avoid this function.

> > > The usage of a fixed table instead of the charset.alias file in
> > > libcharset/lib/localcharset.c, function get_charset_aliases() is
> > > not good, not good at all.
> > 
> > The alternative is to have this table stored in a file charset.alias;
> > but then every package that includes the module 'localcharset' from
> > gnulib (that is, libiconv, gettext, coreutils, and many others) will
> > want to modify this file during "make install". And this causes a lot of
> > headaches to packaging systems. Therefore, on platforms which have
> > widely used packaging systems (Linux, MacOS X, Cygwin), it's better to
> > avoid the need for this file.
> 
> Now I'm puzzled.  If that's the case, why does libiconv request the
> charset.alias file on *any* other system than DARWIN7, VMS, and Windows?
> Especially on Linux?

I "optimized" only the MacOS X, VMS, and Windows OSes. It would have been
more work to optimize all versions of Solaris, FreeBSD, AIX, etc. in the
same way.

charset.alias is requested on Linux, even though it normally does not exist,
so that packagers and users have a chance to modify the behaviour.

> Additionally, the fixed, Windows-centric table in libiconv removes the
> ability of a system to define their own set of aliases.  Also,
> Cygwin/newlib already handles the Windows codepages by itself.

There are a couple of places in gnulib, coreutils, and gettext that make
decisions based on the encoding of the current locale. In these places, I want
to use a single name for each encoding and not have to list all possible
aliases that any system in the world can use for it.

If a system adds new aliases (Solaris, for example, uses "PCK" when it means
"Shift_JIS"), this needs to be handled in localcharset.c. There is no
system-defined API for resolving these aliases.
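
A fixed alias table of this kind can be sketched as follows. This is not the
code in localcharset.c; the table layout and the name canonical_charset are
hypothetical, and apart from the "PCK" entry mentioned above, the entries are
illustrative assumptions.

```c
#include <stddef.h>
#include <string.h>

/* One system-specific alias and the single canonical name used for it.  */
struct charset_alias {
  const char *alias;
  const char *canonical;
};

/* Illustrative entries only.  */
static const struct charset_alias alias_table[] = {
  { "PCK",     "SHIFT_JIS" },   /* Solaris name, as mentioned above */
  { "CP65001", "UTF-8" }        /* assumed example of a codepage alias */
};

/* Return the canonical name for NAME, or NAME itself if no alias
   matches.  The comparison is case-sensitive in this sketch.  */
const char *canonical_charset(const char *name)
{
  size_t count = sizeof alias_table / sizeof alias_table[0];
  for (size_t i = 0; i < count; i++)
    if (strcmp(name, alias_table[i].alias) == 0)
      return alias_table[i].canonical;
  return name;
}
```

Code that decides based on the locale encoding then compares against one
canonical name per encoding instead of enumerating every alias.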

Even if Cygwin/newlib handles Windows codepage aliases in all places where
it matters for Cygwin, there are still places where it matters for gnulib,
coreutils, and gettext.

> > Neither libiconv nor gettext defines or undefines _WIN32 or __WIN32__.
> > But they are prepared to either setting.
>
> Isn't that just covering a PEBKAC?  I mean, there's no good reason to
> define -mwin32 on the command line and the libiconv configure certainly
> doesn't add it.  Whoever squeezed a -mwin32 onto the GCC command line,
> or even defined -D__WIN32__ manually, deserves the result.

But such a user will then write a mail to a mailing list, and it will take
time for me (or someone else) to investigate and answer it. By writing
  #if (defined _WIN32 || defined __WIN32__) && !defined __CYGWIN__
I avoid this potential problem.
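
As a small sketch of how such a guard is typically used (the macro
NATIVE_WINDOWS and the function is_native_windows are hypothetical names,
not taken from libiconv or gettext):

```c
/* Nonzero only for native Windows builds (e.g. mingw).  On Cygwin the
   result stays 0 even if _WIN32 or __WIN32__ was defined on the
   command line, for instance through -mwin32.  */
#if (defined _WIN32 || defined __WIN32__) && !defined __CYGWIN__
# define NATIVE_WINDOWS 1
#else
# define NATIVE_WINDOWS 0
#endif

/* Runtime view of the compile-time decision above.  */
int is_native_windows(void)
{
  return NATIVE_WINDOWS;
}
```

All Windows-specific code paths then test the one macro, so a stray -mwin32
on Cygwin cannot switch them on.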

Thanks again for your reply and for the hint to the bug in libiconv's code.

Bruno
-- 
In memoriam Carl Friedrich Goerdeler <http://en.wikipedia.org/wiki/Carl_Friedrich_Goerdeler>
