Bug in libiconv?

Corinna Vinschen <corinna-cygwin@cygwin.com>
Sat Jan 29 16:02:00 GMT 2011


On Jan 28 23:12, Bruno Haible wrote:
> Hi Corinna and Chuck,
> 
> Please CC the bug-gnu-libiconv mailing list when discussing possible
> bugs in GNU libiconv.

Ok, no worries.  However, please remove my mail account from the CC.
I'm reading the cygwin ML anyway, so I don't need dups to my private
email account.

> Replying to <http://www.cygwin.com/ml/cygwin/2011-01/msg00292.html>:
> 
> > the application tests to convert a UTF-8 to WCHAR_T string in four
> >   combinations of the current locale, in this order:
> > 
> >   - iconv_open "C",       iconv "C"
> >   - iconv_open "C",       iconv "C.UTF-8"
> >   - iconv_open "C.UTF-8", iconv "C"
> >   - iconv_open "C.UTF-8", iconv "C.UTF-8"
> > [...]
> > Here's what happens on Cygwin:
> > 
> >   $ gcc -g -o ic ic.c -liconv
> >   $ ./ic
> >   iconv: 138 <Invalid or incomplete multibyte or wide character>
> >   in = <Liian pitkä sana>, inbuf = <ä sana>, inbytesleft = 7, outbytesleft = 492
> >   iconv: 138 <Invalid or incomplete multibyte or wide character>
> >   in = <Liian pitkä sana>, inbuf = <ä sana>, inbytesleft = 7, outbytesleft = 492
> >   iconv: 138 <Invalid or incomplete multibyte or wide character>
> >   in = <Liian pitkä sana>, inbuf = <ä sana>, inbytesleft = 7, outbytesleft = 492
> >   in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 480
> 
> On glibc systems, the encoding "WCHAR_T" is equivalent to "UCS-4" with machine
> dependent endianness and alignment. In particular it is independent of the
> locale. That explains the first set of results.
> 
> In libiconv, on systems which don't define __STDC_ISO_10646__, the encoding
> "WCHAR_T" is equivalent to wchar_t[], that is, dependent on the locale.
> Changing the locale encoding after allocating an iconv_t from or to "WCHAR_T"
> yields undefined behaviour. That explains the second set of results.

IMHO this is undesired behaviour.  My testcase is a result of trying
to build a real-life application, gencat from glibc.  For some reason
gencat thinks it has to set the locale back to "C" in a hardcoded manner.

This works fine on glibc systems, but the invisible and, IMHO, opaque
behaviour of libiconv on other systems makes it pretty hard to
understand what an application does when porting it.

So it's good that Cygwin will define __STDC_ISO_10646__ in future.
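
For reference, the failing pattern boils down to something like this
(a trimmed sketch of one combination from my testcase; error checking
omitted):

  #include <iconv.h>
  #include <locale.h>
  #include <wchar.h>

  int main (void)
  {
    setlocale (LC_ALL, "C");
    /* The descriptor is allocated while the locale is "C"... */
    iconv_t cd = iconv_open ("WCHAR_T", "UTF-8");

    /* ...but the locale changes before the conversion, as glibc's
       gencat does.  Harmless on glibc, where WCHAR_T is
       locale-independent; fails with EILSEQ (138) under libiconv
       on Cygwin. */
    setlocale (LC_ALL, "C.UTF-8");

    char in[] = "Liian pitk\xc3\xa4 sana";   /* UTF-8 input */
    wchar_t out[64];
    char *inp = in, *outp = (char *) out;
    size_t inleft = sizeof in - 1, outleft = sizeof out;
    size_t res = iconv (cd, &inp, &inleft, &outp, &outleft);
    iconv_close (cd);
    return res == (size_t) -1;
  }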

> Replying to <http://www.cygwin.com/ml/cygwin/2011-01/msg00299.html>:
> 
> > I defined __STDC_ISO_10646__ for Cygwin 1.7.8 yesterday.
> 
> What is the Cygwin wchar_t[] encoding? Is it UTF-16, like on Win32? The

Yes.  It makes a lot of sense, doesn't it?  Otherwise we would
constantly have to convert twice, from the multibyte charset to wchar_t
and then from wchar_t to UTF-16 for system access.

> documentation is silent about it. I had expected to find some word about it
> in <http://cygwin.com/cygwin-api/compatibility.html#std-susv4>
> or <http://cygwin.com/cygwin-api/std-notes.html>.

Hmm, I thought this would go without saying, given that sizeof(wchar_t)
has been 2 for as long as wchar_t has been a supported type in Cygwin at
all.  Maybe a hint in the docs wouldn't be too bad an idea...

> In any case, sizeof (wchar_t) == 2. I don't think defining __STDC_ISO_10646__
> is compliant with ISO C 99 in this situation. ISO C 99 section 6.10.8.(2) says:
> 
>   __STDC_ISO_10646__
>           An integer constant of the form yyyymmL (for example,
>           199712L), intended to indicate that values of type wchar_t are the
>           coded representations of the characters defined by ISO/IEC 10646,
>           along with all amendments and technical corrigenda as of the
>           specified year and month.
> 
> But when characters outside the basic plane, such as
> U+12345 (CUNEIFORM SIGN URU TIMES KI), are encoded by 2 consecutive wchar_t
> values, values of type wchar_t don't correspond to ISO/IEC 10646 characters.
> (Or maybe I'm underestimating what "coded representations" means...?)

I don't read that from your quote above.  The core is that the *type*
wchar_t is a *coded* *representation* of the characters defined in
10646.  At no point does it say that a single wchar_t value must
represent a single character from 10646.  So I take it that UTF-16 is a
valid coded representation of the characters from 10646.

I put a lot of effort in 2009 and early 2010 into making the wchar_t
representation in Cygwin and newlib as Unicode 5.2 compatible as
possible.  Even the wcrtomb and mbrtowc functions in newlib are capable
of dealing with UTF-16 surrogates.
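
For instance, a character outside the basic plane comes out of the
conversion functions as a surrogate pair.  A rough sketch (assumes a
UTF-8 locale is available):

  #include <locale.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <wchar.h>

  int main (void)
  {
    setlocale (LC_ALL, "C.UTF-8");

    /* U+12345 (CUNEIFORM SIGN URU TIMES KI) in UTF-8. */
    const char *in = "\xf0\x92\x8d\x85";
    wchar_t out[4];
    size_t n = mbstowcs (out, in, 4);
    if (n == (size_t) -1)
      return 1;

    /* With wchar_t == UTF-16, expect two units: the high surrogate
       0xd808 followed by the low surrogate 0xdf45. */
    printf ("%zu units:", n);
    for (size_t i = 0; i < n; i++)
      printf (" 0x%04x", (unsigned) out[i]);
    putchar ('\n');
    return 0;
  }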

However, given that Windows XP basically only supports the charset from
Unicode 4.0, and given that Cygwin's support for East Asian double- and
triple-byte codesets (Big5, GBK, eucKR, eucJP, and an SJIS/CP932
bastard) still requires the underlying Windows conversion functions,
I've set __STDC_ISO_10646__ to a value which reflects Unicode 4.0
(200305L) for Cygwin 1.7.8.
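
Applications can thus test the guarantee at compile time in the usual
way, e.g.:

  #include <wchar.h>

  #if defined __STDC_ISO_10646__ && __STDC_ISO_10646__ >= 200305L
  /* wchar_t values are coded representations of at least the
     Unicode 4.0 repertoire. */
  #else
  /* No ISO 10646 guarantee; treat the wchar_t encoding as
     locale-dependent. */
  #endif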

> Replying to <http://www.cygwin.com/ml/cygwin/2011-01/msg00357.html>:
> 
> >   #if __STDC_ISO_10646__ || ((defined _WIN32 || defined __WIN32__) && !defined __CYGWIN__)
> > This should be
> > ...
> >   #if __STDC_ISO_10646__ || defined _WIN32 || defined __WIN32__ || defined __CYGWIN__
> 
> That makes sense if Cygwin guarantees that from now on and in the future,
> the wchar_t encoding will always be UTF-16. Is this the case?

We have no reason to change that.  We could have done so when
introducing Cygwin 1.7, but it would not only have broken backward
compatibility with existing applications, it would also have required
changing GCC in a backward-incompatible way.  And given that the
underlying real OS uses UTF-16 for wchar_t anyway, it was the natural
choice to stick with sizeof(wchar_t) == 2.

So, yes, for the foreseeable future, Cygwin will define wchar_t == UTF-16.

But note Charles' email.  I don't think it's correct to use the above

  #if __STDC_ISO_10646__ || defined _WIN32 || defined __WIN32__ || defined __CYGWIN__

*if* you want to keep backward compatibility with Cygwin 1.5.

Btw., I don't quite grok the code at this point:

  #if __STDC_ISO_10646__ || defined _WIN32 || defined __WIN32__
      if (sizeof(wchar_t) == 4) {
        index = ei_ucs4internal;
        break;
      }
      if (sizeof(wchar_t) == 2) {
        index = ei_ucs2internal;
        break;
      }
      if (sizeof(wchar_t) == 1) {
        index = ei_iso8859_1;
        break;
      }
  #endif

Given your interpretation of the definition of __STDC_ISO_10646__, I
understand why you use UCS-2 as the wchar_t representation if
sizeof(wchar_t) == 2.

However, I *don't* understand why you do the same for Win32.  Old
Windows versions used UCS-2, which only covers the basic plane, but
newer versions, at least since Windows XP, use UTF-16.

So, here's the question.  Is ei_ucs2internal really UCS-2, or is it
UTF-16?  If the former, isn't it a bug to treat Windows as a UCS-2
system?

In general, shouldn't there be another choice to distinguish wchar_t ==
UCS-2 systems from wchar_t == UTF-16 systems?  I don't see that anywhere
in libiconv.
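
A simple probe would show the difference: convert a non-BMP character
from UTF-8 to WCHAR_T.  A strict UCS-2 converter has to fail with
EILSEQ, while a UTF-16 converter yields a surrogate pair.  Hypothetical
sketch:

  #include <errno.h>
  #include <iconv.h>
  #include <stdio.h>
  #include <wchar.h>

  int main (void)
  {
    iconv_t cd = iconv_open ("WCHAR_T", "UTF-8");
    char in[] = "\xf0\x92\x8d\x85";          /* U+12345 in UTF-8 */
    wchar_t out[4];
    char *inp = in, *outp = (char *) out;
    size_t inleft = sizeof in - 1, outleft = sizeof out;

    if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
      printf ("failed (errno %d): strict UCS-2 behaviour\n", errno);
    else
      printf ("converted to %zu wchar_t units: UTF-16 behaviour\n",
              (sizeof out - outleft) / sizeof (wchar_t));
    iconv_close (cd);
    return 0;
  }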

> Replying to <http://www.cygwin.com/ml/cygwin/2011-01/msg00299.html>:
> 
> > Why on earth is libiconv on Cygwin using Windows functions in some
> > places?
> 
> So that I could reuse the essentially same code on Cygwin as on native Win32.
> 
> Charles has submitted a patch on this topic to bug-gnulib; I will handle it.

Thanks.  From my POV it makes sense to get rid of as many Windowisms as
possible on Cygwin.

> > the old cygwin_conv_to_posix_path function as well.
> 
> Is cygwin_conv_to_posix_path deprecated? Does it introduce limitations of
> some kind?

Like the underlying Windows functions, Cygwin 1.7 now supports paths of
up to 32K chars.  The old cygwin_conv_to_posix_path function and its
friends were written with the Windows ANSI API in mind, so they only
support paths of up to MAX_PATH == 260 chars.

However, given that you can use /proc/self/maps for the same task,
there's no reason to use these functions at all.  Just let Cygwin do its
stuff internally.
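
E.g., to find the POSIX path a DLL was loaded from, something along
these lines would do (a sketch with minimal error handling; the DLL
name is just an example):

  #include <stdio.h>
  #include <string.h>

  /* Store the POSIX path of the first mapping whose pathname contains
     `name', e.g. "cygiconv-2.dll".  Returns 1 on success. */
  static int
  find_mapped_dll (const char *name, char *buf, size_t len)
  {
    FILE *fp = fopen ("/proc/self/maps", "r");
    char line[4096];
    int found = 0;

    if (!fp)
      return 0;
    while (fgets (line, sizeof line, fp))
      {
        /* Each line ends with the pathname of the mapped object. */
        char *path = strchr (line, '/');
        if (path && strstr (path, name))
          {
            path[strcspn (path, "\n")] = '\0';
            strncpy (buf, path, len - 1);
            buf[len - 1] = '\0';
            found = 1;
            break;
          }
      }
    fclose (fp);
    return found;
  }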

> > The usage of a fixed table instead of the charset.alias file in
> > libcharset/lib/localcharset.c, function get_charset_aliases() is
> > not good, not good at all.
> 
> The alternative is to have this table stored in a file charset.alias;
> but then every package that includes the module 'localcharset' from
> gnulib (that is, libiconv, gettext, coreutils, and many others) will
> want to modify this file during "make install". And this causes a lot of
> headaches to packaging systems. Therefore, on platforms which have
> widely used packaging systems (Linux, MacOS X, Cygwin), it's better to
> avoid the need for this file.

Now I'm puzzled.  If that's the case, why does libiconv request the
charset.alias file on *any* system other than DARWIN7, VMS, and Windows?
Especially on Linux?

Additionally, the fixed, Windows-centric table in libiconv removes a
system's ability to define its own set of aliases.  Also, Cygwin/newlib
already handles the Windows codepages by itself.
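
For reference, charset.alias is just a plain text file of alias /
canonical-name pairs (comment lines start with '#'), so a system can
easily ship its own, e.g.:

  # charset.alias -- map system charset names to canonical names.
  ISO8859-1  ISO-8859-1
  EUCJP      EUC-JP
  646        ASCII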

> Additionally, on Win32 systems relocatability
> is a must, and the code to compute the location of charset.alias from
> the location of libiconv.dll would be overkill.

Why?  Why isn't it overkill to do it on Linux and others, but overkill
on Windows or, FWIW, Darwin7 and VMS?  Sure, it's faster to use a fixed
alias table than to read a file, but the mechanism to fetch the file
location is not that much slower on these systems, nor is there a reason
that other systems should get the additional flexibility which you deny
those three.

As one of the core Cygwin maintainers I prefer that external libs from
the POSIX world are built with as few Windowisms and as few #ifdef
__CYGWIN__ tweaks as possible.  That's the goal we're working towards.
If some POSIXism doesn't work, start by complaining here.  Either we can
make it work, or we can discuss the least intrusive workaround.

> Replying to <http://www.cygwin.com/ml/cygwin/2011-01/msg00303.html>:
> 
> > It looks like there's been some bitrot with respect
> > to some of the "&& !CYGWIN" guards on WIN32.  Both libiconv and gettext,
> > IIRC, jump thru hoops to ensure that [_]*WIN32 is defined for both
> > "regular" win32 and for cygwin...which means defined(CYGWIN) guards are
> > necessary.
> 
> The reason for these "&& !defined __CYGWIN__" clauses is that - at least
> in Cygwin 1.5.x - gcc has an option that will define _WIN32 or __WIN32__.

That has nothing to do with Cygwin 1.5.  That's still an option in more
recent GCC versions.  I haven't the faintest idea why.

> So, _WIN32 || __WIN32__ may evaluate to true on Cygwin, or it may
> evaluate to false on Cygwin. Since I don't want libiconv or gettext
> to be compiled in two possible ways on Cygwin, I add
> "&& !defined __CYGWIN__".
> 
> Neither libiconv nor gettext defines or undefines _WIN32 or __WIN32__.
> But they are prepared to either setting.

Isn't that just covering a PEBKAC?  I mean, there's no good reason to
define -mwin32 on the command line, and the libiconv configure certainly
doesn't add it.  Whoever squeezed a -mwin32 onto the GCC command line,
or even defined -D__WIN32__ manually, deserves the result.
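
A quick probe shows the effect of -mwin32 on the predefined macros
(sketch; compile once plain and once with gcc -mwin32):

  #include <stdio.h>

  int main (void)
  {
    /* Compiled plain on Cygwin, only __CYGWIN__ is predefined;
       with -mwin32, _WIN32/__WIN32__ show up as well. */
  #if defined _WIN32 || defined __WIN32__
    puts ("_WIN32/__WIN32__ defined");
  #endif
  #ifdef __CYGWIN__
    puts ("__CYGWIN__ defined");
  #endif
    return 0;
  }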

> Replying to <http://www.cygwin.com/ml/cygwin/2011-01/msg00332.html>:
> 
> > there ARE still bugs in libiconv on Cygwin -- specifically:
> >  - Even though iconv_open has been opened explicitely with "UTF-8" as
> >    input string, the conversion still depends on the current application
> >    codeset.  That doesn't make sense.
> 
> If the other argument to iconv_open is "CHAR" or "WCHAR_T", hence locale
> dependent, and you change the locale in between, the result is undefined
> behaviour.

Why in libiconv?  Why not in glibc?  As I wrote above, this behaviour
is most unexpected and most surprising.  It also makes porting glibc
applications so much harder for no good reason.

> >  - 'iconv_close ((iconv_t) -1);' crashes the application with a SEGV.
> 
> It's not a bug. From POSIX:2008
> <http://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv_open.html>
> you can infer that (iconv_t) -1 is not a "conversion descriptor". It's a
> return value used from iconv_open(), nothing more. From
> <http://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv_close.html>
> you can see that the argument of iconv_close() has to be a conversion
> descriptor. From the ERRORS section in the same page you can see that
> iconv_close() is not required to catch a faulty argument. Note the word
> "may", not "shall".

Ok, so it's not a bug but blessed behaviour per POSIX.  However, is it
really necessary?  An extra check doesn't cost you anything, makes
libiconv more user-friendly, and aligns the behaviour with glibc, which,
again, makes porting applications easier.  Glibc's gencat is one such
application; it crashes under libiconv exactly because of that.
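
The check would be a one-liner at the top of iconv_close(), along these
lines (a sketch, not libiconv's actual code):

  #include <errno.h>
  #include <iconv.h>

  int
  iconv_close (iconv_t cd)
  {
    /* Reject the iconv_open() error return instead of dereferencing
       it; glibc fails gracefully with EBADF here. */
    if (cd == (iconv_t) -1)
      {
        errno = EBADF;
        return -1;
      }
    /* ... normal teardown ... */
    return 0;
  }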


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
