1.7.0-48: [BUG] Passing characters above 128 from bash command line

Tue Jun 2 20:54:00 GMT 2009

On May 29 17:21, Edward Lam wrote:
>
> Alexey Borzenkov wrote:
> > No, the bug is not that it gets wrong number of arguments. In fact,
> > Windows has no concept of arguments, only C runtime does, which parses
> > the command line. If command line is truncated, then C runtime will
> > have missing arguments when it tries to parse it.
>
> Sorry, I had meant to comment on this previously but hit send too soon.
>
> I think the problem I'm running into is:
> - I give cygwin 1.7's bash a string that is in my system default code page.
> - cygwin 1.7 thinks the string is actually UTF-8 and tries to convert it  
> as UTF-8 into UTF-16, resulting in a truncated command line that is  
> passed to child process.
>
> Here's some more investigation:
>
> $ cat bug.c
> #include <stdio.h>
>
> int wmain(int argc, wchar_t *argv[], wchar_t *envp[])
> {
>     int i;
>     for (i = 0; i < argc; i++)
>         wprintf(L"%d: %s\n", i, argv[i]);
>     return 0;
> }
>
> ... and compiled using MSVC ....
>
> $ ./bug arg1 "before `cat copyright.txt` after" arg3
> 0: E:\cygwin1.7\tmp\bug.exe
> 1: arg1
> 2: before
>
> So note that even with what seems to be a Unicode-aware child process,
> I'm still getting a truncated command line. In fact, calling
> GetCommandLineW() directly seems to give a truncated command line
> as well.

The question is, what do you expect?  I know, you expect that it "just
works", but that's not as easy as you might assume, unfortunately.

Let's assume you're doing all this in a Windows console.  The character
we're talking about is a singlebyte or multibyte character with the
value 0xa9.  What exactly is this character 0xa9?

- It's the "Copyright" sign in Windows codepage 1252, the default GUI
  (ANSI) codepage for many western languages and, incidentally, in
  ISO-8859-1 and ISO-8859-15.  The Unicode value of this character is
  0xa9.

- It's the "reverse not sign" in Windows codepage 437, the default
  console (OEM) codepage on US systems.  The Unicode value is 0x2310.

- It's the "Registered trademark" sign in Windows codepage 850, the
  default OEM codepage in a couple of Western European languages
  (French, German, Italian, ...).  The Unicode value is 0xae.

- It's the Cyrillic capital letter IE in Windows codepage 855, the
  default OEM codepage for languages using Cyrillic characters.  The
  Unicode value is 0x0415.

You get the idea.  The character 0xa9 has no meaning in itself.  It only
has a meaning when you consider the character set or codepage in which
you use this character.
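
To make that concrete, here is a small sketch in C which just records
the four interpretations listed above.  The struct, the table and the
helper a9_to_unicode() are made-up names for illustration, not anything
from Cygwin or Windows, and the table covers only these four codepages.

```c
#include <stddef.h>
#include <string.h>

/* What the single byte 0xa9 means in the four codepages listed above.
   Illustration only: this is not a general conversion routine. */
struct a9_meaning {
    const char *codepage;     /* codepage name, e.g. "CP437" */
    unsigned    code_point;   /* Unicode value of byte 0xa9 there */
    const char *description;  /* character name */
};

static const struct a9_meaning a9_table[] = {
    { "CP1252", 0x00a9, "copyright sign" },
    { "CP437",  0x2310, "reversed not sign" },
    { "CP850",  0x00ae, "registered sign" },
    { "CP855",  0x0415, "Cyrillic capital letter IE" },
};

/* Hypothetical helper: return the Unicode value that byte 0xa9 maps to
   in the named codepage, or 0 if the codepage is not in the table. */
unsigned a9_to_unicode(const char *codepage)
{
    size_t i;

    for (i = 0; i < sizeof a9_table / sizeof a9_table[0]; i++)
        if (strcmp(a9_table[i].codepage, codepage) == 0)
            return a9_table[i].code_point;
    return 0;
}
```

The same input byte gives four different answers, and the only thing
which selects between them is the codepage name passed in by the caller.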

When converting this character to UTF-16, the converting function has to
know the charset in which the character has been given.  The problem is,
how is Cygwin supposed to know in which codepage or charset the
character has been created?  In your case it's even more weird.  How is
anybody supposed to know that the file which consists of the single byte
0xa9 has *any* meaning at all?  Why should it be the copyright sign, of
all things?

Cygwin now defaults to UTF-8.  In UTF-8, a lone byte 0xa9 is invalid: it
is a continuation byte, which may only follow a lead byte.  The function
which converts the command line fails at this invalid byte, and that is
why the child process sees a truncated command line.  Whether this is
good or bad is another problem, but the fact is, Cygwin doesn't know what
to do with this value in the first place.  It doesn't know anything about the
charset used to generate the character with the value 0xa9.  So, even if
you take Cygwin out of the picture, if you create a console application
which writes the multibyte character with value 0xa9 to the console, it
will in all likelihood not be the copyright sign.  If you're printing on
a US system, the default console codepage is 437 and you get the reverse
not sign.  If you call `chcp 1252' and print again, you get the
copyright sign.
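
As a rough sketch of why a UTF-8 conversion trips over this byte, the
following C function scans a buffer and reports where the first
malformed sequence starts.  This is not Cygwin's actual conversion code,
and the name utf8_first_invalid() is made up for this mail.  It only
checks the lead byte / continuation byte structure and does not reject
overlong forms or surrogates.

```c
#include <stddef.h>

/* Return the offset of the first byte that does not start a structurally
   valid UTF-8 sequence, or len if no such byte is found. */
size_t utf8_first_invalid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t need;  /* continuation bytes expected after c */
        size_t k;

        if (c < 0x80)
            need = 0;                 /* 0xxxxxxx: ASCII */
        else if ((c & 0xe0) == 0xc0)
            need = 1;                 /* 110xxxxx */
        else if ((c & 0xf0) == 0xe0)
            need = 2;                 /* 1110xxxx */
        else if ((c & 0xf8) == 0xf0)
            need = 3;                 /* 11110xxx */
        else
            return i;  /* 10xxxxxx (such as 0xa9) or 0xf8..0xff:
                          cannot start a sequence */

        if (need > len - i - 1)
            return i;                 /* sequence cut off by end of buffer */
        for (k = 1; k <= need; k++)
            if ((s[i + k] & 0xc0) != 0x80)
                return i;             /* expected a continuation byte */
        i += need + 1;
    }
    return len;
}
```

For the bytes of "before \xa9 after" this returns 7, the offset of the
0xa9 byte right after "before ".  If the byte is preceded by the lead
byte 0xc2, the whole buffer passes.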

The bottom line is, whatever default we use, we're screwed in some way,
because it will cause inconvenience for one part of the users and help
the others.  That was already the case for the old
CYGWIN=codepage:{oem|ansi} environment variable setting.

If we default to the OEM charset, you will not get the expected result
for characters created using the ANSI codepage and get problems
interacting with applications using the ANSI codepage.

If we default to the ANSI codepage, you will have the same problem, just
the other way around.  In both cases you will have even more problems if
you start using characters not available in your default codepage.

If we default to UTF-8, Cygwin has no problem working with any Unicode
character, but you will have to take some care when interacting with
Windows applications when using non-ASCII characters.  In your case, in
which only you know that 0xa9 is meant to be the copyright char, you
should tell Cygwin which charset you want to use.  Try setting LANG to
en_US.CP1252.  Your example should work then.
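
For comparison, if you stay with the UTF-8 default instead, the file
would have to contain the UTF-8 encoding of the copyright sign rather
than the single byte 0xa9.  A minimal sketch of that encoding step in C
follows; utf8_encode() is a made-up helper name for this mail, not a
Cygwin or libc function.

```c
#include <stddef.h>

/* Encode one Unicode value as UTF-8 into out, which must have room for
   4 bytes.  Returns the number of bytes written (1 to 4), or 0 if cp is
   a surrogate or lies beyond 0x10ffff. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xc0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3f));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xd800 && cp <= 0xdfff)
            return 0;  /* surrogates are not valid characters */
        out[0] = (unsigned char) (0xe0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
        out[2] = (unsigned char) (0x80 | (cp & 0x3f));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char) (0xf0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3f));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
        out[3] = (unsigned char) (0x80 | (cp & 0x3f));
        return 4;
    }
    return 0;
}
```

For the Unicode value 0xa9 this writes the two bytes 0xc2 0xa9.  So the
copyright sign in UTF-8 still ends in 0xa9, but only together with its
lead byte 0xc2.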

Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
