1.7.0-48: [BUG] Passing characters above 128 from bash command line
Corinna Vinschen
corinna-cygwin@cygwin.com
Wed Jun 3 14:28:00 GMT 2009
On Jun 3 09:18, Edward Lam wrote:
> Corinna Vinschen wrote:
>> The question is, what do you expect? [...]
> [...]
> Wikipedia has several suggestions on how to handle invalid UTF-8 byte
> sequences (http://en.wikipedia.org/wiki/UTF-8). Personally, I favor the
> rule that uses the replacement character.
Chris implemented the invalid code point solution. The discussion
in http://www.mail-archive.com/linux-utf8@nl.linux.org/msg00080.html
supports this solution. What's missing so far is the way back, from
an invalid single second half of a surrogate pair in the 0xDCxx range
back to the correct byte value. I'm just looking into that.
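As a rough illustration of the round trip described above (not the Cygwin code itself): Python's "surrogateescape" error handler uses the same idea, turning each undecodable byte 0xNN into the lone low surrogate U+DCNN and restoring it on the way back. The helper names bytes_to_text and text_to_bytes are made up for this sketch, and the exact mapping Cygwin uses inside the 0xDCxx range is an assumption here.

```python
def bytes_to_text(raw: bytes) -> str:
    """Decode as UTF-8; each invalid byte 0xNN becomes the lone surrogate U+DCNN."""
    return raw.decode("utf-8", errors="surrogateescape")


def text_to_bytes(text: str) -> bytes:
    """The way back: each lone surrogate U+DCNN is restored to the byte 0xNN."""
    return text.encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    raw = b"bug.exe \xa9"           # 0xA9 alone is not valid UTF-8
    text = bytes_to_text(raw)
    print(hex(ord(text[-1])))       # prints 0xdca9
    print(text_to_bytes(text) == raw)  # prints True
```

Valid UTF-8 input passes through unchanged, so only the bytes that could not be decoded take the detour through the 0xDCxx range.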
>> How is anybody supposed to know that the file which consists
>> of the single byte 0xa9 has *any* meaning at all? Why should it be
>> the copyright sign, of all things?
>
> What I was attempting to do was to have NO conversion. In the
> real case where I ran into this, the "bug.exe" was the one to properly
> interpret what the byte 0xA9 meant from the command line. Yes, I know
> there are several workarounds.
The command line is always converted to UTF-16 when calling a native
Win32 application. If we don't do it (because we call CreateProcessA),
Windows would do it. As matters stand, we have to convert ourselves,
because we must call CreateProcessW. Either way, the problem persists.
We just don't know what the correct conversion is for the given input.
We have to rely on a correct setting of $LC_ALL/$LANG/$LC_CTYPE.
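To make the ambiguity concrete, here is a small sketch that decodes the single byte 0xA9 under a few candidate charsets. The list of code pages (cp1252, cp437, cp850) is only an example picked for this sketch, and decode_as is a hypothetical helper, not anything in Cygwin or Windows.

```python
from typing import Optional


def decode_as(raw: bytes, charset: str) -> Optional[str]:
    """Strictly decode raw in the given charset; None if the bytes are invalid there."""
    try:
        return raw.decode(charset)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    raw = b"\xa9"
    # Example charsets only; the real answer depends on the user's locale.
    for charset in ("cp1252", "cp437", "cp850", "utf-8"):
        text = decode_as(raw, charset)
        if text is None:
            print(f"{charset}: invalid byte sequence")
        else:
            print(f"{charset}: U+{ord(text):04X}")
```

Only the cp1252 reading yields the copyright sign (U+00A9). The other code pages give different characters, and strict UTF-8 rejects the byte outright, so the UTF-16 command line a native application receives depends entirely on which charset is assumed.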
>> If we default to the ANSI codepage, you will have the same problem,
>> just upside down. In both cases you will have even more problems if
>> you start using characters not available in your default codepage.
>
> This is where I disagreed with Alexey. What we're really arguing here is
> which default will run into the fewest problems for the most
> common usage. This is subjective of course.
Definitely. The "right" solution is always only right for a given value
of right. What if the user has set LANG to, say, ja_JP.eucJP? That
user of course expects that the stuff on the command line is converted
to UTF-16 using the eucJP encoding. Anything else would just be very
surprising.
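A minimal sketch of picking the conversion charset from the environment, assuming the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG). The helper charset_from_env is hypothetical, and the fallback for a locale without an explicit codeset is a parameter because that default is a policy choice. Treating every locale without a ".codeset" part the same way is a simplification made for this example.

```python
def charset_from_env(env: dict, default: str = "UTF-8") -> str:
    """Return the codeset named by the first non-empty locale variable.

    Precedence follows POSIX: LC_ALL, then LC_CTYPE, then LANG.
    The default is an assumption of this sketch, used when no variable
    is set or the locale (e.g. "C") names no codeset.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name, "")
        if value:
            break
    else:
        return default  # nothing set at all
    if "." not in value:
        return default  # e.g. "C" or "ja_JP" without a codeset
    # "ja_JP.eucJP@modifier" -> "eucJP"
    codeset = value.split(".", 1)[1].split("@", 1)[0]
    return codeset or default


if __name__ == "__main__":
    env = {"LANG": "ja_JP.eucJP"}
    charset = charset_from_env(env)
    # 0xA4 0xA2 is HIRAGANA LETTER A (U+3042) in EUC-JP.
    text = b"\xa4\xa2".decode(charset)
    print(charset, hex(ord(text)))  # prints eucJP 0x3042
```

With LANG=C the function falls back to the default, which is where the UTF-8 versus ANSI codepage question comes in.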
What's left as questionable is the LANG=C default case. Following last
month's discussion, we now use UTF-8 as the default encoding,
because it's the only encoding which covers all (valid) characters.
Sure, we could also convert the command line using the current ANSI
codepage, as Windows does when CreateProcessA is called in this case.
Maybe we should do that for testing? Does anybody have a strong opinion
here?
Corinna
--
Corinna Vinschen Please, send mails regarding Cygwin to
Cygwin Project Co-Leader cygwin AT cygwin DOT com
Red Hat