This is the mail archive of the
mailing list for the Cygwin project.
Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
On Sep 21 19:54, Andy Koppe wrote:
> 2009/9/21 Corinna Vinschen:
> > As you might know, invalid bytes >= 0x80 are translated to UTF-16 by
> > transposing them into the 0xdc00 - 0xdcff range by just or'ing 0xdc00.
> > The problem now is that readdir() will return the transposed characters
> > as if they are the original characters.
> Yep, that's where the bug is. Those 0xDC?? words represent invalid
> UTF-8 bytes. They do not represent CP1252 or ISO-8859-1 characters.
> Therefore, when converting a UTF-16 Windows filename to the current
> charset, 0xDC?? words should be treated like any other UTF-16 word
> that can't be represented in the current charset: it should be encoded
> as a ^N sequence.
How? Just like the incoming multibyte character didn't represent a valid
UTF-8 char, a single U+DCxx value does not represent a valid UTF-16 char.
Therefore, the ^N conversion will fail since U+DCxx can't be converted
to valid UTF-8.
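
To make the point concrete, here is a minimal sketch of the transposition
and of why the resulting value cannot go through a strict UTF-8 encoder.
It is written in Python purely for illustration and is not Cygwin's actual
code; the function names are hypothetical.

```python
def transpose_invalid_byte(b: int) -> int:
    """Map an invalid byte >= 0x80 to a UTF-16 code unit by OR'ing in 0xDC00."""
    if not 0x80 <= b <= 0xFF:
        raise ValueError("expected a byte in the range 0x80..0xFF")
    return 0xDC00 | b


def is_lone_low_surrogate(unit: int) -> bool:
    """Code units 0xDC00..0xDFFF are low surrogates, invalid on their own."""
    return 0xDC00 <= unit <= 0xDFFF


def can_encode_as_utf8(unit: int) -> bool:
    """Return True if the code point survives a strict UTF-8 encoder."""
    try:
        chr(unit).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


unit = transpose_invalid_byte(0xC4)
print(hex(unit))                    # 0xdcc4
print(is_lone_low_surrogate(unit))  # True
print(can_encode_as_utf8(unit))     # False: surrogates are not allowed
print(can_encode_as_utf8(0x00C4))   # True: an ordinary code point
```

Any ^N-style escape built on a strict UTF-8 conversion of the UTF-16 value
would hit the same failure for every value in the 0xdc80 - 0xdcff range.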
> > So it looks like the current mechanism to handle invalid multibyte
> > sequences is too complicated for us.  As far as I can see, it would be
> > much simpler and less error prone to translate the invalid bytes simply
> > to the equivalent UTF-16 value.  That creates filenames with UTF-16
> > values from the ISO-8859-1 range.
> This won't work correctly, because different POSIX filenames will map
> to the same Windows filename. For example, the filenames "\xC3\x84"
> (valid UTF-8 for A-umlaut) and "\xC4" (invalid UTF-8 sequence that
> represents A-umlaut in 8859-1), will both map to Windows filename
> "U+00C4", i.e. A-umlaut in UTF-16. Furthermore, after creating a file
> called "\xC4", a readdir() would show that file as "\xC3\x84".
Right, but using your above suggestion will also lead to another filename
in readdir, it would just be \x0e\xsome\xthing.
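
The collision under discussion can be reproduced with a small model of the
"simple" translation. This is a sketch under assumptions, not Cygwin code:
to_windows_name_simple() and the "latin1-fallback" handler name are made up
for the example, and the byte pair b"\xc3\x84" / b"\xc4" is used because
both land on U+00C4. For contrast, Python's own "surrogateescape" handler
uses the same 0xDC00 offset as the current mechanism.

```python
import codecs


def _latin1_fallback(err):
    """Turn each undecodable byte into the code point with the same value."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return bad.decode("latin-1"), err.end


codecs.register_error("latin1-fallback", _latin1_fallback)


def to_windows_name_simple(posix_name: bytes) -> str:
    """Hypothetical model: valid UTF-8 decodes normally, invalid bytes
    become UTF-16 values from the ISO-8859-1 range."""
    return posix_name.decode("utf-8", "latin1-fallback")


valid = to_windows_name_simple(b"\xc3\x84")   # valid UTF-8
invalid = to_windows_name_simple(b"\xc4")     # invalid UTF-8
print(valid == invalid)                       # True: both are U+00C4

# What readdir() would hand back for the file created as b"\xc4":
print(invalid.encode("utf-8"))                # b'\xc3\x84'

# With the 0xDC00 transposition the two names stay distinct on disk.
escaped = b"\xc4".decode("utf-8", "surrogateescape")
print(escaped == valid, hex(ord(escaped)))    # False 0xdcc4
```

So the simple mapping loses the distinction between the two names on disk,
while the transposition keeps it and leaves only the readdir() conversion
to be solved.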
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com