Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)

Sun Sep 27 17:01:00 GMT 2009

2009/9/27 Corinna Vinschen:
>> > Last but not least, you cannot have both, graceful handling of invalid
>> > sequences *and* a bijective relation between UTF-16 and multibyte
>> > strings.  There's always a tradeoff.
>>
>> Correct. However, you can have correct roundtripping from any Unix
>> filename to a Windows filename and back to the same Unix filename
>> (well, with UTF-8 and singlebyte charsets anyway).
>
> What about "\xed\xb2\x80"?  That's UTF-16 0xDC80 which, if recognized
> as "special invalid byte sequence" is translated back to "\x80".

Yep, that's problematic too, which is why I was arguing against
accepting "\xed\xb2\x80" as UTF-8 in the first place, meaning it
should be treated as three invalid UTF-8 bytes, represented as:

U+DCED U+DCB2 U+DC80

But scratch that.
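
For reference, the three-surrogate representation above is what you get
from a decoder that rejects UTF-8-encoded surrogates and escapes every
invalid byte as U+DC00 + byte. Python's "surrogateescape" error handler
happens to work that way, so it serves as a quick illustration here. It is
only an analogy and not Cygwin's conversion code; the helper names
to_wide and to_bytes are made up for this sketch.

```python
# Illustration only: Python's strict UTF-8 decoder rejects encoded
# surrogates, and "surrogateescape" maps each undecodable byte to
# U+DC00 + byte. The helper names are hypothetical.

def to_wide(name: bytes) -> str:
    return name.decode("utf-8", "surrogateescape")

def to_bytes(name: str) -> bytes:
    return name.encode("utf-8", "surrogateescape")

lone = to_wide(b"\x80")            # a single invalid byte
three = to_wide(b"\xed\xb2\x80")   # UTF-8-style encoding of U+DC80

print([hex(ord(c)) for c in lone])    # ['0xdc80']
print([hex(ord(c)) for c in three])   # ['0xdced', '0xdcb2', '0xdc80']

# Both byte strings survive the trip back unchanged.
assert to_bytes(lone) == b"\x80"
assert to_bytes(three) == b"\xed\xb2\x80"
```

The cost is the one discussed above: a genuine lone U+DC80 in a Windows
filename has no UTF-8 form of its own under this scheme.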

> I'm getting headaches.

Same here. Someone ought to be shot for UTF-16.

> What about this:  The private use area U+f0xx is already used for ASCII
> chars invalid in Windows filenames.  The same range can be used for
> invalid chars > 0x80.  This could happen unconditionally.

That's a great idea, allowing both lone surrogate support and Unix
filename transparency.

[time passes]

Nope, can't think of anything wrong with it. :)
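
A rough sketch of how the U+F0xx idea could look, in Python rather than
the real C code. Everything in it is an assumption for illustration: the
names PUA_BASE, bytes_to_wide and wide_to_bytes are invented, and the rule
that a valid UTF-8 sequence for U+F000..U+F0FF is itself treated as
invalid bytes is my own addition to keep the Unix -> Windows -> Unix trip
exact. The existing mapping of ASCII chars that are invalid in Windows
filenames is left out.

```python
PUA_BASE = 0xF000  # assumed base of the private use range U+F0xx

def _seq_len(lead: int) -> int:
    """Expected UTF-8 sequence length for a lead byte, 0 if not a lead."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0

def bytes_to_wide(name: bytes) -> str:
    out = []
    i = 0
    while i < len(name):
        n = _seq_len(name[i])
        chunk = name[i:i + n]
        try:
            if n == 0 or len(chunk) < n:
                raise ValueError("not a complete UTF-8 sequence")
            # "surrogatepass" accepts encoded lone surrogates such as
            # b"\xed\xb2\x80" (U+DC80) and rejects everything else invalid.
            ch = chunk.decode("utf-8", "surrogatepass")
            if PUA_BASE <= ord(ch) <= PUA_BASE + 0xFF:
                # Assumption: reserve U+F0xx for escaped bytes only.
                raise ValueError("collides with the escape range")
            out.append(ch)
            i += n
        except ValueError:  # UnicodeDecodeError is a ValueError subclass
            out.append(chr(PUA_BASE + name[i]))  # escape one byte
            i += 1
    return "".join(out)

def wide_to_bytes(name: str) -> bytes:
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if PUA_BASE <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)  # unescape back to the raw byte
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)

# The problem case from this thread now round-trips both ways:
assert bytes_to_wide(b"\xed\xb2\x80") == "\udc80"
assert bytes_to_wide(b"\x80") == "\uf080"
for raw in (b"\x80", b"\xed\xb2\x80", b"\xef\x82\x80", b"caf\xc3\xa9"):
    assert wide_to_bytes(bytes_to_wide(raw)) == raw
```

With this, "\xed\xb2\x80" stays a lone surrogate and "\x80" gets its own
code point, so the two no longer collide. The mapping is still not
bijective in the other direction: a Windows name that already contains
U+F080 comes back from a Unix round trip as the same U+F080 only because
it is written out as the raw byte "\x80".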

Andy