Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)

Mon Sep 28 12:39:00 GMT 2009

2009/9/28 Corinna Vinschen:
>> Oh, and I thought of one more thing that won't roundtrip correctly
>> from Unix to Windows and back: a high surrogate directly followed by a
>> low surrogate, because they'll combine into a non-BMP codepoint
>> represented by a 4-byte sequence. That's near-impossible to happen by
>> chance though.
>
> There is no chance to do that right.  But I'm willing to stick to
> this trade-off since, as you wrote, it's near-impossible that somebody
> created that filename by chance.

Hmm. But what if Java or Oracle or some other CESU-8 degenerate did
that on purpose?
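
To make the collision concrete, here is a small C sketch of the two byte forms a surrogate pair can take. The helper names (put3, pair_to_cesu8, pair_to_utf8) are made up for illustration and are not taken from Cygwin or newlib. CESU-8 encodes each surrogate as its own 3-byte sequence, while UTF-8 combines the pair into one codepoint first and emits 4 bytes, so a conversion to UTF-16 and back turns the first form into the second.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write one 16-bit unit in the range 0x0800..0xFFFF
   as a 3-byte UTF-8-style sequence. */
static void put3(uint16_t u, unsigned char *out)
{
    out[0] = (unsigned char)(0xE0 | (u >> 12));
    out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (u & 0x3F));
}

/* CESU-8 style: each surrogate is encoded separately (6 bytes). */
size_t pair_to_cesu8(uint16_t hi, uint16_t lo, unsigned char *out)
{
    put3(hi, out);
    put3(lo, out + 3);
    return 6;
}

/* UTF-8: the pair is combined into one non-BMP codepoint (4 bytes). */
size_t pair_to_utf8(uint16_t hi, uint16_t lo, unsigned char *out)
{
    uint32_t cp = 0x10000 + (((uint32_t)hi - 0xD800) << 10)
                          + ((uint32_t)lo - 0xDC00);
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

For the pair D800 DC00 (U+10000) the first function produces ED A0 80 ED B0 80 and the second F0 90 80 80, which is why a filename holding the 6-byte form can't come back unchanged once the two halves have been merged.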

Just in case you're not yet completely sick of this, here's how I
think it could be done:
- Keep treating surrogate codepoints in UTF-8 as illegal.
- Go for the F0xx encoding for invalid bytes in filenames. Each of the
three bytes of a CESU-8 surrogate then turns into its own F0xx
character, and the resulting sequence will round-trip correctly.
- Encode lone surrogates on the Windows side as ^X sequences. The only
issue here is that the standard __utf8_mbtowc/wctomb could not be used
to do that.
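
A rough C sketch of the first two points, just to show the idea. The names fn_to_utf16 and fn_to_utf8 are made up for illustration and are not the actual Cygwin/newlib converters. I'm assuming the F0xx mapping is 0xF000 + byte, and as an extra assumption the UTF-8 form of a genuine U+F080..U+F0FF is itself treated as invalid so it can't collide with an escaped byte. The third point is left out because the ^X format isn't spelled out here, so a lone surrogate simply makes the encoder fail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: UTF-8 -> UTF-16. Every byte that is not part of a valid
   sequence becomes 0xF000 + byte. "out" needs room for n units. */
size_t fn_to_utf16(const unsigned char *in, size_t n, uint16_t *out)
{
    size_t i = 0, o = 0;
    assert(in != NULL && out != NULL);
    while (i < n) {
        unsigned char b = in[i];
        uint32_t cp = 0;
        size_t len = 0;          /* 0 means "not a valid lead byte" */
        if (b < 0x80)                    { cp = b;        len = 1; }
        else if (b >= 0xC2 && b <= 0xDF) { cp = b & 0x1F; len = 2; }
        else if (b >= 0xE0 && b <= 0xEF) { cp = b & 0x0F; len = 3; }
        else if (b >= 0xF0 && b <= 0xF4) { cp = b & 0x07; len = 4; }
        int ok = len != 0 && i + len <= n;
        for (size_t k = 1; ok && k < len; k++) {
            if ((in[i + k] & 0xC0) != 0x80) ok = 0;
            else cp = (cp << 6) | (in[i + k] & 0x3F);
        }
        /* Overlong forms and surrogate codepoints are illegal. */
        if (ok && len == 3 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF))) ok = 0;
        /* Assumption of this sketch: keep the escape range unambiguous. */
        if (ok && cp >= 0xF080 && cp <= 0xF0FF) ok = 0;
        if (ok && len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) ok = 0;
        if (!ok) {               /* escape this one byte, then resync */
            out[o++] = (uint16_t)(0xF000 | b);
            i++;
            continue;
        }
        if (cp >= 0x10000) {     /* non-BMP: emit a surrogate pair */
            cp -= 0x10000;
            out[o++] = (uint16_t)(0xD800 | (cp >> 10));
            out[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            out[o++] = (uint16_t)cp;
        }
        i += len;
    }
    return o;
}

/* Hypothetical: UTF-16 -> UTF-8. Escaped units turn back into their byte.
   Returns (size_t)-1 on a lone surrogate (the ^X case, not handled here).
   "out" needs room for 3 * n bytes. */
size_t fn_to_utf8(const uint16_t *in, size_t n, unsigned char *out)
{
    size_t o = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];
        if (cp >= 0xF080 && cp <= 0xF0FF) {
            out[o++] = (unsigned char)(cp & 0xFF);
            continue;
        }
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            if (cp > 0xDBFF || i + 1 >= n ||
                in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                return (size_t)-1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[++i] - 0xDC00u);
        }
        if (cp < 0x80) {
            out[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```

With this, the CESU-8 bytes ED A0 80 ED B0 80 become the six units F0ED F0A0 F080 F0ED F0B0 F080 and convert back to the same six bytes, while the proper UTF-8 form F0 90 80 80 still becomes the pair D800 DC00 and converts back to four bytes.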

Andy