Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)

Sun Sep 27 10:29:00 GMT 2009

2009/9/27 Corinna Vinschen:
> On Sep 27 07:32, Andy Koppe wrote:
>> > The __utf8_wctomb function could just create the corresponding
>> > UCS-2 values if no first half has been encountered before.  The
>> > __utf8_mbtowc function could simply allow these UCS-2 values again.
>> >
>> > That works (I just tested it) and is a small change, but is it really
>> > desirable to allow UCS-2 values in UTF-8 strings?
>> [...]
>> The pragmatic approach is tempting though, and we do have reasonable
>> grounds for it given the 16-bit wchar_t. But I think it would need to
>> work for both low and high surrogates.
>>
>> Regarding the latter, __utf8_wctomb() currently writes the first byte
>> of a four-byte sequence when it sees a high surrogate, which of course
>> it can't take back if the following codepoint isn't a low surrogate.
>> This is a problem even if lone high surrogates aren't going to be
>> supported, because that byte on its own is invalid UTF-8.
>>
>> Reading the POSIX spec, however, wctomb() is allowed to write nothing,
>> return zero, and leave the entire high surrogate to be dealt with on
>> the next call. It just says "wctomb() shall [...] return the number of
>> bytes that constitute the character corresponding to the value of
>> wchar", and unlike with mbtowc(), a return value of zero is not
>> defined to have special meaning.
>>
>> There's also room to deal with a lone high surrogate at string end:
>> "If wchar is 0, a null byte shall be stored, preceded by any shift
>> sequence needed to restore the initial shift state, and wctomb() shall
>> be left in the initial shift state."
>
> It never occurred to me that wcrtomb could return 0 and the calling
> functions like wcsnrtombs would simply proceed.  I'll have a look
> to change __utf8_wctomb accordingly.

Two further thoughts on allowing lone surrogates:
- __mb_cur_max for UTF-8 would need to go up to 6 to allow for a lone
high surrogate (three bytes) followed by a three-byte char.
- Due to the DCxx scheme, the three-byte UTF-8 encoding of DCxx would
roundtrip to a single-byte xx. Changing the code to something other
than DCxx wouldn't help.
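
To make the zero-return idea and the 6-byte worst case concrete, here is
a rough sketch. It is not the newlib code: sketch_state, sketch_wctomb
and put_bmp are hypothetical names, and it assumes a lone surrogate is
simply written with the ordinary three-byte encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state: a pending high surrogate, or 0 if none. */
typedef struct { uint16_t pending_high; } sketch_state;

static int is_high(uint16_t w) { return w >= 0xD800 && w <= 0xDBFF; }
static int is_low(uint16_t w)  { return w >= 0xDC00 && w <= 0xDFFF; }

/* Write a 16-bit value (including a lone surrogate) as 1 to 3 bytes. */
static size_t put_bmp(unsigned char *s, uint16_t w)
{
    if (w < 0x80) {
        s[0] = (unsigned char)w;
        return 1;
    }
    if (w < 0x800) {
        s[0] = (unsigned char)(0xC0 | (w >> 6));
        s[1] = (unsigned char)(0x80 | (w & 0x3F));
        return 2;
    }
    s[0] = (unsigned char)(0xE0 | (w >> 12));
    s[1] = (unsigned char)(0x80 | ((w >> 6) & 0x3F));
    s[2] = (unsigned char)(0x80 | (w & 0x3F));
    return 3;
}

/* Assumed behaviour: a high surrogate writes nothing and returns 0;
   the next call either completes the pair or flushes it as a lone
   surrogate. The buffer s must have room for 6 bytes. */
size_t sketch_wctomb(unsigned char *s, uint16_t w, sketch_state *st)
{
    size_t n = 0;

    if (st->pending_high) {
        uint16_t hi = st->pending_high;
        st->pending_high = 0;
        if (is_low(w)) {
            /* Valid pair: emit the four-byte sequence in one go. */
            uint32_t cp = 0x10000
                + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(w - 0xDC00));
            s[0] = (unsigned char)(0xF0 | (cp >> 18));
            s[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            s[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            s[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
        /* No low surrogate followed: flush the lone high one. */
        n = put_bmp(s, hi);
    }

    if (is_high(w)) {
        /* Defer: nothing is written for w itself on this call. */
        st->pending_high = w;
        return n;
    }
    return n + put_bmp(s + n, w);
}
```

In this sketch the worst case for a single call is a flushed lone high
surrogate (3 bytes) plus a three-byte char, i.e. 6 bytes. A high
surrogate still pending at string end is flushed by the call with
wchar 0, ahead of the null byte.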

Andy