GB18030 (was: Re: charset changes)

Sat Mar 27 17:02:00 GMT 2010

On 27 March 2010 13:33, Corinna Vinschen wrote:
> On Mar 27 06:47, Andy Koppe wrote:
>> I think the conclusion from all this is that approach 2 is the least
>> broken way to handle GB18030: when encountering a 4-byte sequence that
>> maps to a non-BMP char (and hence a UTF-16 surrogate pair), write the
>> high surrogate and report that one byte less than actually seen has
>> been consumed. On the next mbtowc call, ignore the input, write the
>> low surrogate, and report that 1 byte has been consumed.
>>
>> As mentioned, this breaks the mbtowc spec when bytes are fed in
>> one-by-one, because in that case zero needs to be returned after the
>> high surrogate, yet zero is meant to signal string end. An application
>> that's aware of that can work around it by checking whether the wide
>> character that's written actually is null, but in others it may cause
>> truncated strings. Fortunately, the mbstowcs implementation isn't
>> affected by this, because that always passes as many bytes as possible
>> to mbtowc, i.e. the incorrect zero return can't occur there.
>>
>> The MultiByteToWideChar() function doesn't have a way to tell
>> incomplete from invalid sequences, which is needed to decide whether
>> to return -2 or -1 from mbtowc. "Interestingly", if you give it only
>> two bytes of a 4-byte GB18030 sequence, e.g. \x95 \x33, it interprets
>> that as a one-byte invalid sequence followed by the digit '3'.
>
> Huh?  How did you test that?  AFAIK MultiByteToWideChar doesn't
> tell you how many and which bytes it treated as a valid substring.

On Vista and 7, if you pass those two bytes to MultiByteToWideChar,
you get back the codepage's UnicodeDefaultChar followed by the digit
'3'. XP did something else, but I can't remember exactly what.
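
For illustration, the incomplete-versus-invalid distinction that
MultiByteToWideChar can't provide can be made from the byte ranges
alone. This is only a rough sketch: the function name and result codes
are made up, it checks structure rather than whether a sequence is
actually assigned, and treating 0x80 and 0xFF as invalid lead bytes is
an assumption (codepage 54936 may handle 0x80 differently).

```c
#include <stddef.h>

/* Hypothetical result codes, mirroring the -1/-2 convention above. */
enum { GB_INVALID = -1, GB_INCOMPLETE = -2 };

/* Classify the GB18030 sequence starting at s, given n > 0 bytes.
   Returns 1, 2 or 4 for a structurally complete sequence,
   GB_INCOMPLETE if the bytes seen so far are a proper prefix of one,
   or GB_INVALID if they can't start any sequence. */
static int gb18030_seq_len(const unsigned char *s, size_t n)
{
    if (s[0] < 0x80)
        return 1;                       /* ASCII */
    if (s[0] == 0x80 || s[0] == 0xFF)
        return GB_INVALID;              /* assumption: not valid lead bytes */
    if (n < 2)
        return GB_INCOMPLETE;
    if ((s[1] >= 0x40 && s[1] <= 0x7E) || (s[1] >= 0x80 && s[1] <= 0xFE))
        return 2;                       /* two-byte sequence */
    if (s[1] < 0x30 || s[1] > 0x39)
        return GB_INVALID;
    if (n < 3)
        return GB_INCOMPLETE;           /* e.g. \x95 \x33 */
    if (s[2] < 0x81 || s[2] > 0xFE)
        return GB_INVALID;
    if (n < 4)
        return GB_INCOMPLETE;
    if (s[3] < 0x30 || s[3] > 0x39)
        return GB_INVALID;
    return 4;                           /* four-byte sequence */
}
```

With this, the \x95 \x33 example is reported as incomplete rather than
as an invalid byte followed by a '3'.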

>> Therefore I think the best thing to do is to manually parse GB18030
>> sequences, which is fairly straightforward, and only hand complete
>> sequences over to MultiByteToWideChar for translation to UTF-16. Shall
>> I have a go at that?
>
> I would really be glad.  You'd just create two functions __gb18030_mbtowc
> and __gb18030_wctomb in strfuncs.cc, and I could easily add it to newlib's
> setlocale_r.  Oh, and then there's check_codepage in nlsfuncs.cc which
> needs to test if codepage 54936 is installed.
>
> However, here's a problem.  Adding these functions is non-trivial code
> and requires a copyright assignment... sigh.

How about implementing __gb18030_mbtowc/wctomb in newlib, which would
handle all the mbstate stuff, with the actual encoding and decoding
factored out into functions like this:

size_t __gb18030_encode(char *dst, const wchar_t *src, size_t src_len):
Pass in one codepoint, consisting of one or two wchars (always one in
case of a 32-bit wchar_t). Return the length of the resulting
multibyte sequence.

size_t __gb18030_decode(wchar_t *dst, const char *src, size_t src_len):
Pass in a valid multibyte sequence. Return the number of wchars
needed to represent it.
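
To make the one-or-two-wchars case concrete, here is a rough sketch of
the non-BMP part only, where GB18030's four-byte range
0x90308130..0xE3329A35 maps linearly onto U+10000..U+10FFFF and so
needs no table. The helper names and simplified signatures are
hypothetical, not the proposed interface, and uint16_t stands in for a
16-bit wchar_t. BMP characters would still need a mapping table or the
Windows codepage.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode a four-byte sequence from the
   supplementary range into a UTF-16 surrogate pair. Returns the number
   of 16-bit units written (2), or 0 if src is outside that range. */
static size_t gb18030_decode_supp(uint16_t *dst, const unsigned char *src)
{
    if (src[0] < 0x90 || src[0] > 0xE3 || src[1] < 0x30 || src[1] > 0x39 ||
        src[2] < 0x81 || src[2] > 0xFE || src[3] < 0x30 || src[3] > 0x39)
        return 0;
    /* Linear index relative to U+10000. */
    uint32_t idx = (((uint32_t)(src[0] - 0x90) * 10 + (src[1] - 0x30)) * 126
                    + (src[2] - 0x81)) * 10 + (src[3] - 0x30);
    if (idx > 0xFFFFF)
        return 0;                       /* beyond U+10FFFF */
    dst[0] = (uint16_t)(0xD800 | (idx >> 10));    /* high surrogate */
    dst[1] = (uint16_t)(0xDC00 | (idx & 0x3FF));  /* low surrogate */
    return 2;
}

/* Hypothetical helper: the inverse direction. Returns the number of
   bytes written (4), or 0 if hi/lo is not a valid surrogate pair. */
static size_t gb18030_encode_supp(unsigned char *dst, uint16_t hi, uint16_t lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return 0;
    uint32_t idx = ((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00);
    dst[3] = (unsigned char)(0x30 + idx % 10);  idx /= 10;
    dst[2] = (unsigned char)(0x81 + idx % 126); idx /= 126;
    dst[1] = (unsigned char)(0x30 + idx % 10);  idx /= 10;
    dst[0] = (unsigned char)(0x90 + idx);
    return 4;
}
```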

On Cygwin, these would be straightforward wrappers around
WideCharToMultiByte and MultiByteToWideChar with codepage 54936,
implemented in winsup. For other newlib targets, we could take an
approach similar to the one for double-byte charsets, where multibyte
sequences are mapped to a non-Unicode wchar_t representation by simply
packing the bytes into the wchar_t.
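
A minimal sketch of what such packing might look like, assuming a
32-bit wchar_t (a four-byte sequence can't fit in 16 bits). The names
are hypothetical and uint32_t stands in for wchar_t. Since GB18030 lead
bytes are non-zero and there are no three-byte sequences, the sequence
length can be recovered from the magnitude of the packed value.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack a 1-, 2- or 4-byte sequence big-endian into one 32-bit value. */
static uint32_t pack_mb(const unsigned char *src, size_t len)
{
    uint32_t wc = 0;
    for (size_t i = 0; i < len && i < 4; i++)
        wc = (wc << 8) | src[i];
    return wc;
}

/* Unpack again; returns the number of bytes written to dst. */
static size_t unpack_mb(unsigned char *dst, uint32_t wc)
{
    size_t len = wc > 0xFFFFFF ? 4 : wc > 0xFFFF ? 3 : wc > 0xFF ? 2 : 1;
    for (size_t i = 0; i < len; i++)
        dst[i] = (unsigned char)(wc >> (8 * (len - 1 - i)));
    return len;
}
```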

Andy