Sat Mar 27 13:33:00 GMT 2010
On Mar 27 06:47, Andy Koppe wrote:
> I think the conclusion from all this is that approach 2 is the least
> broken way to handle GB18030: when encountering a 4-byte sequence that
> maps to a non-BMP char (and hence a UTF-16 surrogate pair), write the
> high surrogate and report that one byte less than actually seen has
> been consumed. On the next mbtowc call, ignore the input, write the
> low surrogate, and report that 1 byte has been consumed.
> As mentioned, this breaks the mbtowc spec when bytes are fed in
> one-by-one, because in that case zero needs to be returned after the
> high surrogate, yet zero is meant to signal string end. An application
> that's aware of that can work around it by checking whether the wide
> character that's written actually is null, but in others it may cause
> truncated strings. Fortunately, the mbstowcs implementation isn't
> affected by this, because that always passes as many bytes as possible
> to mbtowc, i.e. the incorrect zero return can't occur there.
> The MultiByteToWideChar() function doesn't have a way to tell
> incomplete from invalid sequences, which is needed to decide whether
> to return -2 or -1 from mbtowc. "Interestingly", if you give it only
> two bytes of a 4-byte GB18030 sequence, e.g. \x95 \x33, it interprets
> that as a one-byte invalid sequence followed by the digit '3'.
Huh? How did you test that? AFAIK, MultiByteToWideChar doesn't
tell you how many or which bytes it treated as a valid substring.
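For illustration only, here is a rough sketch of how the incomplete vs.
invalid distinction could be made without MultiByteToWideChar, based purely
on the GB18030 byte ranges. The name gb18030_seq_len is hypothetical and not
an existing Cygwin or newlib function, and treating 0x80 and 0xff as invalid
is an assumption that codepage 54936 may not share.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Cygwin or newlib.
   Looks at the first n bytes of s and returns
     1, 2 or 4  length of the complete GB18030 sequence at s,
     -1         the bytes cannot start a valid sequence,
     -2         the bytes are a valid but incomplete prefix.
   Assumption: 0x80 and 0xff are rejected as lead bytes.  */
static int
gb18030_seq_len (const unsigned char *s, size_t n)
{
  if (n == 0)
    return -2;
  if (s[0] <= 0x7f)                     /* plain ASCII */
    return 1;
  if (s[0] < 0x81 || s[0] > 0xfe)       /* 0x80 or 0xff */
    return -1;
  if (n < 2)
    return -2;
  /* Second byte 0x40..0x7e or 0x80..0xfe: two-byte sequence.  */
  if ((s[1] >= 0x40 && s[1] <= 0x7e) || (s[1] >= 0x80 && s[1] <= 0xfe))
    return 2;
  /* Otherwise only a digit 0x30..0x39 can continue a 4-byte sequence.  */
  if (s[1] < 0x30 || s[1] > 0x39)
    return -1;
  if (n < 3)
    return -2;
  if (s[2] < 0x81 || s[2] > 0xfe)
    return -1;
  if (n < 4)
    return -2;
  if (s[3] < 0x30 || s[3] > 0x39)
    return -1;
  return 4;
}
```

With this, the two bytes \x95 \x33 from the example above are classified as
an incomplete prefix (-2) rather than as an invalid byte followed by '3'.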
> Therefore I think the best thing to do is to manually parse GB18030
> sequences, which is fairly straightforward, and only hand complete
> sequences over to MultiByteToWideChar for translation to UTF-16. Shall
> I have a go at that?
I would really be glad. You'd just create two functions, __gb18030_mbtowc
and __gb18030_wctomb, in strfuncs.cc, and I could easily add them to newlib's
setlocale_r. Oh, and then there's check_codepage in nlsfuncs.cc, which
needs to test whether codepage 54936 is installed.
However, there's a problem. These functions are non-trivial code, so
adding them requires a copyright assignment... sigh.
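As a reference point for the surrogate handling in approach 2 quoted at the
top, here is a minimal sketch of the bookkeeping involved. It is not the
proposed __gb18030_mbtowc: struct gb_state and emit_utf16 are hypothetical
names, and the code point is assumed to have been decoded already (the
actual translation step is left out).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical conversion state: a low surrogate still waiting to be
   written, or 0 if nothing is pending.  */
struct gb_state
{
  uint16_t pending_low;
};

/* Hypothetical sketch of approach 2.  cp is the code point decoded from
   a sequence that this call completed by consuming len input bytes.
   Writes one UTF-16 unit to *out and returns the byte count to report.  */
static int
emit_utf16 (struct gb_state *st, uint32_t cp, int len, uint16_t *out)
{
  if (st->pending_low)
    {
      /* Second call: ignore the input, write the low surrogate and
         report the one byte that was held back.  */
      *out = st->pending_low;
      st->pending_low = 0;
      return 1;
    }
  if (cp < 0x10000)
    {
      /* BMP character: a single UTF-16 unit, all bytes consumed.  */
      *out = (uint16_t) cp;
      return len;
    }
  /* Non-BMP character: write the high surrogate now, remember the low
     surrogate, and report one byte less than actually seen.  */
  cp -= 0x10000;
  *out = (uint16_t) (0xd800 + (cp >> 10));
  st->pending_low = (uint16_t) (0xdc00 + (cp & 0x3ff));
  return len - 1;
}
```

If the last byte of a 4-byte sequence arrives on its own (len == 1), the
first call returns 0 even though a non-null high surrogate was written,
which is exactly the ambiguity with the string-end return described above.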
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com