This is the mail archive of the cygwin-developers mailing list for the Cygwin project.


Re: charset changes

From: Thomas Wolff <towo at towo dot net>
To: cygwin-developers at cygwin dot com
Date: Sat, 06 Feb 2010 01:31:42 +0100
Subject: Re: charset changes
References: <416096c61001230305x20619d39x55e3a46b428ba@mail.gmail.com> <4B6C474B.7090600@towo.net> <20100205215047.GX28659@calimero.vinschen.de> <416096c61002051514w5bb56b0bj5baeb0c65c7aece@mail.gmail.com>

Andy Koppe wrote:

>>> - .gb18030 (in zh_CN.gb18030)
>>
>> it's supported by Windows XP and later. Maybe we should add it after 1.7.2?
>
> I doubt whether it's possible to support it correctly. GB18030 is a monster of an encoding, with 1-byte, 2-byte and 4-byte character codes and a huge lookup table to map between GB18030 codepoints and Unicode.

It's kind of a monster but doesn't need a huge table. It's a superset of GBK and most of the superset range can be transformed from/to Unicode algorithmically. I do handle GB18030 in mined; if you point me to where in newlib this is handled, I may try a patch.
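
To illustrate the algorithmic part: the four-byte range that covers the supplementary planes (90 30 81 30 through E3 32 9A 35) maps linearly onto U+10000..U+10FFFF. What follows is only a minimal sketch of that one range. The function names are hypothetical, not taken from mined or newlib, and the four-byte sequences that map into the BMP are not covered here.

```c
#include <stdint.h>

/* Linear index of a structurally valid four-byte GB18030 sequence.
   Bytes 1 and 3 range over 0x81..0xFE (126 values),
   bytes 2 and 4 range over 0x30..0x39 (10 values). */
static uint32_t gb18030_linear(const unsigned char b[4])
{
    return (((uint32_t)(b[0] - 0x81) * 10 + (uint32_t)(b[1] - 0x30)) * 126
            + (uint32_t)(b[2] - 0x81)) * 10 + (uint32_t)(b[3] - 0x30);
}

/* Hypothetical helper: map the supplementary-plane range
   90 30 81 30 .. E3 32 9A 35 onto U+10000 .. U+10FFFF.
   Returns 0 on success, -1 if the bytes are outside that range. */
int gb18030_supp_to_ucs(const unsigned char b[4], uint32_t *cp)
{
    static const unsigned char first[4] = { 0x90, 0x30, 0x81, 0x30 };
    static const unsigned char last[4]  = { 0xE3, 0x32, 0x9A, 0x35 };
    uint32_t lin;

    /* Reject anything that is not a well-formed four-byte sequence. */
    if (b[0] < 0x81 || b[0] > 0xFE || b[1] < 0x30 || b[1] > 0x39 ||
        b[2] < 0x81 || b[2] > 0xFE || b[3] < 0x30 || b[3] > 0x39)
        return -1;

    lin = gb18030_linear(b);
    if (lin < gb18030_linear(first) || lin > gb18030_linear(last))
        return -1;

    /* The range is contiguous, so the offset carries over directly. */
    *cp = 0x10000 + (lin - gb18030_linear(first));
    return 0;
}
```

The reverse direction is the same arithmetic run backwards: subtract 0x10000, add the index of the first sequence, and split the result with successive divisions by 10, 126 and 10.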

While some of the more exotic locales and encodings I had listed can safely be ignored for a good locale implementation, GB18030 has official status in China: its support was even declared mandatory for any software system to be acceptable for official purposes, or something like that. I don't know whether this is still enforced strictly, but I guess it's worth handling.

> The first issue is that MultiByteToWideChar doesn't distinguish
> between incomplete and invalid sequences, which is needed to correctly
> implement mbrtowc. Say we've seen three bytes, and MultiByteToWideChar
> returns zero when looking at those. That could mean that we're still
> missing the fourth byte, in which case we should return -2, or it
> could mean the sequence is already invalid, in which case we should
> return -1.
>
> Hence I suspect Cygwin would need to do its own parsing of GB18030 and
> only hand complete sequences over to MultiByteToWideChar to map them
> to Unicode.
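
The "own parsing" step could look roughly like the sketch below. It checks only the byte structure of GB18030 (one byte up to 0x7F, two bytes, or four bytes with digits 0x30..0x39 in positions two and four) and says nothing about whether a complete sequence is actually assigned a character. The function name and the enum are hypothetical; the values -1 and -2 are chosen to mirror the mbrtowc return conventions mentioned above.

```c
#include <stddef.h>

enum gb_status { GB_INVALID = -1, GB_INCOMPLETE = -2 };

/* Hypothetical helper: inspect the first n bytes at s.
   Returns 1, 2 or 4 if they begin with a structurally complete sequence,
   GB_INCOMPLETE if more bytes could still complete one,
   GB_INVALID if no continuation can make the prefix valid. */
int gb18030_classify(const unsigned char *s, size_t n)
{
    if (n == 0)
        return GB_INCOMPLETE;
    if (s[0] <= 0x7F)
        return 1;                       /* single byte */
    if (s[0] == 0x80 || s[0] == 0xFF)
        return GB_INVALID;              /* not a valid lead byte */

    if (n < 2)
        return GB_INCOMPLETE;
    if ((s[1] >= 0x40 && s[1] <= 0x7E) || (s[1] >= 0x80 && s[1] <= 0xFE))
        return 2;                       /* two-byte sequence */
    if (s[1] < 0x30 || s[1] > 0x39)
        return GB_INVALID;              /* neither trail byte nor digit */

    if (n < 3)
        return GB_INCOMPLETE;
    if (s[2] < 0x81 || s[2] > 0xFE)
        return GB_INVALID;

    if (n < 4)
        return GB_INCOMPLETE;
    if (s[3] < 0x30 || s[3] > 0x39)
        return GB_INVALID;
    return 4;                           /* four-byte sequence */
}
```

With three bytes seen, 81 30 81 comes back as incomplete while 81 30 20 comes back as invalid, which is exactly the distinction that a zero return from MultiByteToWideChar cannot express.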
> But having done that, there's another problem: four-byte GB18030 sequences may map both to BMP and non-BMP Unicode codepoints. With Cygwin's wchar being 16-bit, this means that two wchars may have to be returned for one GB18030 sequence. Yet mbrtowc can only return one wchar, and unlike with UTF-8, there's no way to tell before the last byte whether two wchars are needed. I don't see a way to address that without bending the mbrtowc spec.
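
To make the wchar count concrete: 81 30 81 30 maps to U+0080, which fits in one 16-bit unit, while 90 30 81 30 maps to U+10000, which needs a surrogate pair. The sketch below shows only the standard UTF-16 split of a code point; the function name is hypothetical, and it does not attempt to answer how mbrtowc should hand back the second unit.

```c
#include <stdint.h>

/* Hypothetical helper: encode a Unicode code point as UTF-16.
   Returns the number of 16-bit units written to out (1 or 2),
   or -1 if cp is a surrogate or lies beyond U+10FFFF. */
int ucs_to_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return -1;

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* BMP: a single unit suffices */
        return 1;
    }

    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For U+10000 this yields the pair D800 DC00, so a single four-byte input sequence produces two 16-bit output units.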

This sounds tricky; my encoding support in mined is not based on the wchar functions, so it may not be straightforward to map it into newlib, but I'll see. On the other hand, other systems do handle it, too, so there is an open source solution...

Thomas

Follow-Ups:
- Re: charset changes
  - From: Andy Koppe

References:
- Re: charset changes
  - From: Thomas Wolff
- Re: charset changes
  - From: Corinna Vinschen
- Re: charset changes
  - From: Andy Koppe
