More about charsets

Andy Koppe
Sat Mar 27 16:11:00 GMT 2010

Corinna Vinschen:
> while looking into the GB18030 issue once again, I found that we still
> may have two holes which might be important to support.
> - GB2312 aka EUC-CN
>  We already support GBK, codepage 936.  GB2312/EUC-CN is a subset
>  of GBK and apparently GBK is often used while still labeled as
>  GB2312.  See the discussion here:
>  So the question is, should we just allow GB2312 and EUC-CN as
>  codeset names, but use the GBK conversion functions for them?

Might as well. As you saw, mintty already does that. Thomas Wolff's
mined goes even further and handles both GB2312 and GBK with its
GB18030 codec, because GBK is a subset of GB18030.
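
To illustrate the subset relationship, here is a minimal Python sketch. It is not taken from Cygwin, mintty or mined: Python's built-in gb2312, gbk and gb18030 codecs stand in for the conversion functions, and the helper names decodes_identically and encodable are made up for this example. It checks that bytes from a strict GB2312 encoder decode to the same text with the wider decoders, and that a GBK-only character is rejected by the strict GB2312 encoder.

```python
# Illustrative only: Python's built-in codecs stand in for the
# conversion functions discussed above.

SAMPLE = "\u4e2d\u6587"   # two characters that are in GB2312
GBK_ONLY = "\u9555"       # a character in GBK and GB18030 but not in GB2312


def decodes_identically(data, codecs=("gb2312", "gbk", "gb18030")):
    """Return True if every codec in `codecs` decodes `data` to the same text."""
    results = {data.decode(name) for name in codecs}
    return len(results) == 1


def encodable(text, codec):
    """Return True if `text` can be encoded with `codec` without errors."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Bytes produced by the strict GB2312 encoder.
gb2312_bytes = SAMPLE.encode("gb2312")

if __name__ == "__main__":
    # Same text comes back whichever of the three decoders is used.
    print(decodes_identically(gb2312_bytes))
    # The GBK-only character needs the wider encoder.
    print(encodable(GBK_ONLY, "gb2312"), encodable(GBK_ONLY, "gbk"))
```

The subset relationship only guarantees the input direction: text written under the wider conversion can contain characters that a strict GB2312 consumer would reject.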

>  Otherwise, there's also a codepage 51936, which is called EUC-CN
>  in the list at
>  I didn't test it, but it appears to be the real GB2312.  I don't
>  know if it really makes sense to make the difference, though.

Also, it isn't available on any Windows I've tried.

> - EUC-TW
>  There's a codepage 51950 which appears to be something like EUC-TW.
>  I just found this, though:
>  Andy, is that a general rule?  Or did you test on XP and the codepage
>  was just not installed, by any chance?

It doesn't show up as an option on XP, and I've just tried it again on
Windows 7, where codepages are no longer optional. It doesn't work
there either. I think I'd read somewhere that 51950 is only available
to .NET programs, but unfortunately I can't find that again. I guess
it's possible that Chinese Windows versions do support it anyway,
although Wikipedia describes EUC-TW as "rarely used".

> We certainly have other holes as well, but for OS usage I don't see
> any other codeset which would be that important.

I agree.

