charset changes

Corinna Vinschen
Fri Feb 5 21:51:00 GMT 2010

On Feb  5 17:28, Thomas Wolff wrote:
> On 23.01.2010 12:05, Andy Koppe wrote:
> >I'm in awe at Corinna's latest locale changes. Getting closer and
> >closer to the real thing.
> Me too.

Thanks.  I'm still mulling over the LC_MESSAGES problem.  The
information is just not available in Windows so I assume we need a
file-based solution.  But that's certainly nothing for 1.7.2.

> I found the following inconsistencies, and since the agreed strategy
> seems to be to prefer Linux compatibility over Windows mapping,
> I think especially the first group of a few incompatible mappings
> should be fixed before the 1.7.2 release.
> ------------------------------------------------------------------------
> These locales have inconsistent encodings:
> Locale  Linux           Cygwin
> et_EE   ISO-8859-1      ISO-8859-15

In the latest glibc-2.11, per the localedata/locales/et_EE file,
there's only one charset, ISO-8859-15.  I have no glibc-2.11 based
system running, but on Fedora 11 (glibc-2.10) it's ISO-8859-1.  Hmm.

> ka_GE   GEORGIAN-PS     UTF-8

We don't have an implementation of GEORGIAN-PS.  If you provide one for
newlib, I'll take it.  The problem is, how to integrate it into the
existing model which only has ISO and CPxxx codeset arrays for ctype and
wide char conversion?  Faking a non-existent Windows codepage?  That's
probably the easiest solution.

> kk_KZ   PT154           ISO-8859-5

Same here.

> sr_CS   ISO-8859-5      UTF-8

Doesn't exist in newer glibc since it's superseded by sr_RS and sr_ME.
I just mapped it to sr_RS.  Is it really worth special-casing, given
that it's outdated?  Incidentally, on Fedora 11 you get ANSI_X3.4-1968.

> uz_UZ   ISO-8859-1      UTF-8

Thanks, fixed.

> zh_HK   BIG5-HKSCS      BIG5
> - zh_HK is the dedicated Hongkong locale, so should use the Hongkong
> extension

How?  This special variation of Big5 charset isn't supported by Windows
and we need Windows support for multibyte charsets other than UTF-8.
Per MSDN,
codepage 950 is "ANSI/OEM Traditional Chinese (Taiwan; Hong Kong SAR,
PRC); Chinese Traditional (Big5)".  That has to be sufficient, unless
you provide a Big5/Big5-HKSCS multibyte <-> Unicode conversion with a
Cygwin-compatible license.

> - With respect to other differences above, linux has these two
> distinguished locales:
>         et_EE.iso885915 ISO-8859-15

All the .charset variants are automatically available.  There's no
special code required.  If you like, just specify de_DE.KOI8-U.
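
To illustrate why the .charset variants need no per-locale code, here is
a minimal Python sketch that splits a locale string generically into
language, territory, charset and modifier.  split_locale is a
hypothetical helper made up for this example; it is not a newlib or
Cygwin function and does not claim to mirror the actual parser.

```python
def split_locale(spec):
    """Split 'lang_TERR.charset@modifier' into its four parts.

    Parts that are absent come back as None.  Hypothetical helper for
    illustration only; not how newlib actually parses locale names.
    """
    modifier = None
    charset = None
    territory = None
    # Peel the optional parts off from the right: @modifier first,
    # then .charset, then _TERRITORY.
    if "@" in spec:
        spec, modifier = spec.split("@", 1)
    if "." in spec:
        spec, charset = spec.split(".", 1)
    if "_" in spec:
        spec, territory = spec.split("_", 1)
    return spec, territory, charset, modifier
```

Since the charset is just one independent field, any known charset can
in principle be combined with any known language/territory pair.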

>         uz_UZ@cyrillic  UTF-8

Added and documented (together with tt_RU@iqtelif).

> - getlocale -a lists the following twice, without indicating a difference:
>         sr_SP
>         sr_BA
>         az_AZ
>         se_FI
>         uz_UZ (see above)

Yeah, that's how the very simple mechanism works.  On W7 there are even
more duplicates.  I didn't want to make it more complicated than
necessary.  After all, its job is just to provide what's available on
the system.

> ------------------------------------------------------------------------
> Also, some generic encoding suffixes are not handled:
> - .iso885915 and .iso8859-15 (cygwin only recognizes .iso-8859-15
> and its capital)
> - .koi8r (cygwin only recognizes .koi8-r and .KOI8-R)
> - .koi8u (cygwin only recognizes .koi8-u and .KOI8-U)

I just applied code to newlib that allows specifying iso-8859 and koi8
charsets without dashes.
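
The idea of accepting the dash-less spellings can be sketched as
follows.  This is only an illustration in Python, not the newlib code;
normalize_charset and its canonical output form are assumptions made up
for this example.

```python
import re


def normalize_charset(name):
    """Map iso-8859 and koi8 spellings, with or without dashes, to one form.

    Hypothetical helper; names it does not recognize are returned
    unchanged.
    """
    lower = name.lower()
    # iso885915, iso8859-15, iso-8859-15 -> ISO-8859-15
    m = re.fullmatch(r"iso-?8859-?(\d{1,2})", lower)
    if m:
        return "ISO-8859-" + m.group(1)
    # koi8r, koi8-r -> KOI8-R; likewise for koi8u
    m = re.fullmatch(r"koi8-?([ru])", lower)
    if m:
        return "KOI8-" + m.group(1).upper()
    return name
```

With this, et_EE.iso885915 and et_EE.ISO-8859-15 would end up selecting
the same charset.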

> - .tcvn (in vi_VN.tcvn)

Codeset not supported.

> - .gb18030 (in zh_CN.gb18030)

Ditto.  However, it's supported by Windows XP and later.  Maybe we
should add it after 1.7.2?

> - .eucjp (in ja_JP.eucjp)

This one *is* recognized by newlib, same as euckr/euc-kr.

> - .euctw (in zh_TW.euctw)

Codeset not supported.  Wikipedia claims that EUC-TW isn't widely used
and Big5 is much more common in TW.

>   (Maybe the latter lack Windows support or depend on Windows
> configuration...)
> - .koi8t
> - .armscii8
> - .big5hkscs
> - .gb2312
> - .georgianps
> - .pt154

None of them is supported.  Yet!  Insofar as they are single-byte charsets
we should be able to add them easily by providing ctype and widechar
conversion tables to newlib.  Care to contribute?
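
For anyone wondering what such a wide char conversion table amounts to,
here is a minimal Python sketch of the table-driven approach for a
single-byte charset.  It uses ISO-8859-5 as the example only because its
layout is well known; the table layout and the names
build_iso8859_5_table, to_wide and from_wide are assumptions for
illustration and do not reflect newlib's actual data structures.  A
table for GEORGIAN-PS, PT154 and the like would have to be filled from
an authoritative mapping file instead.

```python
def build_iso8859_5_table():
    """Return a 256-entry list: byte value -> Unicode code point."""
    table = list(range(256))  # 0x00-0xA0 map to the same code point
    for byte in range(0xA1, 0x100):
        # Most of the upper half is a straight run of the Cyrillic block.
        table[byte] = 0x0400 + (byte - 0xA0)
    # The three exceptions to the straight run:
    table[0xAD] = 0x00AD  # SOFT HYPHEN
    table[0xF0] = 0x2116  # NUMERO SIGN
    table[0xFD] = 0x00A7  # SECTION SIGN
    return table


TABLE = build_iso8859_5_table()
# Reverse lookup for the wide char -> single-byte direction.
REVERSE = {cp: byte for byte, cp in enumerate(TABLE)}


def to_wide(data, table=TABLE):
    """Convert single-byte encoded bytes to a Unicode string."""
    return "".join(chr(table[b]) for b in data)


def from_wide(text, reverse=REVERSE):
    """Convert a Unicode string back; raises KeyError if unmappable."""
    return bytes(reverse[ord(ch)] for ch in text)
```

Each additional single-byte charset then only needs its own 256-entry
table; the conversion functions stay the same.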

> - .ujis (-> EUC-JP)

That's just another name for euc-jp?  Let's ignore that for now.

> ------------------------------------------------------------------------
> These locales are not known or handled on cygwin at all:

As documented, with 1.7.2 we start to support only locales which are
also supported by the underlying Windows (with the weird sr_SP/CS/RS/ME
exception).  It doesn't make sense to support locales for which the
underlying Windows has no locale-specific LC_COLLATE/LC_MONETARY/
LC_NUMERIC/LC_TIME information available.  That's the reason I provided
getlocale.exe, so that you can find out which locales are supported by
your Windows.

And, btw., your list is not quite correct.  I don't know on which
Windows you tested that, but on Windows 7 the following locales *are*

> am_ET   UTF-8
> bn_BD   UTF-8
> bo_CN   UTF-8
> br_FR   ISO-8859-1
> en_IN   UTF-8
> en_SG   ISO-8859-1
> es_US   ISO-8859-1
> ga_IE   ISO-8859-1
> gd_GB   ISO-8859-15
> ha_NG   UTF-8
> hsb_DE  ISO-8859-2
> ig_NG   UTF-8
> iu_CA   UTF-8
> kl_GL   ISO-8859-1
> km_KH   UTF-8
> lo_LA   UTF-8
> ne_NP   UTF-8
> no_NO   ISO-8859-1

no_NO is not, but nn_NO is.  If no_NO really makes sense, we could map it to nn_NO.
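
Such a mapping could be as simple as a small alias table consulted
before the locale is looked up.  A minimal Python sketch follows;
LOCALE_ALIASES and resolve_alias are hypothetical names, the
sr_CS -> sr_RS entry reflects the mapping mentioned above, and the
no_NO -> nn_NO entry is only the suggestion under discussion, not
something that exists.

```python
# Hypothetical alias table: outdated or unsupported name -> supported name.
LOCALE_ALIASES = {
    "sr_CS": "sr_RS",  # superseded in newer glibc
    "no_NO": "nn_NO",  # proposed mapping, not implemented
}


def resolve_alias(spec):
    """Replace the lang_TERR part of a locale string if it is aliased.

    Any .charset or @modifier suffix is carried over unchanged.
    """
    # The base name ends at the first '.' or '@', whichever comes first.
    cuts = [i for i in (spec.find("."), spec.find("@")) if i != -1]
    cut = min(cuts) if cuts else len(spec)
    base, suffix = spec[:cut], spec[cut:]
    return LOCALE_ALIASES.get(base, base) + suffix
```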

> nso_ZA  UTF-8
> oc_FR   ISO-8859-1
> or_IN   UTF-8
> rw_RW   UTF-8
> si_LK   UTF-8
> tg_TJ   KOI8-T
> tk_TM   UTF-8
> ug_CN   UTF-8
> wo_SN   UTF-8
> yo_NG   UTF-8
> ------------------------------------------------------------------------
> And finally, some systems (e.g. Fedora) maintain a number of
> full-word locales (locale aliases?) that are not known on cygwin
> either (maybe not harmful):

That's something for the far future, I guess.

Thanks for your input,

Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
