charset changes

Thomas Wolff towo@towo.net
Fri Feb 5 16:29:00 GMT 2010


On 23.01.2010 12:05, Andy Koppe wrote:
> I'm in awe at Corinna's latest locale changes. Getting closer and
> closer to the real thing.
>    
Me too.
> A couple of points:
> ...
>    
And here are a couple of points of mine, after some checking:

I found the following inconsistencies. Since the agreed strategy 
seems to be to prefer Linux compatibility over the Windows mapping,
I think the first group in particular (a few incompatible mappings) 
should be fixed before the 1.7.2 release.


------------------------------------------------------------------------
These locales have inconsistent encodings:
Locale      Linux        Cygwin
et_EE       ISO-8859-1   ISO-8859-15
ja_JP.sjis  SHIFT_JIS    CP932
ka_GE       GEORGIAN-PS  UTF-8
kk_KZ       PT154        ISO-8859-5
sr_CS       ISO-8859-5   UTF-8
uz_UZ       ISO-8859-1   UTF-8
zh_CN       GB2312       GBK
zh_HK       BIG5-HKSCS   BIG5
zh_SG       GB2312       GBK

Notes:
- SHIFT_JIS -> CP932 has been discussed extensively and I think it's OK
- GB2312 -> GBK is basically a superset, should be OK too
- zh_HK is the dedicated Hong Kong locale, so it should use the Hong Kong 
extension (BIG5-HKSCS)
- With respect to the other differences above, Linux has these two 
distinct locales:
         et_EE.iso885915 ISO-8859-15
         uz_UZ@cyrillic  UTF-8
- getlocale -a lists the following twice, without indicating a difference:
         sr_SP
         sr_BA
         az_AZ
         se_FI
         uz_UZ (see above)
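
The superset claim for GB2312 -> GBK can be spot-checked with a small sketch. It assumes that Python's built-in `gb2312` and `gbk` codecs are a fair stand-in for the Linux and Cygwin/Windows tables, which has not been verified against either system, and the sample strings are only illustrative.

```python
def same_bytes(text, narrow="gb2312", wide="gbk"):
    """Return True if text encodes to identical bytes in both codecs."""
    return text.encode(narrow) == text.encode(wide)


def encodable(text, codec):
    """Return True if every character of text exists in the codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Simplified Chinese text that GB2312 covers should keep its bytes under GBK.
print(same_bytes("中文"))

# A traditional character: expected to be missing from GB2312 but present in GBK.
print(encodable("龍", "gb2312"), encodable("龍", "gbk"))
```

A spot check like this only covers the sampled characters; it does not prove that every GB2312 code point maps identically in GBK.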


------------------------------------------------------------------------
Also, some generic encoding suffixes are not handled:
- .iso885915 and .iso8859-15 (Cygwin only recognizes .iso-8859-15 and 
its uppercase form .ISO-8859-15)
- .koi8r (Cygwin only recognizes .koi8-r and .KOI8-R)
- .koi8u (Cygwin only recognizes .koi8-u and .KOI8-U)
- .tcvn (in vi_VN.tcvn)
- .gb18030 (in zh_CN.gb18030)
- .eucjp (in ja_JP.eucjp)
- .euctw (in zh_TW.euctw)
   (Maybe the latter lack Windows support or depend on Windows 
configuration...)
- .koi8t
- .armscii8
- .big5hkscs
- .gb2312
- .georgianps
- .pt154
- .ujis (-> EUC-JP)


------------------------------------------------------------------------
These locales are not known or handled on cygwin at all:
aa_DJ   ISO-8859-1
aa_ER   UTF-8
aa_ET   UTF-8
am_ET   UTF-8
an_ES   ISO-8859-15
ar_IN   UTF-8
ar_SD   ISO-8859-6
ast_ES  ISO-8859-15
ber_DZ  UTF-8
ber_MA  UTF-8
bn_BD   UTF-8
bo_CN   UTF-8
bo_IN   UTF-8
br_FR   ISO-8859-1
byn_ER  UTF-8
ca_AD   ISO-8859-15
ca_FR   ISO-8859-15
ca_IT   ISO-8859-15
crh_UA  UTF-8
csb_PL  UTF-8
de_BE   ISO-8859-1
dz_BT   UTF-8
el_CY   ISO-8859-7
en_AG   UTF-8
en_BE   ISO-8859-1
en_BW   ISO-8859-1
en_DK   ISO-8859-1
en_HK   ISO-8859-1
en_IN   UTF-8
en_NG   UTF-8
en_SG   ISO-8859-1
es_US   ISO-8859-1
fur_IT  UTF-8
fy_DE   UTF-8
ga_IE   ISO-8859-1
gd_GB   ISO-8859-15
gez_ER  UTF-8
gez_ET  UTF-8
gv_GB   ISO-8859-1
ha_NG   UTF-8
hne_IN  UTF-8
hsb_DE  ISO-8859-2
ht_HT   UTF-8
ig_NG   UTF-8
ik_CA   UTF-8
iu_CA   UTF-8
iw_IL   ISO-8859-8
kl_GL   ISO-8859-1
km_KH   UTF-8
ks_IN   UTF-8
ku_TR   ISO-8859-9
kw_GB   ISO-8859-1
lg_UG   ISO-8859-10
li_BE   UTF-8
li_NL   UTF-8
lo_LA   UTF-8
mai_IN  UTF-8
mg_MG   ISO-8859-15
nds_DE  UTF-8
nds_NL  UTF-8
ne_NP   UTF-8
nl_AW   UTF-8
no_NO   ISO-8859-1
nr_ZA   UTF-8
nso_ZA  UTF-8
oc_FR   ISO-8859-1
om_ET   UTF-8
om_KE   ISO-8859-1
or_IN   UTF-8
pap_AN  UTF-8
pa_PK   UTF-8
ru_UA   KOI8-U
rw_RW   UTF-8
sc_IT   UTF-8
sd_IN   UTF-8
shs_CA  UTF-8
sh_YU   ISO-8859-2
sid_ET  UTF-8
si_LK   UTF-8
so_DJ   ISO-8859-1
so_ET   UTF-8
so_KE   ISO-8859-1
so_SO   ISO-8859-1
ss_ZA   UTF-8
st_ZA   ISO-8859-1
tg_TJ   KOI8-T
ti_ER   UTF-8
ti_ET   UTF-8
tig_ER  UTF-8
tk_TM   UTF-8
tl_PH   ISO-8859-1
tr_CY   ISO-8859-9
ts_ZA   UTF-8
ug_CN   UTF-8
ve_ZA   UTF-8
wa_BE   ISO-8859-1
wo_SN   UTF-8
yi_US   CP1255
yo_NG   UTF-8
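
A list like this can be re-checked on any given system by asking the C library directly. The sketch below assumes a POSIX-like Python where `locale.nl_langinfo` is available; `probe_locale` is a hypothetical helper name, and the result depends entirely on which locales the system actually provides.

```python
import locale


def probe_locale(name):
    """Return the codeset reported for a locale name, or None if it is rejected."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return locale.nl_langinfo(locale.CODESET)
    except locale.Error:
        return None
    finally:
        # Always restore the previous LC_CTYPE setting.
        locale.setlocale(locale.LC_CTYPE, saved)


# A few names from the list above; output varies by system.
for name in ("aa_DJ", "lg_UG", "tg_TJ", "yi_US"):
    print(name, probe_locale(name))
```

Note that some C libraries accept arbitrary locale names, so a non-None result only shows that the name was not rejected.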


------------------------------------------------------------------------
And finally, some systems (e.g. Fedora) maintain a number of full-word 
locale names (locale aliases?) that are not known on Cygwin either (which 
may be harmless):
(Note: on those systems, the non-ASCII letters in some of these locale 
names are encoded as 8-bit Latin-1.)
bokmal  ISO-8859-1
bokmål  ISO-8859-1
catalan ISO-8859-1
croatian        ISO-8859-2
czech   ISO-8859-2
danish  ISO-8859-1
dansk   ISO-8859-1
deutsch ISO-8859-1
dutch   ISO-8859-1
eesti   ISO-8859-1
estonian        ISO-8859-1
finnish ISO-8859-1
français        ISO-8859-1
french  ISO-8859-1
galego  ISO-8859-1
galician        ISO-8859-1
german  ISO-8859-1
greek   ISO-8859-7
hebrew  ISO-8859-8
hrvatski        ISO-8859-2
hungarian       ISO-8859-2
icelandic       ISO-8859-1
italian ISO-8859-1
japanese        EUC-JP
korean  EUC-KR
lithuanian      ISO-8859-13
norwegian       ISO-8859-1
nynorsk ISO-8859-1
polish  ISO-8859-2
portuguese      ISO-8859-1
romanian        ISO-8859-2
russian ISO-8859-5
slovak  ISO-8859-2
slovene ISO-8859-2
slovenian       ISO-8859-2
spanish ISO-8859-1
swedish ISO-8859-1
thai    TIS-620
turkish ISO-8859-9


------------------------------------------------------------------------
Thomas


