This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
- From: Rafal Luzynski <digitalfreak at lingonborough dot com>
- To: Marko Myllynen <myllynen at redhat dot com>, Egor Kobylkin <egor at kobylkin dot com>, libc-alpha at sourceware dot org, libc-locales at sourceware dot org
- Date: Sat, 1 Dec 2018 23:07:19 +0100 (CET)
- Subject: Re: [PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
- References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com> <20180412224352.GB2911@altlinux.org> <b82fe65b-b880-a2b5-c97d-2a6aae9c1165@kobylkin.com> <837001401.21346.1542406647888@poczta.nazwa.pl> <bef63562-09d1-3306-aae9-20002ccf4130@kobylkin.com> <5a247161-c498-ed50-ff4a-58f2ecf974f0@redhat.com>
19.11.2018 08:13 Marko Myllynen <myllynen@redhat.com> wrote:
> [...]
> Given the number of questions above I think the way forward is to try to
> follow the relevant standards as closely as possible and also check what
> the other implementations (i.e., uconv(1)) do. For example, checking the
> earlier mentioned case may or may not give some hints:
>
> $ echo Шема | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
> Šema
> $ echo Схема | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
> Shema
> $ uconv -V
> uconv v2.1 ICU 50.1.2
I've played a little with uconv and unfortunately it does not look good
to me.
It does not have any fallback transliteration to plain ASCII. When it says
that 'Ш' is transliterated to 'Š', it always uses 'Š', and if the target
charset does not have this character, the conversion fails with an error:
$ echo Шема | uconv -f UTF-8 -t ASCII -x cyrillic-latin
Conversion from Unicode to codepage failed at output byte position 0.
Unicode: 0160 Error: Invalid character found
$ echo Шема | uconv -f UTF-8 -t ISO-8859-1 -x cyrillic-latin
Conversion from Unicode to codepage failed at output byte position 0.
Unicode: 0160 Error: Invalid character found
$ echo Шема | uconv -f UTF-8 -t ISO-8859-2 -x cyrillic-latin
�ema
$ echo Шема | uconv -f UTF-8 -t ISO-8859-2 -x cyrillic-latin | uconv -f ISO-8859-2 -t UTF-8
Šema
It seems to follow ISO 9 (GOST 7.79) System A. However, the transliteration
of the hard sign is rather strange:
$ echo нъе | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
nʺe
The above was correct but:
$ echo НЪЕ | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
Nʺ̱E
$ echo Ъ | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
ʺ̱
$ echo Ъ | uconv -f UTF-8 -t UTF-16 -x cyrillic-latin| hexdump -x
0000000 feff 02ba 0331 000a
0000008
So this generates:
02BA MODIFIER LETTER DOUBLE PRIME
0331 COMBINING MACRON BELOW
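The code points in the hexdump can be looked up mechanically. This is a minimal sketch, assuming the 16-bit words are read exactly as hexdump printed them and that all of them are in the Basic Multilingual Plane (no surrogate pairs); the helper name describe is made up for this example.

```python
import unicodedata

# The 16-bit words printed by hexdump above, copied by hand.
words = [0xFEFF, 0x02BA, 0x0331, 0x000A]

def describe(code_units):
    """Return (code point, Unicode name) pairs for BMP code units."""
    return [
        # Control characters such as U+000A have no name, so use a placeholder.
        (f"U+{u:04X}", unicodedata.name(chr(u), "<control>"))
        for u in code_units
    ]

for code_point, name in describe(words):
    print(code_point, name)
```

The first word is the byte order mark and the last one is the newline added by echo, so only the two middle entries come from the transliteration of 'Ъ'.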
There are more transliteration methods, for example Russian-Latin/BGN:
$ echo Шема | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
Shema
$ echo Схема | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
Skhema
Converting 'х' to 'kh' seems to be common in English transliteration but
it does not follow any ISO standard.
$ echo ХА ха | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
KHA kha
This means that the choice of whether a digraph in the output should be
all uppercase or upper+lower is context-based, something which we
probably cannot implement. But it is definitely a good thing.
Two more tests:
$ echo Ещё | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
Yeshchë
$ echo Ещё | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
Conversion from Unicode to codepage failed at output byte position 6.
Unicode: 00eb Error: Invalid character found
So the output is not plain ASCII.
$ echo е же ле не | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
ye zhe le ne
Again this means that the transliteration of 'е' is context-based:
it is 'ye' at the beginning of a word and 'e' otherwise.
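The rule observed in this test can be sketched as below. This covers only what the test above shows (word-initial 'е' versus 'е' after another letter) and only lowercase input; the table PLAIN and the function name are hypothetical, and the real Russian-Latin/BGN rules may distinguish more contexts.

```python
# Hypothetical table fragment covering only the letters used in the test above.
PLAIN = {"ж": "zh", "л": "l", "н": "n"}

def translit_e(text):
    """Map 'е' to 'ye' at the start of a word and to 'e' after a letter."""
    out = []
    prev_is_letter = False
    for ch in text:
        if ch == "е":
            out.append("e" if prev_is_letter else "ye")
        else:
            out.append(PLAIN.get(ch, ch))   # unknown characters pass through
        prev_is_letter = ch.isalpha()
    return "".join(out)

print(translit_e("е же ле не"))
```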
The version which I've tested:
$ uconv -V
uconv v2.1 ICU 60.2
It seems that uconv will not give us a good hint about transliterating
to plain ASCII.
Also, the difference between uconv and iconv is that with iconv we can
provide multiple transliterations for any source character, but we can't
group them into standards, so we can't tell iconv to use one system or
another. It will just choose the one best fitting the current output
character set, and the only thing we can choose is the locale.
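A toy model of that selection is sketched below. It is not glibc code: the table CANDIDATES, the "first candidate that fits" rule and the "?" placeholder are assumptions made only to illustrate how one table can yield different output for different target charsets.

```python
# Hypothetical candidates for each source character, in order of preference.
CANDIDATES = {
    "Ш": ["Š", "Sh"],
    "е": ["e"],
    "м": ["m"],
    "а": ["a"],
}

def encodable(s, charset):
    """Return True if s can be represented in the target charset."""
    try:
        s.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

def translit(text, charset):
    out = []
    for ch in text:
        if encodable(ch, charset):
            out.append(ch)              # no transliteration needed
            continue
        for candidate in CANDIDATES.get(ch, []):
            if encodable(candidate, charset):
                out.append(candidate)   # first candidate that fits wins
                break
        else:
            out.append("?")             # nothing fits (placeholder assumed here)
    return "".join(out)

print(translit("Шема", "iso8859-2"))
print(translit("Шема", "ascii"))
```

Note that the caller only picks the target charset. There is no parameter for selecting a transliteration system, which is the limitation described above.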
This makes me think: should we add a locale like ru_RU@SystemA or
ru_RU@SystemB?
Regards,
Rafal