This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]


19.11.2018 08:13 Marko Myllynen <myllynen@redhat.com> wrote:
> [...]
> Given the number of questions above I think the way forward is to try to
> follow the relevant standards as closely as possible and also check what
> the other implementations (i.e., uconv(1)) do. For example, checking the
> earlier mentioned case may or may not give some hints:
> 
> $ echo Шема  | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
> Šema
> $ echo Схема | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
> Shema
> $ uconv -V
> uconv v2.1  ICU 50.1.2

I've played a little with uconv and unfortunately it does not look good
to me.

It does not have any fallback transliteration to plain ASCII.  When it says
that 'Ш' is transliterated to 'Š' then it always uses 'Š', and if the target
charset does not have this character the conversion fails with an error:

$ echo Шема  | uconv -f UTF-8 -t ASCII -x cyrillic-latin
Conversion from Unicode to codepage failed at output byte position 0.
Unicode: 0160 Error: Invalid character found
$ echo Шема  | uconv -f UTF-8 -t ISO-8859-1 -x cyrillic-latin
Conversion from Unicode to codepage failed at output byte position 0.
Unicode: 0160 Error: Invalid character found
$ echo Шема  | uconv -f UTF-8 -t ISO-8859-2 -x cyrillic-latin
�ema
$ echo Шема  | uconv -f UTF-8 -t ISO-8859-2 -x cyrillic-latin | uconv -f ISO-8859-2 -t UTF-8
Šema
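
To illustrate why the three target charsets behave differently, here is a
small Python sketch.  It does not call uconv at all; it only uses Python's
own codecs (an assumption that they match the charsets above) to check
whether 'Š' (U+0160) is representable, and shows why the raw ISO-8859-2
output looks broken on a UTF-8 terminal.

```python
def can_encode(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

word = "\u0160ema"  # "Šema", the cyrillic-latin result for "Шема"

# Which of the tested target charsets can hold U+0160 at all?
results = {cs: can_encode(word, cs) for cs in ("ascii", "iso8859_1", "iso8859_2")}

# ISO-8859-2 stores 'Š' as the single byte 0xA9.  That byte alone is not
# valid UTF-8, so a UTF-8 terminal shows a replacement character instead.
raw = word.encode("iso8859_2")
shown = raw.decode("utf-8", errors="replace")

print(results)
print(raw, shown)
```

Only ISO-8859-2 contains the character, which matches the two failures and
the one garbled-looking success above.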

It seems to follow ISO 9 (GOST 7.79) System A.  However, the transliteration
of the hard sign is rather strange:

$ echo нъе  | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
nʺe

The above is correct, but the uppercase hard sign gets an extra combining
character:

$ echo НЪЕ  | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
Nʺ̱E
$ echo Ъ  | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
ʺ̱
$ echo Ъ  | uconv -f UTF-8 -t UTF-16 -x cyrillic-latin| hexdump -x
0000000    feff    02ba    0331    000a                                
0000008

So this generates:
02BA  MODIFIER LETTER DOUBLE PRIME
0331  COMBINING MACRON BELOW
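
The names can be double-checked without ICU.  A minimal Python sketch,
assuming the four 16-bit words from the hexdump above are typed in by hand
(the variable names are mine):

```python
import unicodedata

# The 16-bit words printed by hexdump -x for the UTF-16 output.
words = [0xFEFF, 0x02BA, 0x0331, 0x000A]

# Look up the Unicode character name of each word; control characters
# such as the trailing newline have no name, so supply a default.
names = [unicodedata.name(chr(w), "<control>") for w in words]

for w, name in zip(words, names):
    print(f"{w:04X}  {name}")
```

The first word is the byte order mark and the last one is the newline added
by echo, so only the two middle code points come from the transliteration.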

There are more transliteration methods, for example Russian-Latin/BGN:

$ echo Шема  | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
Shema
$ echo Схема  | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
Skhema

Converting 'х' to 'kh' seems to be common in English transliteration but
it does not follow any ISO standard.

$ echo ХА ха | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
KHA kha

This means that the choice whether a digraph in the output should be
all uppercase or upper+lower depends on the context, something which we
probably cannot implement.  It is definitely a nice feature, though.
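
To make the context dependence concrete, here is a rough Python sketch of
one possible casing rule.  The rule (look at the neighbouring letters) and
the tiny tables are my assumptions for illustration only; this is not how
ICU implements Russian-Latin/BGN.

```python
# Hypothetical mini tables; a real table would cover the whole alphabet.
UPPER_DIGRAPHS = {"Х": "Kh", "Ш": "Sh", "Ж": "Zh"}
SIMPLE = {"А": "A", "а": "a", "Т": "T", "т": "t", "х": "kh"}

def translit(text):
    out = []
    for i, ch in enumerate(text):
        if ch in UPPER_DIGRAPHS:
            prev_ch = text[i - 1] if i > 0 else ""
            next_ch = text[i + 1] if i + 1 < len(text) else ""
            # Assumed rule: use an all-caps digraph when the next letter is
            # uppercase, or when the word ends here and the previous letter
            # is uppercase.  Otherwise keep the title-case digraph.
            if next_ch.isupper() or (not next_ch.isalpha() and prev_ch.isupper()):
                out.append(UPPER_DIGRAPHS[ch].upper())
            else:
                out.append(UPPER_DIGRAPHS[ch])
        else:
            # Letters without a table entry are passed through unchanged.
            out.append(SIMPLE.get(ch, ch))
    return "".join(out)

print(translit("ХА ха"))
print(translit("Хата"))
```

The point is that the result for 'Х' cannot be read from a table with one
fixed output per character; it needs a look at the surrounding letters.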

Two more tests:

$ echo Ещё | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
Yeshchë
$ echo Ещё | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
Conversion from Unicode to codepage failed at output byte position 6.
Unicode: 00eb Error: Invalid character found

So the output is not plain ASCII.

$ echo е же ле не | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
ye zhe le ne

Again this means that the transliteration of 'е' is context based:
it is 'ye' at the beginning of a word and 'e' otherwise.
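
A minimal Python sketch of just this word-initial rule, as observed in the
test above.  The table is a hypothetical fragment covering only the letters
of the test string, and the real BGN system may well have further contexts
that this sketch ignores.

```python
# Hypothetical fragment of a table, enough for the test string only.
TABLE = {"е": "e", "ж": "zh", "л": "l", "н": "n"}

def translit(text):
    out = []
    for i, ch in enumerate(text):
        # A letter is word-initial if nothing, or a non-letter, precedes it.
        at_word_start = i == 0 or not text[i - 1].isalpha()
        if ch == "е" and at_word_start:
            out.append("ye")
        else:
            # Characters without a table entry are passed through unchanged.
            out.append(TABLE.get(ch, ch))
    return "".join(out)

print(translit("е же ле не"))
```

Again the output for a single source character depends on its neighbours,
which a plain one-to-one table cannot express.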

The version which I've tested:

$ uconv -V
uconv v2.1  ICU 60.2

It seems that uconv will not be a good guide for transliterating
to plain ASCII.

Also, the difference between uconv and iconv is that in iconv we can
provide multiple transliterations for any source character, but we can't
group them into standards, so we can't tell iconv to use this or another
system.  It will just choose the candidate that best fits the current
output character set, and the only thing we can choose is the locale.
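
The selection can be modelled roughly like this in Python.  This is only a
sketch of the idea (try the candidates in order, take the first one the
target charset can represent); the table, the function name and the '?'
placeholder are my assumptions and not the actual glibc code.

```python
# Hypothetical ordered candidate lists, most preferred first.
CANDIDATES = {
    "Ш": ["Š", "Sh"],
    "ш": ["š", "sh"],
    "е": ["e"],
    "м": ["m"],
    "а": ["a"],
}

def translit(text, charset):
    out = []
    for ch in text:
        # The character itself comes first: no transliteration is needed
        # if the target charset already contains it.
        for cand in [ch] + CANDIDATES.get(ch, []):
            try:
                cand.encode(charset)
            except UnicodeEncodeError:
                continue  # not representable, try the next candidate
            out.append(cand)
            break
        else:
            out.append("?")  # assumed placeholder when nothing fits
    return "".join(out)

for cs in ("ascii", "iso8859_1", "iso8859_2", "utf-8"):
    print(cs, translit("Шема", cs))
```

Note that nothing here lets the caller ask for a particular system: with
ISO-8859-2 as the target the result is always 'Š', even if 'Sh' was wanted.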

This makes me think: should we add a locale like ru_RU@SystemA or
ru_RU@SystemB?

Regards,

Rafal

