This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]


Rafal,

Just to touch base on this, what is the best way forward? Did you get
any input/feedback on your questions below? Are you expecting input from
anyone but myself?

On the blocking issue #2: I really don’t see the connection to the uk_UA
locale that has its transliteration table inline and is explicitly
excluded from my patch. It may be revealing another issue you have with
glibc but wouldn’t that be better addressed in a new bug?
Again, in the v10 of my patch I have removed multicharacter source
graphemes, so that issue is moot there.

If you’d like to overhaul the glibc translit system, wouldn’t it be
better to commit the simple text file with the Cyrillic
translit (transcription) table first, fix the bug from the year 2006, and
then proceed from there with all due diligence?

The same goes for having both System A and System B. Initially I went along
with the suggestion to include System A, but it is clear now that it
doesn’t make fixing [BZ #2872] more straightforward. So I’d also propose
to set it aside for the moment and use the v10 without System A.
That is the whole reason I submitted it, to be super clear on that.

Now that you have seen that uconv transcribes «ХА» as KHA (cap/cap/cap),
that should mitigate your concern about that issue too (somewhat, anyway).
Making it context based would also mean adding new code, see above.
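
To illustrate what "context based" means for the casing of a digraph, here
is a minimal Python sketch. It is not glibc code and not how uconv is
implemented; the table excerpt, the function name and the neighbour rule
(all caps when an adjacent letter is uppercase, otherwise only the first
letter capitalised) are assumptions made for illustration.

```python
# Hypothetical excerpt of a lowercase Cyrillic -> ASCII table.
TABLE = {"х": "kh", "а": "a", "ш": "sh", "е": "e", "м": "m", "с": "s"}


def translit_context_case(text):
    """Transliterate text, choosing the case of each output from its neighbours."""
    out = []
    for i, ch in enumerate(text):
        latin = TABLE.get(ch.lower())
        if latin is None:
            # Unknown character: pass it through unchanged.
            out.append(ch)
            continue
        if ch.isupper():
            nxt = text[i + 1] if i + 1 < len(text) else ""
            prev = text[i - 1] if i > 0 else ""
            # Assumed rule: inside an all-caps run the whole output is upper
            # case; otherwise only its first letter is capitalised.
            if nxt.isupper() or (not nxt.isalpha() and prev.isupper()):
                latin = latin.upper()
            else:
                latin = latin.capitalize()
        out.append(latin)
    return "".join(out)


print(translit_context_case("ХА ха"))
print(translit_context_case("Ха"))
```

A plain per-character table cannot make this choice, because the output for
«Х» depends on the letters around it.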

Let me know if there’s anything I can do to help move the decision
forward.

Bests,
Egor


On 16.11.18 23:17, Rafal Luzynski wrote:

> 2. I ran a few tests on the command line and it seems to me that the
> transliteration from "З" to "Z" (+ lowercase as well) in uk_UA does
> not work and has not been working for some time already because I've
> checked some older systems as well and the result is always the same.
> I think that the reason is that uk_UA defines multiple 
> transliteration rules for "З" depending on what is the letter
> following it.  It does not seem to work.  AFAIK the reason is that
> the syntax of transliteration rules says that a single non-Latin
> character may map to one or more Latin strings, each consisting of one
> or more characters. There cannot be a rule transliterating multiple
> source characters into one or multiple destination characters.  Is it
> a bug in transliteration implementation?  Or maybe in the
> specification, including POSIX standard?
> The definition of transliteration says that it is one-to-one mapping 
> of graphemes while a grapheme may be one or multiple characters. It
> does not have to be always mapping one-to-one character.  Should we 
> fix this bug first, make uk_UA transliteration work, and only then 
> add a generic Cyrillic transliteration?  Egor's patch already
> contains transliteration of "У" + combining acute accent to "Ú" which
> most probably will not work.
> 
> I still think that in the longer term all existing custom
> transliterations of Cyrillic alphabets should be ported to a
> modification of your patch.
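
The difference between the two rule shapes discussed in the quote can be
sketched in Python. This is only a model, not the glibc implementation: the
rule table is hypothetical, and the "зг" -> "zgh" entry is borrowed from the
Ukrainian national romanization as an example of a rule that depends on the
following letter.

```python
# Hypothetical rules; a key may span more than one source character.
RULES = {
    "зг": "zgh",      # multi-character source: depends on the letter after "з"
    "з": "z",
    "г": "h",
    "о": "o",
    "д": "d",
    "а": "a",
    "У\u0301": "Ú",   # "У" followed by a combining acute accent
}
MAX_KEY = max(len(key) for key in RULES)


def translit_per_char(text):
    """One source character at a time: multi-character keys can never match."""
    return "".join(RULES.get(ch, ch) for ch in text)


def translit_longest_match(text):
    """Try the longest source sequence first, then shorter ones."""
    out = []
    i = 0
    while i < len(text):
        for n in range(min(MAX_KEY, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += n
                break
        else:
            # No rule matched: pass the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(translit_per_char("згода"))
print(translit_longest_match("згода"))
```

With a per-character lookup the "зг" rule is dead weight, which matches the
behaviour described for uk_UA; supporting it needs a matcher that looks at
more than one source character.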

On 01.12.18 23:07, Rafal Luzynski wrote:
> 19.11.2018 08:13 Marko Myllynen <myllynen@redhat.com> wrote:
>> [...]
>> Given the number of questions above I think the way forward is to try
>> to follow the relevant standards as closely as possible and also check what
>> the other implementations (i.e., uconv(1)) do. For example, checking the
>> earlier mentioned case may or may not give some hints:
>>
>> $ echo Шема  | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
>> Šema
>> $ echo Схема | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
>> Shema
>> $ uconv -V
>> uconv v2.1  ICU 50.1.2
> 
> I've played a little with uconv and unfortunately it does not look good
> to me.
> 
> It does not have any fallback transliteration to plain ASCII.  When it says
> that 'Ш' is transliterated to 'Š' then it always uses 'Š' and if the target
> charset does not have this character then it fails with an error:
> 
> $ echo Шема  | uconv -f UTF-8 -t ASCII -x cyrillic-latin
> Conversion from Unicode to codepage failed at output byte position 0.
> Unicode: 0160 Error: Invalid character found
> $ echo Шема  | uconv -f UTF-8 -t ISO-8859-1 -x cyrillic-latin
> Conversion from Unicode to codepage failed at output byte position 0.
> Unicode: 0160 Error: Invalid character found
> $ echo Шема  | uconv -f UTF-8 -t ISO-8859-2 -x cyrillic-latin
> �ema
> $ echo Шема  | uconv -f UTF-8 -t ISO-8859-2 -x cyrillic-latin | uconv -f ISO-8859-2 -t UTF-8
> Šema
> 
> It seems to follow ISO 9 (GOST 7.79) System A.  However, the transliteration
> of the hard sign is rather strange:
> 
> $ echo нъе  | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
> nʺe
> 
> The above was correct but:
> 
> $ echo НЪЕ  | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin          
> Nʺ̱E
> $ echo Ъ  | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
> ʺ̱
> $ echo Ъ  | uconv -f UTF-8 -t UTF-16 -x cyrillic-latin| hexdump -x
> 0000000    feff    02ba    0331    000a                                
> 0000008
> 
> So this generates:
> 02BA  MODIFIER LETTER DOUBLE PRIME
> 0331  COMBINING MACRON BELOW
> 
> There are more transliteration methods, for example Russian-Latin/BGN:
> 
> $ echo Шема  | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
> Shema
> $ echo Схема  | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
> Skhema
> 
> Converting 'х' to 'kh' seems to be common in English transliteration but
> it does not follow any ISO standard.
> 
> $ echo ХА ха | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
> KHA kha
> 
> This means that the choice whether a digraph in the output should be
> all uppercase or maybe upper+lower is context based, something which we
> probably cannot implement.  But definitely a good thing.
> 
> Two more tests:
> 
> $ echo Ещё | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
> Yeshchë
> $ echo Ещё | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
> Conversion from Unicode to codepage failed at output byte position 6.
> Unicode: 00eb Error: Invalid character found
> 
> So the output is not plain ASCII.
> 
> $ echo е же ле не | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
> ye zhe le ne
> 
> Again this means that transliteration of 'е' is context based:
> it is 'ye' in the beginning of a word and 'e' otherwise.
> 
> The version which I've tested:
> 
> $ uconv -V
> uconv v2.1  ICU 60.2
> 
> It seems that uconv will not be a good hint about transliterating
> to plain ASCII.
> 
> Also, the difference between uconv and iconv is that we can provide
> multiple transliterations for any source character but we can't group
> them into standards so we can't tell iconv to use this or another
> system.  It will just choose the best fitting the current output
> character set and the only thing we can choose is the locale.
> 
> This makes me think: should we add a locale like ru_RU@SystemA or
> ru_RU@SystemB?
> 
> Regards,
> 
> Rafal
> 
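
The selection described above (several candidate transliterations per source
character, with the first one that fits the output character set being
used) can be modelled roughly as follows. This is a Python sketch under
stated assumptions, not the iconv implementation: the candidate lists, the
function names and the "?" placeholder are hypothetical.

```python
# Hypothetical candidate lists, most preferred first.
CANDIDATES = {
    "Ш": ["Š", "Sh"],
    "ш": ["š", "sh"],
    "е": ["e"],
    "м": ["m"],
    "а": ["a"],
}


def encodable(s, charset):
    """Return True if s can be represented in the given Python codec."""
    try:
        s.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def translit_for_charset(text, charset):
    """Keep encodable characters; otherwise use the first candidate that fits."""
    out = []
    for ch in text:
        if encodable(ch, charset):
            out.append(ch)
            continue
        for cand in CANDIDATES.get(ch, []):
            if encodable(cand, charset):
                out.append(cand)
                break
        else:
            out.append("?")  # assumed placeholder when nothing fits
    return "".join(out)


for charset in ("ascii", "iso8859-1", "iso8859-2"):
    print(charset, translit_for_charset("Шема", charset))
```

In this model the caller never names a system such as System A or System B;
the output character set alone decides which candidate wins, which is the
limitation the question about ru_RU@SystemA and ru_RU@SystemB is aimed at.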

