[PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]

Mon Nov 19 07:14:00 GMT 2018

Hi,

On 17/11/2018 20.34, Egor Kobylkin wrote:
> 
> Looks like we have three issues:
> 1. lack of explicit control which transformation to use (System A or
> System B) via //TRANSLIT
> 2. possibility of collision for System B if used CAP/low transcription
> for capital letters
> 3. Cyrillic 'Х'/'х' (ha) never transcribes to 'H'/'h' as it should per
> System B because its equivalent 'X'/'x' from System A is always present
> and takes precedence.
> 
> As a solution shouldn't we only keep System B in a new file
> transcribe_cyrillic and put it in place as the explicit ASCII
> transcription for targeted locales (as opposed to transliteration)?
> 
> We would keep System A as translit_cyrillic but won't include it into
> this patch. Once you have resolved an issue of having two conflicting
> rule-sets but only one key //TRANSLIT you could add the System A back.
> 
> The SH/Sh can be decided on either way - seems like an easy change any way.
> 
> I have a question then: isn't this more like a hack than a right thing
> to do?
> 
> Shouldn't we have two explicit rules for transcription and
> transliteration not dependent on a destination character set?
> 
> This would contradict ISO 9:1995 (System A).
> System A was added on Marko's request (so setting him on TO:) I am
> neutral on keeping it or dropping it, just to be clear.
> 
> This particular rule with h/x would make sense on its own.
> But again - it would contradict the standards.
> On the other hand, for my personal needs I care less about standards but
> about current functionality and data loss because of missing
> transcription altogether due to the BZ #2872.

Given the number of open questions above, I think the way forward is to
follow the relevant standards as closely as possible and also check what
other implementations (e.g., uconv(1)) do. For example, checking the
case mentioned earlier may or may not give some hints:

$ echo Шема | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
Šema
$ echo Схема | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
Shema
$ uconv -V
uconv v2.1  ICU 50.1.2
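
To make the collision concern concrete, here is a minimal Python sketch.
The two tables are hypothetical and deliberately tiny: they are not the
glibc locale data from the patch and not ICU's rule set. They only cover
the letters in the two test words. DIGRAPH_ASCII assumes an ASCII-only
mapping where Ш becomes "Sh"; DIACRITIC assumes the same mapping except
that Ш becomes "Š", as in the uconv output above.

```python
# Hypothetical, minimal tables for illustration only.
# They are NOT the tables from the patch, glibc, or ICU.
DIGRAPH_ASCII = {
    "Ш": "Sh", "С": "S", "х": "h",
    "е": "e", "м": "m", "а": "a",
}

# Same as above, but Ш maps to a single letter with a diacritic.
DIACRITIC = {**DIGRAPH_ASCII, "Ш": "Š"}


def transliterate(text, table):
    """Map each character through the table; pass unknown ones through."""
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    for word in ("Шема", "Схема"):
        print(word,
              transliterate(word, DIGRAPH_ASCII),
              transliterate(word, DIACRITIC))
```

With the ASCII digraph table both words come out identical, so the
mapping is not reversible for this pair, while the diacritic table keeps
them distinct at the cost of leaving ASCII.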

Thanks,

-- 
Marko Myllynen