[PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29

Egor Kobylkin egor@kobylkin.com
Wed Oct 10 12:20:00 GMT 2018


On 10.10.2018 13:22, Marko Myllynen wrote:
>> correct link https://sourceware.org/bugzilla/attachment.cgi?id=11303
> 
> Although I haven't checked every rule this in general looks very good
> (but see below). 


> Not sure do we want to add the few missing characters
> mentioned at https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode,
> e.g., one instantly notices that U+0400 is missing. (I wouldn't add at
> least initially the more exotic characters, like the historic ones,
> though.) Perhaps filing a bug or two for these cases for separate
> consideration would be ok.

The question here is what should serve as their transliteration and
transcription. They are covered neither by ISO9 nor by GOST 7.79, so
maybe it would be reasonable to assume that they do not occur anywhere
in notable numbers?
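To make the gap concrete, here is a small Python sketch that lists which
code points in U+0400..U+045F are absent from the "CYRILLIC COMPLETE"
listing in this message. The covered ranges were read off that listing by
hand, and the helper name missing_in_block is only illustrative, so treat
both as assumptions rather than as part of the patch.

```python
import unicodedata

# Code points in U+0400..U+045F that appear in the "CYRILLIC COMPLETE"
# listing of this message (read off by hand, so this set is an assumption).
covered = (
    set(range(0x0401, 0x040D))    # U+0401..U+040C
    | set(range(0x040E, 0x0450))  # U+040E..U+044F
    | set(range(0x0451, 0x045D))  # U+0451..U+045C
    | {0x045E, 0x045F}
)

def missing_in_block(covered, first=0x0400, last=0x045F):
    """Return (code point, name) pairs that are assigned but not covered."""
    gaps = []
    for cp in range(first, last + 1):
        name = unicodedata.name(chr(cp), None)
        if name is not None and cp not in covered:
            gaps.append((cp, name))
    return gaps

for cp, name in missing_in_block(covered):
    print(f"U+{cp:04X} {name}")
```

Under these assumptions the gaps are U+0400, U+040D, U+0450 and U+045D,
which fits the remark that U+0400 is missing.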

Anyway, I am happy to include your specific suggestions for any
Unicode quartets in this form:
[Cyrillic Unicode
; ISO9 Latin Transliteration (System A) as Unicode
; Transcription (System B) as (multicharacter) ASCII
; name to put in %COMMENT
].
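As an illustration of how such a quartet could be turned into a rule, here
is a minimal Python sketch. The helpers ucs and format_rule are
hypothetical (they are not part of my worksheet or of glibc), and quoting
only multicharacter replacements is an assumption based on the rules
quoted in this thread.

```python
def ucs(text):
    """Render each character as a <UXXXX> symbol, as in the rules quoted here."""
    return "".join(f"<U{ord(ch):04X}>" for ch in text)

def quoted_if_multi(text):
    """Assumption: replacements longer than one character are double-quoted."""
    symbols = ucs(text)
    return f'"{symbols}"' if len(text) > 1 else symbols

def format_rule(cyrillic, iso9_latin, ascii_transcription, comment):
    """Build a comment line plus a rule line from one quartet (hypothetical)."""
    return (
        f"% {comment}\n"
        f"{ucs(cyrillic)} "
        f"{quoted_if_multi(iso9_latin)};{quoted_if_multi(ascii_transcription)}"
    )

# The quartet behind the U-with-acute rule discussed in this thread.
print(format_rule("\u0423\u0301", "\u00DA", "U`", "CYRILLIC UNDEFINED"))
```

For that quartet the sketch prints the same two lines as the rule for
<U0423><U0301> quoted in this thread.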


> 
>> On 10.10.2018 00:40, Egor Kobylkin wrote:
>>> On 10.10.2018 00:17, Rafal Luzynski wrote:
>>>> 9.10.2018 20:34 Egor Kobylkin <egor@kobylkin.com> wrote:
>>>>>
>>>>> The culprits were the "" around the "<U0423><U0301>" (<U00DA>) and
>>>>> "<U0443><U0301>" (<U00FA>).
>>>>> It works now with
>>>>> % CYRILLIC UNDEFINED
>>>>> <U0423><U0301> <U00DA>;"<U0055><U0060>"
>>>>> % CYRILLIC UNDEFINED
>>>>> <U0443><U0301> <U00FA>;"<U0075><U0060>"
>>>>>
>>>>> [...]
>>>>
>>>> I wonder why you need Cyrillic U with acute, and why you comment it
>>>> as "undefined" at all.  I know that any Cyrillic vowel may appear with
>>>> an acute accent but "the diacritic is used only in dictionaries, children's
>>>> books, resources for foreign-language learners (...)". [1]  So maybe
>>>> all vowels with an acute accent should be handled (which I think is fine)
>>>> rather than just U.
>>>
>>> I have just taken the https://en.wikipedia.org/wiki/ISO_9 table and
>>> implemented it on Marko's suggestion. Personally I have no opinion on
>>> what letters should be included and under what name. These funny Us just
>>> happened to be in the ISO9 table.
>>>
>>> There is no codepoint and no name for <U0423><U0301> and <U0443><U0301>
>>> in Unicode. That’s why its coming through that way from my worksheet as
>>> it does a reverse lookup on the names based on the Unicode codepoints.
>>>
>>> Manually we can change it to whatever you’d suggest in the
>>> translit_cyrillic. I just don’t know the right name.
> 
> I'm not sure this will work, no existing rule in translit_* files
> contain two characters, I'd assume that the rule for U+0423 is applied
> first and then the below rule is never used.
> 
> % CYRILLIC UNDEFINED
> <U0423><U0301> <U00DA>;"<U0055><U0060>"
> 
> Perhaps this should be commented out or removed altogether if it's not
> working as intended.

Here is the result of my test with
https://sourceware.org/bugzilla/attachment.cgi?id=11304

U0423 0301-У́  -> U0423 0301-U
U0443 0301-у́ -> U0443 0301-u

So yes, they are not processed. I would drop them so as not to have
special cases, but I am also fine with keeping them because all the work
is already done.
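To show why a two-code-point rule would never fire under per-character
lookup, here is a toy model in Python. It is not glibc's implementation:
the rule table holds only the entries needed here, the empty replacement
for U+0301 is an assumption chosen to match the observed output, and both
function names are made up for the illustration.

```python
# Toy rule table; only the entries needed for the demonstration.
RULES = {
    "\u0423": "U",         # CYRILLIC CAPITAL LETTER U
    "\u0423\u0301": "U`",  # the two-code-point rule under discussion
    "\u0301": "",          # assumption: the combining acute is dropped
}

def translit_per_code_point(text):
    """Look up each code point on its own; longer keys can never match."""
    return "".join(RULES.get(ch, "?") for ch in text)

def translit_longest_match(text):
    """Try the longest key at each position first (for comparison only)."""
    max_len = max(len(key) for key in RULES)
    out, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += n
                break
        else:
            # No rule matched at this position.
            out.append("?")
            i += 1
    return "".join(out)

sample = "\u0423\u0301"
print(translit_per_code_point(sample))  # U
print(translit_longest_match(sample))   # U`
```

The per-code-point variant reproduces what the test shows (a plain "U"),
while the longest-match variant is what would be needed for the
two-code-point rule to take effect.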

Result:
CYRILLIC RUSSIAN S``esh` eshhyo e`tih myagkih francuzskih bulok, da
vypej zhe chayu. SA`ESH` ESHHYO E`TIH MYAGKIH FRANCUZSKIH BULOK? DA
VYPEJ ZHE CHAYU!
CYRILLIC COMPLETE U0401-YO U0402-DJ U0403-G` U0404-Ye U0405-Z` U0406-I
U0407-Yi U0408-J U0409-L` U040A-N` U040B-TSH U040C-K` U040E-U` U040F-Dh
U0410-A U0411-B U0412-V U0413-G U0414-D U0415-E U0416-ZH U0417-Z U0418-I
U0419-J U041A-K U041B-L U041C-M U041D-N U041E-O U041F-P U0420-R U0421-S
U0422-T U0423-U U0423 0301-U U0424-F U0425-H U0426-C U0427-CH U0428-SH
U0429-SHH U042A-`` U042B-Y U042C-` U042D-E` U042E-YU U042F-YA U0430-a
U0431-b U0432-v U0433-g U0434-d U0435-e U0436-zh U0437-z U0438-i U0439-j
U043A-k U043B-l U043C-m U043D-n U043E-o U043F-p U0440-r U0441-s U0442-t
U0443-u U0443 0301-u U0444-f U0445-h U0446-c U0447-ch U0448-sh U0449-shh
U044A-A` U044B-y U044C-` U044D-e` U044E-yu U044F-ya U0451-yo U0452-dj
U0453-g` U0454-ye U0455-z` U0456-i U0457-yi U0458-j U0459-l` U045A-n`
U045B-tsh U045C-k` U045E-u` U045F-dh U046A-O` U046B-o` U0472-Fh U0473-fh
U0474-Yh U0475-yh U048C-E` U048D-e`  U0490-G` U0491-g` U0492-GH U0493-gh
U0494-GH U0495-gh U0496-ZH` U0497-zh` U049A-K` U049B-k` U049E-K`
U049F-k` U04A2-N` U04A3-n` U04A4-NG U04A5-ng U04A6-P` U04A7-p` U04A8-O`
U04A9-o` U04AA-C` U04AB-C` U04AC-T` U04AD-t` U04AE-U U04AF-u U04B2-H`
U04B3-h` U04B4-TCZ U04B5-tcz U04BA-SH` U04BB-SH` U04BC-CH` U04BD-ch`
U04BE-CH` U04BF-ch` U04C0-i U04C1-ZH` U04C2-zh` U04CB-CH` U04CC-ch`
U04D0-A` U04D1-a` U04D2-A` U04D3-a` U04D6-E` U04D7-e` U04D8-A` U04D9-a`
U04DC-ZH` U04DD-zh` U04DE-Z` U04DF-z` U04E0-Z` U04E1-z` U04E4-I`
U04E5-i` U04E6-O` U04E7-o` U04E8-O` U04E9-o` U04F0-U` U04F1-u` U04F2-U`
U04F3-u` U04F4-CH` U04F5-ch` U04F8-Y` U04F9-y` U2019-'

Source:
CYRILLIC RUSSIAN Съешь ещё этих мягких французских булок, да выпей же
чаю. СЪЕШЬ ЕЩЁ ЭТИХ МЯГКИХ ФРАНЦУЗСКИХ БУЛОК? ДА ВЫПЕЙ ЖЕ ЧАЮ!
CYRILLIC COMPLETE U0401-Ё U0402-Ђ U0403-Ѓ U0404-Є U0405-Ѕ U0406-І
U0407-Ї U0408-Ј U0409-Љ U040A-Њ U040B-Ћ U040C-Ќ U040E-Ў U040F-Џ U0410-А
U0411-Б U0412-В U0413-Г U0414-Д U0415-Е U0416-Ж U0417-З U0418-И U0419-Й
U041A-К U041B-Л U041C-М U041D-Н U041E-О U041F-П U0420-Р U0421-С U0422-Т
U0423-У U0423 0301-У́ U0424-Ф U0425-Х U0426-Ц U0427-Ч U0428-Ш U0429-Щ
U042A-ъ U042B-Ы U042C-ь U042D-Э U042E-Ю U042F-Я U0430-а U0431-б U0432-в
U0433-г U0434-д U0435-е U0436-ж U0437-з U0438-и U0439-й U043A-к U043B-л
U043C-м U043D-н U043E-о U043F-п U0440-р U0441-с U0442-т U0443-у U0443
0301-у́ U0444-ф U0445-х U0446-ц U0447-ч U0448-ш U0449-щ U044A-Ъ U044B-ы
U044C-Ь U044D-э U044E-ю U044F-я U0451-ё U0452-ђ U0453-ѓ U0454-є U0455-ѕ
U0456-і U0457-ї U0458-ј U0459-љ U045A-њ U045B-ћ U045C-ќ U045E-ў U045F-џ
U046A-Ѫ U046B-ѫ U0472-Ѳ U0473-ѳ U0474-Ѵ U0475-ѵ U048C-Ҍ U048D-ҍ  U0490-Ґ
U0491-ґ U0492-Ғ U0493-ғ U0494-Ҕ U0495-ҕ U0496-Җ U0497-җ U049A-Қ U049B-қ
U049E-Ҟ U049F-ҟ U04A2-Ң U04A3-ң U04A4-Ҥ U04A5-ҥ U04A6-Ҧ U04A7-ҧ U04A8-Ҩ
U04A9-ҩ U04AA-Ҫ U04AB-ҫ U04AC-Ҭ U04AD-ҭ U04AE-Ү U04AF-ү U04B2-Ҳ U04B3-ҳ
U04B4-Ҵ U04B5-ҵ U04BA-Һ U04BB-һ U04BC-Ҽ U04BD-ҽ U04BE-Ҿ U04BF-ҿ U04C0-Ӏ
U04C1-Ӂ U04C2-ӂ U04CB-Ӌ U04CC-ӌ U04D0-Ӑ U04D1-ӑ U04D2-Ӓ U04D3-ӓ U04D6-Ӗ
U04D7-ӗ U04D8-Ә U04D9-ә U04DC-Ӝ U04DD-ӝ U04DE-Ӟ U04DF-ӟ U04E0-Ӡ U04E1-ӡ
U04E4-Ӥ U04E5-ӥ U04E6-Ӧ U04E7-ӧ U04E8-Ө U04E9-ө U04F0-Ӱ U04F1-ӱ U04F2-Ӳ
U04F3-ӳ U04F4-Ӵ U04F5-ӵ U04F8-Ӹ U04F9-ӹ U2019-’


