This is the mail archive of the mailing list for the glibc project.


Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29

On 10.10.2018 13:22, Marko Myllynen wrote:
>> correct link
> Although I haven't checked every rule this in general looks very good
> (but see below). 

> Not sure whether we want to add the few missing characters
> mentioned at,
> e.g., one instantly notices that U+0400 is missing. (I wouldn't add at
> least initially the more exotic characters, like the historic ones,
> though.) Perhaps filing a bug or two for these cases for separate
> consideration would be ok.

The question here is what should serve as their transliteration.
They are covered neither by ISO9 nor by GOST 7.79. So maybe it would be
reasonable to assume there is no notable occurrence of those anywhere?

Anyway I am happy to include your specific suggestions for all and any
Unicode quartets in this form:
[Cyrillic Unicode
; ISO9 Latin Transliteration (System A) as Unicode
; Transcription (System B) as (multicharacter) ASCII
; name to put in %COMMENT]
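
For illustration only, here is a minimal Python sketch of how one such
quartet could be turned into a rule line of the shape quoted further down
in this thread (source, ISO9 Latin, then the quoted ASCII fallback). The
helper names `u` and `translit_rule`, the placeholder name string, and the
use of "%" to start the comment line are assumptions made for this sketch,
not part of the patch.

```python
def u(text):
    """Render a string in <Uxxxx> notation, one token per code point."""
    return "".join(f"<U{ord(ch):04X}>" for ch in text)


def translit_rule(cyrillic, iso9_latin, ascii_fallback, name):
    """Build a comment line and a rule line from one quartet.

    Hypothetical helper: the "% name" comment layout is an assumption.
    """
    comment = f"% {name}"
    rule = f'{u(cyrillic)} {u(iso9_latin)};"{u(ascii_fallback)}"'
    return comment, rule


# Placeholder name: Unicode defines no name for this two-code-point sequence.
comment, rule = translit_rule(
    "\u0423\u0301", "\u00DA", "U`", "PLACEHOLDER NAME FOR U WITH ACUTE"
)
print(comment)
print(rule)
```

The second printed line has the same form as the
<U0423><U0301> rule quoted below.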

>> On 10.10.2018 00:40, Egor Kobylkin wrote:
>>> On 10.10.2018 00:17, Rafal Luzynski wrote:
>>>> 9.10.2018 20:34 Egor Kobylkin <> wrote:
>>>>> The culprits were the "" around the "<U0423><U0301>" (<U00DA>) and
>>>>> "<U0443><U0301>" (<U00FA>).
>>>>> It works now with
>>>>> <U0423><U0301> <U00DA>;"<U0055><U0060>"
>>>>> <U0443><U0301> <U00FA>;"<U0075><U0060>"
>>>>> [...]
>>>> I wonder why you need Cyrillic U with acute, and why you comment it
>>>> as "undefined" at all.  I know that any Cyrillic vowel may appear with
>>>> an acute accent but "the diacritic is used only in dictionaries, children's
>>>> books, resources for foreign-language learners (...)". [1]  So maybe
>>>> all vowels with an acute accent should be handled (which I think is fine)
>>>> rather than just U.
>>> I have just taken the table and
>>> implemented it on Marko's suggestion. Personally I have no opinion on
>>> what letters should be included and under what name. These funny Us just
>>> happened to be in the ISO9 table.
>>> There is no codepoint and no name for <U0423><U0301> and <U0443><U0301>
>>> in Unicode. That’s why it’s coming through that way from my worksheet as
>>> it does a reverse lookup on the names based on the Unicode codepoints.
>>> Manually we can change it to whatever you’d suggest in the
>>> translit_cyrillic. I just don’t know the right name.
> I'm not sure this will work; no existing rule in translit_* files
> contains two characters. I'd assume that the rule for U+0423 is applied
> first and then the below rule is never used.
> <U0423><U0301> <U00DA>;"<U0055><U0060>"
> Perhaps this should be commented out or removed altogether if it's not
> working as intended.

Here is the result of my test on

U0423 0301-У́  -> U0423 0301-U
U0443 0301-у́ -> U0443 0301-u

So yes, they are not processed. I would drop them so as not to have special
cases. But I am also fine with keeping them because all work is done
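
One way to picture why the two-code-point rules never fire is the toy
model below. It is a sketch only and does not claim to reproduce how glibc
actually walks its tables: it simply contrasts a lookup that handles one
code point at a time with a longest-match lookup. The rule table, the
function names, and the assumption that a lone U+0301 maps to an empty
string are all made up for this illustration.

```python
# Toy rule table: keys are source sequences, values are ASCII fallbacks.
RULES = {
    "\u0423": "U",         # CYRILLIC CAPITAL LETTER U
    "\u0443": "u",         # CYRILLIC SMALL LETTER U
    "\u0301": "",          # assumption: a lone combining acute is dropped
    "\u0423\u0301": "U`",  # two-code-point rules as proposed in the patch
    "\u0443\u0301": "u`",
}


def per_code_point(text, rules):
    """Look up one code point at a time; longer keys are never tried."""
    return "".join(rules.get(ch, "?") for ch in text)


def longest_match(text, rules):
    """Try the longest key that matches at each position, for contrast."""
    max_len = max(len(key) for key in rules)
    out, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in rules:
                out.append(rules[chunk])
                i += size
                break
        else:
            # No rule matched at this position; emit a marker and move on.
            out.append("?")
            i += 1
    return "".join(out)


print(per_code_point("\u0423\u0301", RULES))  # U
print(longest_match("\u0423\u0301", RULES))   # U`
```

The first result matches what the test above shows for U0423 0301; the
second is what the two-code-point rule was presumably meant to produce.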

CYRILLIC RUSSIAN S``esh` eshhyo e`tih myagkih francuzskih bulok, da
CYRILLIC COMPLETE U0401-YO U0402-DJ U0403-G` U0404-Ye U0405-Z` U0406-I
U0407-Yi U0408-J U0409-L` U040A-N` U040B-TSH U040C-K` U040E-U` U040F-Dh
U0410-A U0411-B U0412-V U0413-G U0414-D U0415-E U0416-ZH U0417-Z U0418-I
U0419-J U041A-K U041B-L U041C-M U041D-N U041E-O U041F-P U0420-R U0421-S
U0422-T U0423-U U0423 0301-U U0424-F U0425-H U0426-C U0427-CH U0428-SH
U0429-SHH U042A-`` U042B-Y U042C-` U042D-E` U042E-YU U042F-YA U0430-a
U0431-b U0432-v U0433-g U0434-d U0435-e U0436-zh U0437-z U0438-i U0439-j
U043A-k U043B-l U043C-m U043D-n U043E-o U043F-p U0440-r U0441-s U0442-t
U0443-u U0443 0301-u U0444-f U0445-h U0446-c U0447-ch U0448-sh U0449-shh
U044A-A` U044B-y U044C-` U044D-e` U044E-yu U044F-ya U0451-yo U0452-dj
U0453-g` U0454-ye U0455-z` U0456-i U0457-yi U0458-j U0459-l` U045A-n`
U045B-tsh U045C-k` U045E-u` U045F-dh U046A-O` U046B-o` U0472-Fh U0473-fh
U0474-Yh U0475-yh U048C-E` U048D-e`  U0490-G` U0491-g` U0492-GH U0493-gh
U0494-GH U0495-gh U0496-ZH` U0497-zh` U049A-K` U049B-k` U049E-K`
U049F-k` U04A2-N` U04A3-n` U04A4-NG U04A5-ng U04A6-P` U04A7-p` U04A8-O`
U04A9-o` U04AA-C` U04AB-C` U04AC-T` U04AD-t` U04AE-U U04AF-u U04B2-H`
U04B3-h` U04B4-TCZ U04B5-tcz U04BA-SH` U04BB-SH` U04BC-CH` U04BD-ch`
U04BE-CH` U04BF-ch` U04C0-i U04C1-ZH` U04C2-zh` U04CB-CH` U04CC-ch`
U04D0-A` U04D1-a` U04D2-A` U04D3-a` U04D6-E` U04D7-e` U04D8-A` U04D9-a`
U04DC-ZH` U04DD-zh` U04DE-Z` U04DF-z` U04E0-Z` U04E1-z` U04E4-I`
U04E5-i` U04E6-O` U04E7-o` U04E8-O` U04E9-o` U04F0-U` U04F1-u` U04F2-U`
U04F3-u` U04F4-CH` U04F5-ch` U04F8-Y` U04F9-y` U2019-'

CYRILLIC RUSSIAN Съешь ещё этих мягких французских булок, да выпей же
CYRILLIC COMPLETE U0401-Ё U0402-Ђ U0403-Ѓ U0404-Є U0405-Ѕ U0406-І
U0407-Ї U0408-Ј U0409-Љ U040A-Њ U040B-Ћ U040C-Ќ U040E-Ў U040F-Џ U0410-А
U0411-Б U0412-В U0413-Г U0414-Д U0415-Е U0416-Ж U0417-З U0418-И U0419-Й
U041A-К U041B-Л U041C-М U041D-Н U041E-О U041F-П U0420-Р U0421-С U0422-Т
U0423-У U0423 0301-У́ U0424-Ф U0425-Х U0426-Ц U0427-Ч U0428-Ш U0429-Щ
U042A-ъ U042B-Ы U042C-ь U042D-Э U042E-Ю U042F-Я U0430-а U0431-б U0432-в
U0433-г U0434-д U0435-е U0436-ж U0437-з U0438-и U0439-й U043A-к U043B-л
U043C-м U043D-н U043E-о U043F-п U0440-р U0441-с U0442-т U0443-у U0443
0301-у́ U0444-ф U0445-х U0446-ц U0447-ч U0448-ш U0449-щ U044A-Ъ U044B-ы
U044C-Ь U044D-э U044E-ю U044F-я U0451-ё U0452-ђ U0453-ѓ U0454-є U0455-ѕ
U0456-і U0457-ї U0458-ј U0459-љ U045A-њ U045B-ћ U045C-ќ U045E-ў U045F-џ
U046A-Ѫ U046B-ѫ U0472-Ѳ U0473-ѳ U0474-Ѵ U0475-ѵ U048C-Ҍ U048D-ҍ  U0490-Ґ
U0491-ґ U0492-Ғ U0493-ғ U0494-Ҕ U0495-ҕ U0496-Җ U0497-җ U049A-Қ U049B-қ
U049E-Ҟ U049F-ҟ U04A2-Ң U04A3-ң U04A4-Ҥ U04A5-ҥ U04A6-Ҧ U04A7-ҧ U04A8-Ҩ
U04A9-ҩ U04AA-Ҫ U04AB-ҫ U04AC-Ҭ U04AD-ҭ U04AE-Ү U04AF-ү U04B2-Ҳ U04B3-ҳ
U04B4-Ҵ U04B5-ҵ U04BA-Һ U04BB-һ U04BC-Ҽ U04BD-ҽ U04BE-Ҿ U04BF-ҿ U04C0-Ӏ
U04C1-Ӂ U04C2-ӂ U04CB-Ӌ U04CC-ӌ U04D0-Ӑ U04D1-ӑ U04D2-Ӓ U04D3-ӓ U04D6-Ӗ
U04D7-ӗ U04D8-Ә U04D9-ә U04DC-Ӝ U04DD-ӝ U04DE-Ӟ U04DF-ӟ U04E0-Ӡ U04E1-ӡ
U04E4-Ӥ U04E5-ӥ U04E6-Ӧ U04E7-ӧ U04E8-Ө U04E9-ө U04F0-Ӱ U04F1-ӱ U04F2-Ӳ
U04F3-ӳ U04F4-Ӵ U04F5-ӵ U04F8-Ӹ U04F9-ӹ U2019-’
