Bug 2872 - Transliteration Cyrillic -> ASCII fails
Summary: Transliteration Cyrillic -> ASCII fails
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: locale
Version: 2.3.6
Importance: P2 normal
Target Milestone: 2.30
Assignee: Egor Kobylkin
URL: https://en.wikipedia.org/wiki/ISO_9
Keywords: glibc_2.29
Depends on:
Blocks:
 
Reported: 2006-07-02 08:31 UTC by Eduard Bloch
Modified: 2019-07-23 19:14 UTC
CC List: 7 users

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments
the LibreOffice Calc spreadsheet used to create the translit_cyrillic file (407.21 KB, application/octet-stream)
2015-09-07 23:07 UTC, Egor Kobylkin
Details
translation table for transliteration of cyrillic to ascii (528 bytes, text/plain)
2015-09-07 23:10 UTC, Egor Kobylkin
Details
a version that works for localedef (884 bytes, text/plain)
2015-09-08 09:11 UTC, Egor Kobylkin
Details
multi-character transliteration table cyrillic->ascii with fallback to single-character (995 bytes, text/plain)
2015-09-08 09:57 UTC, Egor Kobylkin
Details
the LibreOffice Calc spreadsheet used to create the translit_cyrillic file with multi-character transliteration (407.81 KB, application/octet-stream)
2015-09-08 09:59 UTC, Egor Kobylkin
Details
multi-character transliteration table cyrillic->ascii with fallback to single-character (928 bytes, text/plain)
2015-09-08 10:05 UTC, Egor Kobylkin
Details
test file for https://sourceware.org/glibc/wiki/Locales#Testing_Locales (467 bytes, text/plain)
2015-09-18 16:37 UTC, Egor Kobylkin
Details
the patch adding translit_cyrillic and including it into locales (4.49 KB, patch)
2018-07-18 09:24 UTC, Egor Kobylkin
Details | Diff
screenshot of the ISO 9:1995/GOST_7.79_System_B cyrillic transliteration table (Ru) (30.04 KB, image/png)
2018-10-05 10:35 UTC, Egor Kobylkin
Details
the LibreOffice Calc spreadsheet used to create the translit_cyrillic file with the transliteration (1.04 MB, application/octet-stream)
2018-10-06 20:05 UTC, Egor Kobylkin
Details
Transliteration table Cyrillic->Latin with fallback to ASCII (2.67 KB, text/plain)
2018-10-06 20:07 UTC, Egor Kobylkin
Details
Transliteration table Cyrillic->Latin with fallback to ASCII (2.82 KB, text/plain)
2018-10-06 20:44 UTC, Egor Kobylkin
Details
test file for https://sourceware.org/glibc/wiki/Locales#Testing_Locales (1.23 KB, text/plain)
2018-10-06 20:49 UTC, Egor Kobylkin
Details
LibreOffice Calc spreadsheet used to create the translit_cyrillic file with the transliteration (1.04 MB, application/octet-stream)
2018-10-08 22:46 UTC, Egor Kobylkin
Details
LibreOffice Calc spreadsheet used to generate the translit_cyrillic file with the transliteration (1.04 MB, application/octet-stream)
2018-10-08 23:30 UTC, Egor Kobylkin
Details
LibreOffice Calc spreadsheet used to generate the translit_cyrillic file with the transliteration (1.04 MB, application/octet-stream)
2018-10-09 10:43 UTC, Egor Kobylkin
Details
LibreOffice Calc spreadsheet used to generate the translit_cyrillic file with the transliteration (1.05 MB, application/octet-stream)
2018-10-09 19:00 UTC, Egor Kobylkin
Details
Transliteration table Cyrillic->Latin with fallback to ASCII (2.87 KB, text/plain)
2018-10-09 19:06 UTC, Egor Kobylkin
Details
the patch adding translit_cyrillic and including it into locales (6.68 KB, text/plain)
2018-10-09 19:09 UTC, Egor Kobylkin
Details
test file for https://sourceware.org/glibc/wiki/Locales#Testing_Locales (1.31 KB, text/plain)
2018-10-09 19:11 UTC, Egor Kobylkin
Details
the patch adding translit_cyrillic and including it into locales (6.65 KB, patch)
2018-10-11 15:41 UTC, Egor Kobylkin
Details | Diff
Transliteration table Cyrillic->Latin with fallback to ASCII (2.86 KB, text/plain)
2018-10-11 15:43 UTC, Egor Kobylkin
Details
the patch adding translit_cyrillic and including it into locales (7.26 KB, patch)
2018-10-16 08:15 UTC, Egor Kobylkin
Details | Diff
copy of localedata/bug-iconv-trans.c for cyrillic (3.77 KB, text/plain)
2018-10-16 08:45 UTC, Egor Kobylkin
Details
Transliteration table Cyrillic->Latin with fallback to ASCII (2.87 KB, text/plain)
2018-10-17 13:53 UTC, Egor Kobylkin
Details
the patch adding translit_cyrillic and including it into locales (7.44 KB, patch)
2018-10-17 14:10 UTC, Egor Kobylkin
Details | Diff
the patch adding translit_cyrillic and including it into locales (10.16 KB, patch)
2018-11-15 08:55 UTC, Egor Kobylkin
Details | Diff
The patch adding translit_cyrillic and including it into locales (9.70 KB, patch)
2018-11-19 11:04 UTC, Egor Kobylkin
Details | Diff
Transliteration table Cyrillic-> ASCII (2.38 KB, text/plain)
2018-11-19 11:06 UTC, Egor Kobylkin
Details
LibreOffice Calc spreadsheet used to generate the translit_cyrillic file with the transliteration (1.06 MB, application/octet-stream)
2018-11-19 13:54 UTC, Egor Kobylkin
Details
The patch adding Cyrillic translit to locale/C-translit.h.in (2.49 KB, patch)
2018-12-08 22:31 UTC, Egor Kobylkin
Details | Diff
LibreOffice Calc spreadsheet used to generate the Cyrillic transliteration table (1.06 MB, application/octet-stream)
2018-12-08 22:34 UTC, Egor Kobylkin
Details
The patch adding Cyrillic translit to locale/C-translit.h.in (2.24 KB, patch)
2019-01-02 18:40 UTC, Egor Kobylkin
Details | Diff

Description Eduard Bloch 2006-07-02 08:31:01 UTC
Hello,

I tried to convert some text from Cyrillic (UTF-8) to ASCII using the
//translit flag. However, it fails badly: all characters are just replaced
with ?. It seems to be independent of my current locale; I can set
en_US.UTF-8, de_DE.UTF-8, or ru_RU.UTF-8, and it still fails.

Transliteration of Latin seems to work, though:

echo Müßte асдфасфд | LANG=de_DE.UTF-8 iconv -f UTF-8 -t ASCII//translit
Muesste ????????
echo Müßte асдфасфд | LANG=fr_FR.UTF-8 iconv -f UTF-8 -t ASCII//translit
Musste ????????

I had a "discussion" with a Debian maintainer of glibc, who indicated that the
problem is in the locale which controls the transliterations... but, from my
POV, there should be a default fallback when there is no other transliteration
scheme. And I remember that it has been working with glibc some months or years ago.
Comment 1 Danilo Segan 2006-07-20 10:18:44 UTC
Works for me with sr_CS locale:

$ echo 'Müßte Данило' | LANG=sr_CS.UTF-8 iconv -t ASCII//translit
Musste Danilo
Comment 2 Ulrich Drepper 2007-02-17 19:24:30 UTC
Transliteration is locale dependent, there is no way around it:

Russian/Cyrillic:  Горбачёв

German transliteration: Gorbaschow

English transliteration: Gorbatsov or Gorbatsev

If you want cyrillic transliteration for the locale you use, provide the data.
Comment 3 Eduard Bloch 2007-02-17 22:17:06 UTC
"WORKSFORME" implies that you cannot reproduce the problem, but... does
transliterating Горбачов work with English or not? (see below) If not, how can
it be "resolved"?

Or what are you trying to say with "provide the data"? That there is no data
yet? I have seen it working some years ago. With correct US-style
transliterations. If it is broken now or data has been lost, then TRANSLIT maybe
should be disabled and throw an error immediately. Currently it produces crap
and AFAICS there is no proper documentation explaining why.

The crap it creates is not even consistent within the same language/country
pair, without .UTF-8 suffix it produces more funny non-sense.

echo Горбачов | LANG=de_DE.UTF-8 iconv -t ASCII//TRANSLIT
????????
echo Горбачов | LANG=de_DE.UTF-8 iconv -t ASCII//TRANSLIT
????????
echo Горбачов | LANG=de_DE iconv -t ASCII//TRANSLIT
??? 3/4 N?????N?? 3/4 ??
echo Горбачов | LANG=en_US iconv -t ASCII//TRANSLIT
iconv: illegal input sequence at position 0
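One plausible reading of the de_DE result above, offered as an assumption rather than something confirmed in this report: with no -f option, iconv takes the input encoding from the locale, and de_DE without the .UTF-8 suffix typically means ISO-8859-1. The UTF-8 bytes would then be misread as Latin-1 characters before transliteration. The sketch below reproduces that effect; the two-entry table is hypothetical and was chosen only to match the output shown.

```python
# Sketch: misread UTF-8 bytes as ISO-8859-1, then apply an ASCII fallback.
# ASSUMPTION: this tiny table is illustrative, not glibc's actual locale data.
ASSUMED_TRANSLIT = {
    "\u00d1": "N",      # LATIN CAPITAL LETTER N WITH TILDE
    "\u00be": " 3/4 ",  # VULGAR FRACTION THREE QUARTERS
}


def misread_and_translit(text: str) -> str:
    """Encode as UTF-8, decode as Latin-1, replace non-ASCII characters."""
    raw = text.encode("utf-8")
    misread = raw.decode("latin-1")  # every byte becomes one character
    return "".join(
        ch if ord(ch) < 128 else ASSUMED_TRANSLIT.get(ch, "?")
        for ch in misread
    )


print(misread_and_translit("Горбачов"))
```

If this reading is right, the "non-sense" is not a transliteration of Cyrillic at all: each Cyrillic letter arrives as two unrelated Latin-1 characters.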

Comment 4 Ulrich Drepper 2007-02-19 00:51:59 UTC
It all works as designed given the data provided.  If you want change, provide
the data.  Otherwise go away.
Comment 5 Egor Kobylkin 2015-09-07 11:17:59 UTC
I would like to try to supply the data you need to make the Cyrillic transliteration work for the ru_RU locale. Could you point me to an example of the data you would need?

Here is what I have tried just to see what works.
$ echo Лаковый |LANG=ru_RU.UTF-8 iconv -t ASCII//TRANSLIT
iconv: (stdin):1:0: cannot convert

$ echo Лаковый |LANG=sr_CS.UTF-8 iconv -t ASCII//TRANSLIT
iconv: (стдул):1:0: не може претворити

$ echo Лаковый |LANG=de_DE.UTF-8 iconv -t ASCII//TRANSLIT
iconv: (Standard-Eingabe):1:0: Kann nicht umwandeln.

$ echo Müßte |LANG=de_DE.UTF-8 iconv -t ASCII//TRANSLIT
M"usste

$ echo Лаковый |LANG=en_US.UTF-8 iconv -t ASCII//TRANSLIT
iconv: (stdin):1:0: cannot convert

$ echo Müßte |LANG=en_US.UTF-8 iconv -t ASCII//TRANSLIT
M"usste
Comment 6 Andreas Schwab 2015-09-07 11:51:15 UTC
Make sure you are not using any local modifications.

$ echo Лаковый |LANG=ru_RU.UTF-8 iconv -t ASCII//TRANSLIT
???????
$ echo Лаковый |LANG=sr_CS.UTF-8 iconv -t ASCII//TRANSLIT
iconv: illegal input sequence at position 0
$ echo Лаковый |LANG=de_DE.UTF-8 iconv -t ASCII//TRANSLIT
???????
$ echo Müßte |LANG=de_DE.UTF-8 iconv -t ASCII//TRANSLIT
Muesste
$ echo Лаковый |LANG=en_US.UTF-8 iconv -t ASCII//TRANSLIT
???????
$ echo Müßte |LANG=en_US.UTF-8 iconv -t ASCII//TRANSLIT
Musste
Comment 7 Egor Kobylkin 2015-09-07 12:17:53 UTC
Andreas, in your example the Cyrillic transliteration does not work either. My understanding is that the tool lacks a transliteration table for Cyrillic, for example in the ru_RU locale. This is what Ulrich Drepper asks for in comment 2: https://sourceware.org/bugzilla/show_bug.cgi?id=2872#c2
In which form should this data be provided?

I am only concerned with the Cyrillic for now. German serves as an example that the functionality works at all in at least one case.

My first submission was from Cygwin on Windows 7. While it may indeed be some effect of Cygwin (the name similarity to Cyrillic is coincidental), I have just tried the same on Ubuntu 12 with essentially the same effect.

$ echo Лаковый |LANG=ru_RU.UTF-8 iconv -t ASCII//TRANSLIT
iconv: illegal input sequence at position 0
$ echo Лаковый |LANG=sr_CS.UTF-8 iconv -t ASCII//TRANSLIT
iconv: illegal input sequence at position 0
$ echo Лаковый |LANG=de_DE.UTF-8 iconv -t ASCII//TRANSLIT
iconv: illegal input sequence at position 0
$ echo Müßte |LANG=de_DE.UTF-8 iconv -t ASCII//TRANSLIT
Miconv: illegal input sequence at position 1
$ echo Лаковый |LANG=en_US.UTF-8 iconv -t ASCII//TRANSLIT
???????
$ echo Müßte |LANG=en_US.UTF-8 iconv -t ASCII//TRANSLIT
Musste
Comment 8 Marko Myllynen 2015-09-07 15:07:10 UTC
You'll need to set up a testing environment where you can see how your changes affect iconv and then try to come up with proper rules. The following links should help you get started (and are pretty much all the documentation there is):

https://sourceware.org/glibc/wiki/Locales
http://man7.org/linux/man-pages/man1/iconv.1.html
https://sourceware.org/bugzilla/show_bug.cgi?id=16061
https://sourceware.org/ml/libc-alpha/2015-07/msg00836.html
Comment 9 Egor Kobylkin 2015-09-07 23:07:10 UTC
Created attachment 8585 [details]
the LibreOffice Calc spreadsheet used to create the translit_cyrillic file
Comment 10 Egor Kobylkin 2015-09-07 23:10:50 UTC
Created attachment 8586 [details]
translation table for transliteration of cyrillic to ascii

Single character version. Up to three characters are required to do a reversible transliteration. The table for a reversible transliteration can be created through the same spreadsheet included here.
Comment 11 Egor Kobylkin 2015-09-07 23:34:11 UTC
I have read the linked documents from Marko Myllynen Comment 8. 
My understanding so far is that apart from possibly required code parts that are not clear yet to me there should be a translation table for the transliteration.

Based on the 
man page http://man7.org/linux/man-pages/man5/locale.5.html
Russian GOST 7.79-2000 official transliteration table http://transliteration.ru/gost-7-79-2000/
and the Unicode file http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
I have created a single-character transliteration table in the form of the following list:
% CYRILLIC CAPITAL LETTER IO
<U0401> <U0059>
% CYRILLIC CAPITAL LETTER A
<U0410> <U0041>
% CYRILLIC CAPITAL LETTER BE
<U0411> <U0042>
% CYRILLIC CAPITAL LETTER VE
<U0412> <U0056>
etc.
First Unicode value is the Cyrillic letter and the second is a corresponding ASCII symbol.

The file is attached as translit_cyrillic. 
I wonder if it could be useful already for inclusion into the Latin based locales files via "include" keyword.

Please let me know what you think. Specifically my understanding is that this is the list that Ulrich Drepper was requesting.

I would be grateful if somebody familiar with the logic behind the transliteration file structure could outline the missing parts in case the above is not sufficient to bootstrap the Cyrillic-to-ASCII transliteration.
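To make the table format quoted in this comment concrete, here is a minimal sketch that reads the four sample entries and applies them to a string. It only illustrates the mapping idea; the function names are hypothetical and this is not how localedef or iconv process the file.

```python
import re

# The four sample entries quoted in the comment above.
TABLE_TEXT = """\
% CYRILLIC CAPITAL LETTER IO
<U0401> <U0059>
% CYRILLIC CAPITAL LETTER A
<U0410> <U0041>
% CYRILLIC CAPITAL LETTER BE
<U0411> <U0042>
% CYRILLIC CAPITAL LETTER VE
<U0412> <U0056>
"""

# One source code point, whitespace, one target code point.
ENTRY = re.compile(r"^<U([0-9A-Fa-f]{4})>\s+<U([0-9A-Fa-f]{4})>$")


def parse_table(text):
    """Return {source_char: target_char}, skipping '%' comment lines."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue
        match = ENTRY.match(line)
        if match:
            source, target = (chr(int(h, 16)) for h in match.groups())
            table[source] = target
    return table


def transliterate(text, table):
    """Replace each character that has a table entry; keep the rest."""
    return "".join(table.get(ch, ch) for ch in text)


table = parse_table(TABLE_TEXT)
print(transliterate("ЁАБВ", table))
```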
Comment 12 Marko Myllynen 2015-09-08 06:35:36 UTC
I don't read Cyrillic but technically the table looks like what would be expected. I'm CC'ing Mike Fabian who has done the heavy-lifting for bug 16061 - Mike how does this look like to you?

If you didn't do so already, please test your changes, the earlier mentioned wiki page and the following man pages should provide all the needed information.

http://man7.org/linux/man-pages/man1/locale.1.html
http://man7.org/linux/man-pages/man1/localedef.1.html
http://man7.org/linux/man-pages/man7/locale.7.html
Comment 13 Egor Kobylkin 2015-09-08 07:40:34 UTC
Thank you for the feedback, Marko!
I will do the testing as suggested and will supply the multi-character transliteration as well. While for my purposes a single-character would do, it should be more practical to have the multi-character one in place.
Comment 14 Egor Kobylkin 2015-09-08 09:11:07 UTC
Created attachment 8588 [details]
a version that works for localedef

I have tested it with the en_GB locale by including it in the following section:
LC_CTYPE
copy "i18n"

translit_start
include "translit_combining";"translit_cyrillic";""
translit_end
END LC_CTYPE

let's copy en_GB to en_TR for the testing purposes and generate the new locale en_TR.UTF-8 while being in glibc/localedata/locales
I18NPATH=./ localedef -f UTF-8 -i en_TR en_TR.UTF-8
Now we can test the transliteration 

$echo Съешь ещё этих мягких французских булок, да выпей же чаю |LOCPATH=. LC_ALL=en_TR.UTF-8 LANG=en_TR.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT
S`es` esy etix mygkix francuzskix bulok, da vypej ze cay
Comment 15 Egor Kobylkin 2015-09-08 09:57:14 UTC
Created attachment 8589 [details]
multi-character transliteration table cyrillic->ascii with fallback to single-character

I have not tested the fallback to the single-character table but have included it in case somebody needs it. I am not sure how to test it.
The file should ideally be included in all Latin based locales via include in this section as follows (example)
LC_CTYPE
copy "i18n"

translit_start
include "translit_combining";"translit_cyrillic";""
translit_end
END LC_CTYPE
Comment 16 Egor Kobylkin 2015-09-08 09:59:45 UTC
Created attachment 8590 [details]
the LibreOffice Calc spreadsheet used to create the translit_cyrillic file with multi-character transliteration
Comment 17 Egor Kobylkin 2015-09-08 10:05:59 UTC
Created attachment 8591 [details]
multi-character transliteration table cyrillic->ascii with fallback to single-character

correction: updated the comment in the file to reflect the new spreadsheet and the multi-character feature

I have not tested the fallback to the single-character table but have included it in case somebody needs it. I am not sure how to test it.
The file should ideally be included in all Latin based locales via include in this section as follows (example)
LC_CTYPE
copy "i18n"

translit_start
include "translit_combining";"translit_cyrillic";""
translit_end
END LC_CTYPE
Comment 18 Egor Kobylkin 2015-09-08 10:20:52 UTC
(In reply to Ulrich Drepper from comment #2)
> Transliteration is locale dependent, there is no way around it:
> 
> Russian/Cyrillic:  Горбачёв
> 
> German transliteration: Gorbaschow
> 
> English transliteration: Gorbatsov or Gorbatsev
> 
> If you want cyrillic transliteration for the locale you use, provide the
> data.
I want to comment on this to clarify my starting point and to ask for suggestions in case somebody decides to take on further development. For now I believe the issue is solved, albeit in a very basic way.

From a Russian-speaking person's point of view, various transliterations of Cyrillic are possible depending on the purpose. A good example of the multiplicity of such transliterations is listed here: http://transliteration.ru/ Although they use different characters to represent the Cyrillic letters, they carry the same phonetic meaning for a Russian-speaking person, so any of them could be used for all the Latin locales. This is what I propose as a first approximation to solve this issue. My submission above takes this approach, with the GOST 7.79-2000 transliteration chosen as the basis.

For a non-Russian-speaking person, a different transliteration may make sense to represent their phonetic rules. This is what Ulrich is referring to in his comment above.

One could take my table as a basis and create separate transliteration tables for specific locales. The one I have proposed could then still serve as an ASCII//TRANSLIT target or be replaced by a more suitable one.
Comment 19 Egor Kobylkin 2015-09-18 09:29:03 UTC
*** Bug 12031 has been marked as a duplicate of this bug. ***
Comment 20 Egor Kobylkin 2015-09-18 09:48:06 UTC
*** Bug 89 has been marked as a duplicate of this bug. ***
Comment 21 Egor Kobylkin 2015-09-18 12:54:58 UTC
Tested including the Greeklish_transliteration https://sourceware.org/bugzilla/attachment.cgi?id=6380 from the duplicate of this bug along with the Cyrillic transliteration table proposed in this bug.

Copied the file translit_greeklish to glibc/localedata/locales. In a copy of the en_GB locale en_TR2 changed this section 
LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";""
include "translit_cyrillic";""
include "translit_greeklish";""
translit_end
END LC_CTYPE

generated the en_TR2 locale I18NPATH=./ localedef -f UTF-8 -i en_TR2 ../../../en_TR/en_TR2.UTF-8

echo CYRILLIC Съешь ещё этих мягких французских булок, да выпей же чаю GREEK Ελληνικό Ίδρυμα Ευρωπαϊκής και Εξωτερικής |LOCPATH=.../en_TR/ LC_ALL=en_TR2.UTF-8 LANG=en_TR2.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT
CYRILLIC S``esh` eshhyo e`tix myagkix franczuzskix bulok, da vy'pej zhe chayu GREEK Ellhniko Idryma Eyrwpaikhs kai Ekswterikhs

Test successful.
Comment 22 Egor Kobylkin 2015-09-18 13:07:22 UTC
Reset the bug status to "NEW", to signify it's ready for review by maintainers of the library
Comment 23 Marko Myllynen 2015-09-18 13:20:43 UTC
Looks like you're making very good progress here. According to https://sourceware.org/glibc/wiki/Locales the next step would be to check the Contribution checklist at https://sourceware.org/glibc/wiki/Contribution%20checklist and post your patches to libc-alpha + libc-locales for formal review.

However, please be aware that for example Mike's translit update patch has been pending for a review for many months already [1] so having the patches included might take a while. But the first step anyway is to post them to the lists.

1) https://sourceware.org/ml/libc-alpha/2015-09/msg00190.html

Thanks.
Comment 24 Egor Kobylkin 2015-09-18 14:06:36 UTC
Marko,

thanks for reviewing, I will proceed as you propose.

Just in case you know it would be great to have your advice:
In order to get the translit included into the default C.UTF8 locale what is the venue to discuss that? 

It is a default Cygwin locale and there is no way to generate an own locale in Cygwin environment AFAIK. But neither could I re-generate the C.UTF8 from the original POSIX file on my Ubuntu system to test.

I get the same error messages as listed here. 
http://ask.debian.net/questions/how-to-generate-a-c-utf-8-locale-in-debian-squeeze

So this appears to be a blocker to generate a patch. It seems the POSIX source file for C.UTF8 is somehow broken for Ubuntu. Do I need to file another bug for that or is that by design?
Comment 25 Egor Kobylkin 2015-09-18 16:14:41 UTC
Pangrams in five languages to test the transliteration.

echo CYRILLIC Съешь ещё этих мягких французских булок, да выпей же чаю GREEK Ελληνικό Ίδρυμα Ευρωπαϊκής και Εξωτερικής GERMAN Zwölf Boxkämpfer jagen Victor quer über den großen Sylter Deich FRENCH Dès Noël où un zéphyr haï me vêt de glaçons würmiens je dîne d’exquis rôtis de bœuf au kir à l’aÿ d’âge mûr \& cætera SPANISH El veloz murciélago hindú comía feliz cardillo y kiwi, la cigüeña tocaba el saxofón detrás del palenque de paja|LOCPATH=./ LC_ALL=en_TR2.UTF-8 LANG=en_TR2.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT

And the result so you can compare. 
CYRILLIC S``esh` eshhyo e`tix myagkix franczuzskix bulok, da vy'pej zhe chayu GREEK Ellhniko Idryma Eyrwpaikhs kai Ekswterikhs GERMAN Zwolf Boxkampfer jagen Victor quer uber den grossen Sylter Deich FRENCH Des Noel ou un zephyr hai me vet de glacons wurmiens je dine d'exquis rotis de boeuf au kir a l'ay d'age mur & caetera SPANISH El veloz murcielago hindu comia feliz cardillo y kiwi, la ciguena tocaba el saxofon detras del palenque de paja
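When comparing results like the one above by eye, a small automated check can help. The sketch below is a hypothetical helper, not part of the glibc test suite. It flags output that still contains non-ASCII characters or the ? placeholder that iconv emits for characters it could not transliterate.

```python
def looks_transliterated(output: str) -> bool:
    """True if output is pure ASCII and contains no '?' placeholder.

    Limitation: a '?' that was already present in the input text is
    also reported as a failure, so use this only on inputs without '?'.
    """
    return output.isascii() and "?" not in output


# Cyrillic part of the result shown above.
good = "S``esh` eshhyo e`tix myagkix franczuzskix bulok, da vy'pej zhe chayu"
print(looks_transliterated(good))       # pure ASCII, no placeholder
print(looks_transliterated("???????"))  # the failure mode from earlier comments
```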
Comment 26 Egor Kobylkin 2015-09-18 16:37:15 UTC
Created attachment 8618 [details]
test file for 
https://sourceware.org/glibc/wiki/Locales#Testing_Locales
Comment 27 Marko Myllynen 2015-09-21 05:30:26 UTC
(In reply to Egor Kobylkin from comment #24)
> 
> Just in case you know it would be great to have your advice:
> In order to get the translit included into the default C.UTF8 locale what is
> the venue to discuss that? 

I think it's best to proceed one step at a time - as said it might take a while to have even the rules included and then C.UTF-8 would need to be implemented in upstream (see bug 17318).

> It is a default Cygwin locale and there is no way to generate an own locale
> in Cygwin environment AFAIK. But neither could I re-generate the C.UTF8 from
> the original POSIX file on my Ubuntu system to test.
> 
> I get the the same error messages as listed here. 
> http://ask.debian.net/questions/how-to-generate-a-c-utf-8-locale-in-debian-
> squeeze
> 
> So this appears to be a blocker to generate a patch. It seems the POSIX
> source file for C.UTF8 is somehow broken for Ubuntu. Do I need to file
> another bug for that or is that by design?

These all sound like distribution / downstream related issues which should be handled there, not in glibc upstream.

Thanks.
Comment 28 Egor Kobylkin 2018-07-18 09:15:30 UTC
I have submitted the patch to libc-alpha and libc-locales https://sourceware.org/ml/libc-alpha/2018-07/msg00503.html and was asked to re-submit in August 2018 to be reviewed for 2.29 inclusion https://sourceware.org/ml/libc-alpha/2018-07/msg00506.html
Comment 29 Egor Kobylkin 2018-07-18 09:24:17 UTC
Created attachment 11144 [details]
the patch adding translit_cyrillic and including it into locales

From this patch I have excluded locales that already mention cyrillic or
have a transliteration table for it:
az_AZ
iso14651_t1_common
ky_KG
mn_MN
sr_RS
tg_TJ
tk_TM
tt_RU
uk_UA
uz_UZ
uz_UZ@cyrillic

Their maintainers are requested to make an explicit decision on how and
whether at all to include this patch.

[BZ #2872]
	* locales/translit_cyrillic: add Russian GOST 7.79-2000 transliteration
table from Cyrillic to Latin.
	* locales/C: add include "translit_cyrillic";"" to LC_CTYPE translit
section.
	* locales/aa_DJ: likewise
	* locales/af_ZA: likewise
	* locales/ak_GH: likewise
	* locales/am_ET: likewise
	* locales/ar_EG: likewise
	* locales/be_BY: likewise
	* locales/bem_ZM: likewise
	* locales/ber_DZ: likewise
	* locales/ber_MA: likewise
	* locales/bg_BG: likewise
	* locales/bi_VU: likewise
	* locales/bn_BD: likewise
	* locales/bo_CN: likewise
	* locales/ca_ES: likewise
	* locales/ce_RU: likewise
	* locales/cs_CZ: likewise
	* locales/cv_RU: likewise
	* locales/cy_GB: likewise
	* locales/da_DK: likewise
	* locales/de_DE: likewise
	* locales/dv_MV: likewise
	* locales/dz_BT: likewise
	* locales/el_GR: likewise
	* locales/en_GB: likewise
	* locales/en_NG: likewise
	* locales/en_ZM: likewise
	* locales/es_CU: likewise
	* locales/es_ES: likewise
	* locales/et_EE: likewise
	* locales/fa_IR: likewise
	* locales/ff_SN: likewise
	* locales/fi_FI: likewise
	* locales/fr_FR: likewise
	* locales/ga_IE: likewise
	* locales/gd_GB: likewise
	* locales/gu_IN: likewise
	* locales/gv_GB: likewise
	* locales/he_IL: likewise
	* locales/hi_IN: likewise
	* locales/hif_FJ: likewise
	* locales/hr_HR: likewise
	* locales/ht_HT: likewise
	* locales/hu_HU: likewise
	* locales/hy_AM: likewise
	* locales/id_ID: likewise
	* locales/is_IS: likewise
	* locales/it_IT: likewise
	* locales/ja_JP: likewise
	* locales/kk_KZ: likewise
	* locales/km_KH: likewise
	* locales/kn_IN: likewise
	* locales/ko_KR: likewise
	* locales/ks_IN: likewise
	* locales/kw_GB: likewise
	* locales/lb_LU: likewise
	* locales/lg_UG: likewise
	* locales/lij_IT: likewise
	* locales/ln_CD: likewise
	* locales/lo_LA: likewise
	* locales/lt_LT: likewise
	* locales/lv_LV: likewise
	* locales/mg_MG: likewise
	* locales/mhr_RU: likewise
	* locales/mk_MK: likewise
	* locales/ml_IN: likewise
	* locales/ms_MY: likewise
	* locales/mt_MT: likewise
	* locales/nan_TW@latin: likewise
	* locales/nb_NO: likewise
	* locales/ne_NP: likewise
	* locales/nhn_MX: likewise
	* locales/niu_NU: likewise
	* locales/niu_NZ: likewise
	* locales/nl_NL: likewise
	* locales/nr_ZA: likewise
	* locales/oc_FR: likewise
	* locales/om_KE: likewise
	* locales/or_IN: likewise
	* locales/os_RU: likewise
	* locales/pa_IN: likewise
	* locales/pa_PK: likewise
	* locales/pl_PL: likewise
	* locales/pt_PT: likewise
	* locales/quz_PE: likewise
	* locales/ro_RO: likewise
	* locales/ru_RU: likewise
	* locales/rw_RW: likewise
	* locales/sa_IN: likewise
	* locales/sd_IN: likewise
	* locales/sd_IN@devanagari: likewise
	* locales/sd_PK: likewise
	* locales/se_NO: likewise
	* locales/sgs_LT: likewise
	* locales/si_LK: likewise
	* locales/sk_SK: likewise
	* locales/sl_SI: likewise
	* locales/sm_WS: likewise
	* locales/so_SO: likewise
	* locales/sq_AL: likewise
	* locales/ss_ZA: likewise
	* locales/st_ZA: likewise
	* locales/sv_SE: likewise
	* locales/sw_KE: likewise
	* locales/ta_IN: likewise
	* locales/te_IN: likewise
	* locales/th_TH: likewise
	* locales/ti_ET: likewise
	* locales/tn_ZA: likewise
	* locales/to_TO: likewise
	* locales/tpi_PG: likewise
	* locales/tr_TR: likewise
	* locales/ts_ZA: likewise
	* locales/unm_US: likewise
	* locales/ur_IN: likewise
	* locales/ur_PK: likewise
	* locales/ve_ZA: likewise
	* locales/vi_VN: likewise
	* locales/wa_BE: likewise
	* locales/wo_SN: likewise
	* locales/xh_ZA: likewise
	* locales/yi_US: likewise
	* locales/zh_CN: likewise
	* locales/zu_ZA: likewise
Comment 30 Egor Kobylkin 2018-07-18 09:25:49 UTC
Setting the status to New again - there is now a patch to review.
Comment 31 Egor Kobylkin 2018-10-05 10:35:35 UTC
Created attachment 11289 [details]
screenshot of the ISO 9:1995/GOST_7.79_System_B cyrillic transliteration table (Ru)

for a quick look up. source: http://transliteration.ru/gost-7-79-2000/
Comment 32 Egor Kobylkin 2018-10-06 20:05:19 UTC
Created attachment 11290 [details]
the LibreOffice Calc spreadsheet used to create the translit_cyrillic file with the transliteration

This version implements the ISO 9:1995/GOST_7.79 System A and System B Cyrillic transliteration table. The System B is extended from GOST_7.79 using open sources of transliteration mapping.
Comment 33 Egor Kobylkin 2018-10-06 20:07:35 UTC
Created attachment 11291 [details]
Transliteration table Cyrillic->Latin with fallback to ASCII

Now it has System A (Latin Script) as a first option and System B (ASCII) as a second option for each entry.
Comment 34 Egor Kobylkin 2018-10-06 20:44:16 UTC
Created attachment 11292 [details]
Transliteration table Cyrillic->Latin with fallback to ASCII

Now it has System A (Latin Script) as a first option and System B (ASCII) as a second option for each entry. Added some explanation and reference to ISO 9.1995 as a comment.
Comment 35 Egor Kobylkin 2018-10-06 20:49:55 UTC
Created attachment 11293 [details]
test file for https://sourceware.org/glibc/wiki/Locales#Testing_Locales

Now with the characters from ISO 9.1995 GOST 7.79_System_A that go beyond current Russian alphabet and hopefully cover all relevant Cyrillic letters for transliteration.
Comment 36 Egor Kobylkin 2018-10-06 20:54:22 UTC
https://sourceware.org/ml/libc-locales/2018-q4/msg00013.html

After some kind help from Marko in the offline discussion
I realized the multi/single character approach I originally took was
against the iconv(1) logic anyway. So there is no harm in
dropping it and adopting Marko's suggestion instead. I will do so and
will resubmit the patch with ISO 9:1995/GOST 7.79 System A + fallback to
GOST 7.79 System B (for ASCII).

However this doesn't resolve the issue for ASCII part being different
for various locales. Again, I am offering the locale maintainers to let
me know if they want to 1) adopt the one I am supplying, 2) write their
own or 3) ignore the patch altogether. Your feedback is appreciated!

This is the relevant part that helped:
> The first part (ISO-8859-15 or ASCII) defines the target encoding for
> iconv(1). //TRANSLIT is described in the iconv(1) man page as:
> 
> If the string //TRANSLIT is appended to to-encoding, characters
> being converted are transliterated when needed and possible. This
> means that when a character cannot be represented in the target
> character set, it can be approximated through one or several
> similar looking characters. Characters that are outside of the
> target character set and cannot be transliterated are replaced
> with a question mark (?) in the output.
> 
> So in the above examples, iconv(1) encounters the character U+0428
> which is not part of either of the target encodings and since
> //TRANSLIT is specified, iconv(1) tries transliteration according to
> the rules defined above, in case of ASCII U+0160 is not part of the
> target encoding so the next alternative is used.
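The selection logic described in the quote can be sketched as follows: try each alternative in order and keep the first one that the target encoding can represent. This is an illustration under stated assumptions, not glibc's implementation. The rule table is hypothetical: U+0160 as the first alternative for U+0428 comes from the quote, while "Sh" as the ASCII alternative is assumed here.

```python
# ASSUMPTION: illustrative rule; the ASCII alternative "Sh" is not from the quote.
RULES = {
    "\u0428": ["\u0160", "Sh"],  # CYRILLIC CAPITAL LETTER SHA
}


def encodable(text, encoding):
    """True if every character of text exists in the target encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def translit(text, encoding):
    """Keep encodable characters, else use the first encodable alternative."""
    out = []
    for ch in text:
        if encodable(ch, encoding):
            out.append(ch)
            continue
        for alternative in RULES.get(ch, []):
            if encodable(alternative, encoding):
                out.append(alternative)
                break
        else:
            out.append("?")  # no rule, or no alternative fits the target
    return "".join(out)


print(translit("\u0428", "iso8859_15"))  # first alternative fits
print(translit("\u0428", "ascii"))       # falls through to the second
```

With ISO-8859-15 as the target the first alternative is used because that encoding contains U+0160; with ASCII it is skipped and the next one is taken.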
Comment 37 Egor Kobylkin 2018-10-08 22:46:05 UTC
Created attachment 11298 [details]
LibreOffice Calc spreadsheet used to create the translit_cyrillic file with the transliteration

Now more graphic with Cyrillic-ISO9translit-ASCIItranscription on the same page and colored out.
Comment 38 Egor Kobylkin 2018-10-08 23:30:34 UTC
Created attachment 11299 [details]
LibreOffice Calc spreadsheet used to generate the translit_cyrillic file with the transliteration

Fixed issues identified here https://sourceware.org/ml/libc-locales/2018-q4/msg00019.html
Comment 39 Egor Kobylkin 2018-10-09 10:43:01 UTC
Created attachment 11300 [details]
LibreOffice Calc spreadsheet used to generate the translit_cyrillic file with the transliteration

Now with the "Result Export to CSV" tab that can be exported into txt and copypasted into translit_cyrillic after removing trailing spaces. If we could code the rules implemented in the formulas this could become the generator script. Worksheet "ISO 9.1995 System A GOST 9.97 System B" columns contain the actual mapping and the rest is just the logic to get to "Result Export to CSV" glibc format for translit_cyrillic.
Comment 40 Egor Kobylkin 2018-10-09 19:00:46 UTC
Created attachment 11301 [details]
LibreOffice Calc spreadsheet used to generate the translit_cyrillic file with the transliteration


Removed the "" around "<U0423><U0301>" (<U00DA>) and
"<U0443><U0301>" (<U00FA>), as they were breaking locale compilation.

It works now with
% CYRILLIC UNDEFINED
<U0423><U0301> <U00DA>;"<U0055><U0060>"
% CYRILLIC UNDEFINED
<U0443><U0301> <U00FA>;"<U0075><U0060>"
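To show how the two working entries above can be read, here is a minimal sketch that splits one line into its source sequence and its ordered alternatives. It assumes alternatives are separated by ";" and never contain ";" themselves. The helper names are hypothetical and this is not localedef's parser.

```python
import re

CODEPOINT = re.compile(r"<U([0-9A-Fa-f]{4,8})>")


def decode(field):
    """Turn '<U0055><U0060>' (optionally quoted) into the string 'U`'."""
    return "".join(chr(int(h, 16)) for h in CODEPOINT.findall(field))


def parse_entry(line):
    """Return (source_sequence, [alternative, ...]) for one table line."""
    source, targets = line.split(None, 1)
    # ASSUMPTION: ';' only ever separates alternatives.
    alternatives = [decode(alt) for alt in targets.split(";")]
    return decode(source), alternatives


src, alts = parse_entry('<U0423><U0301> <U00DA>;"<U0055><U0060>"')
print(repr(src), alts)
```

The source here is a two-code-point sequence (U+0423 followed by the combining acute accent U+0301), the first alternative is a single Latin letter, and the second is a two-character ASCII string.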
Comment 41 Egor Kobylkin 2018-10-09 19:06:56 UTC
Created attachment 11302 [details]
Transliteration table Cyrillic->Latin with fallback to ASCII

Final version in preparation for the patch.
Comment 42 Egor Kobylkin 2018-10-09 19:09:19 UTC
Created attachment 11303 [details]
the patch adding translit_cyrillic and including it into locales
Comment 43 Egor Kobylkin 2018-10-09 19:11:59 UTC
Created attachment 11304 [details]
test file for https://sourceware.org/glibc/wiki/Locales#Testing_Locales

Added the Cyrillic characters covering the Unicode range https://www.unicode.org/charts/PDF/U0400.pdf, i.e. [U0401-U04F9, U2019], but only the letters covered by ISO 9.1995, as CYRILLIC COMPLETE.
Renamed CYRILLIC to CYRILLIC RUSSIAN and added all capital letters to the text there.
Comment 44 Egor Kobylkin 2018-10-11 15:41:38 UTC
Created attachment 11316 [details]
the patch adding translit_cyrillic and including it into locales

>
> "cyrillic" -> "Cyrillic"; "latin" -> "Latin"; "ascii" -> "ASCII".
>
>> +% Inspired by ISO 9.1995 / GOST 7.79-2000.
>> +% Covers Unicode Range https://www.unicode.org/charts/PDF/U0400.pdf
>> +% i.e [U4001-U4F9, U2019] but only the letters covered by ISO 9.1995
>
> Typos:
>
> "i.e" -> "i.e.," (somebody please fix me if I'm wrong here)
> "U4001" - I guess you meant "U0401"
> "U4F9" -> "U04F9".  I think that "U4F9" is not definitely bad but
> let's be consistent.

These are all good catches. I will fix them and resubmit.

[FIXED]
Comment 45 Egor Kobylkin 2018-10-11 15:43:02 UTC
Created attachment 11317 [details]
Transliteration table Cyrillic->Latin with fallback to ASCII

Fixed typos.
Comment 46 Egor Kobylkin 2018-10-16 08:15:41 UTC
Created attachment 11334 [details]
the patch adding translit_cyrillic and including it into locales

now against the glibc 2.28 source
Comment 47 Egor Kobylkin 2018-10-16 08:45:50 UTC
Created attachment 11335 [details]
copy of localedata/bug-iconv-trans.c for cyrillic

saved as UTF-8 as opposed to the original ISO-8859-15

--- bug-iconv-trans.c	2018-10-15 11:53:51.509030034 +0000
+++ bug-iconv-trans-cyr.c	2018-10-15 11:54:02.385071250 +0000
@@ -7,8 +7,8 @@
 main (void)
 {
   iconv_t cd;
-  const char str[] = "ÄäÖöÜüß";
-  const char expected[] = "AEaeOEoeUEuess";
+  const char str[] = "CyrillicLetters_ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуу́фхцчшщъыьэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’";
+  const char expected[] = "CyrillicLetters_YODJG`YEZ`IYIJL`N`TSHK`U`DHABVGDEZHZIJKLMNOPRSTUUFHCCHSHSHHA`Y`E`YUYAabvgdezhzijklmnoprstuufhcchshshh``y`e`yuyayodjg`yez`iyijl`n`tshk`u`dhO`o`FHfhYHyhE`e`G`g`GHghGHghZH`zh`K`k`K`k`N`n`NGngP`p`O`o`C`C`T`t`UuH`h`TCZtczSH`SH`CH`ch`CH`ch`iZH`zh`CH`ch`A`a`A`a`E`e`A`a`ZH`zh`Z`z`Z`z`I`i`O`o`O`o`U`u`U`u`CH`ch`Y`y`'";
   char *inptr = (char *) str;
   size_t inlen = strlen (str) + 1;
   char outbuf[500];
@@ -23,7 +23,7 @@
       return 1;
     }
 
-  cd = iconv_open ("ANSI_X3.4-1968//TRANSLIT", "ISO-8859-1");
+  cd = iconv_open ("ANSI_X3.4-1968//TRANSLIT", "UTF-8");
   if (cd == (iconv_t) -1)
     {
       puts ("iconv_open failed");
@@ -31,7 +31,7 @@
     }
 
   n = iconv (cd, &inptr, &inlen, &outptr, &outlen);
-  if (n != 7)
+  if (n != 174)
     {
       if (n == (size_t) -1)
Comment 48 Egor Kobylkin 2018-10-17 13:53:38 UTC
Created attachment 11340 [details]
Transliteration table Cyrillic->Latin with fallback to ASCII

fixed capitalisation for the historic letters
Comment 49 Egor Kobylkin 2018-10-17 14:10:46 UTC
Created attachment 11341 [details]
the patch adding translit_cyrillic and including it into locales

updating timestamps
Comment 50 Egor Kobylkin 2018-11-15 08:55:33 UTC
Created attachment 11396 [details]
the patch adding translit_cyrillic and including it into locales

* Fixed formatting (trailing spaces etc.)
* Put commit summary in the patch file, now it is generated completely
by git format-patch
Comment 51 Egor Kobylkin 2018-11-19 11:04:27 UTC
Created attachment 11402 [details]
The patch adding translit_cyrillic and including it into locales
Comment 52 Egor Kobylkin 2018-11-19 11:06:46 UTC
Created attachment 11403 [details]
Transliteration table Cyrillic-> ASCII

Stripped System A. The file now contains only ISO 9:1995/GOST_7.79 System B.
Comment 53 Egor Kobylkin 2018-11-19 13:54:59 UTC
Created attachment 11404 [details]
LibreOffice Calc spreadsheet used to generate the translit_cyrillic file with the transliteration

with capitalisation of the transcription of capital letters and System B table for export
Comment 54 Egor Kobylkin 2018-12-08 22:31:52 UTC
Created attachment 11442 [details]
The patch adding Cyrillic translit to locale/C-translit.h.in

* Re-targeted the patch against locale/C-translit.h.in as the proper
file for the ASCII translit table.
* Correspondingly, the patch now contains only the additional
Cyrillic-ASCII strings in the format of the locale/C-translit.h.in table.
The 'include "translit_cyrillic";""' directives are not necessary in the
locale files, which are now all left intact.
* Also, the file translit_cyrillic is no longer needed and is omitted.
Comment 55 Egor Kobylkin 2018-12-08 22:34:59 UTC
Created attachment 11443 [details]
LibreOffice Calc spreadsheet used to generate the Cyrillic transliteration table

now with the rows in the format of locale/C-translit.h.in
Comment 56 Egor Kobylkin 2018-12-08 22:36:00 UTC
Comment on attachment 11403 [details]
Transliteration table Cyrillic-> ASCII

A separate file is not needed anymore.
Comment 57 Egor Kobylkin 2018-12-10 01:25:37 UTC
Changing the "component" parameter of this bug to "locale" because ASCII is the target character set for the C locale, and its transliteration table resides in locale/C-translit.h.in.
Comment 58 Egor Kobylkin 2019-01-02 18:40:37 UTC
Created attachment 11505 [details]
The patch adding Cyrillic translit to locale/C-translit.h.in
Comment 59 Egor Kobylkin 2019-05-02 21:32:32 UTC
Added as a release blocker for 2.30 at the suggestion of Siddhesh Poyarekar: https://sourceware.org/ml/libc-alpha/2019-04/msg00566.html
Comment 60 Sourceware Commits 2019-07-20 19:57:17 UTC
The master branch has been updated by Rafal Luzynski <rl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c7e4b684e77323d1ef85dcdde8a41411ebe3b581

commit c7e4b684e77323d1ef85dcdde8a41411ebe3b581
Author: Egor Kobylkin <egor@kobylkin.com>
Date:   Wed Jan 2 05:50:13 2019 +0100

    locale/C-translit.h.in: Cyrillic -> ASCII transliteration [BZ #2872]
    
    This patch adds Cyrillic to plain ASCII transliteration table according
    to GOST 7.79-2000 System B standard to the C locale.
    
    	[BZ #2872]
    	* locale/C-translit.h.in: Add Cyrillic transliteration.
Comment 61 Rafal Luzynski 2019-07-23 19:14:34 UTC
Cyrillic to plain ASCII added by commit c7e4b684e77323d1ef85dcdde8a41411ebe3b581.
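
For anyone who wants to check the result from a program, a minimal sketch of calling iconv(3) with //TRANSLIT follows. The helper name `translit_to_ascii` is hypothetical; the charset names are the ones used in the test diff in comment 47. What the Cyrillic input turns into depends on the glibc version in use, so the sketch makes no claim about the output for non-ASCII text.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <string.h>

/* Hypothetical helper: convert a UTF-8 string to ASCII using //TRANSLIT.
   Writes a NUL-terminated result into out (at most outsize bytes).
   Returns 0 on success, -1 on error with errno as set by iconv. */
int translit_to_ascii(const char *utf8, char *out, size_t outsize)
{
    iconv_t cd;
    char *inptr = (char *) utf8;   /* iconv takes char **, input is not modified */
    size_t inlen;
    char *outptr = out;
    size_t outlen;
    size_t n;
    int saved_errno;

    assert(utf8 != NULL && out != NULL && outsize > 0);
    inlen = strlen(utf8);
    outlen = outsize - 1;          /* keep room for the terminating NUL */

    cd = iconv_open("ANSI_X3.4-1968//TRANSLIT", "UTF-8");
    if (cd == (iconv_t) -1)
        return -1;

    n = iconv(cd, &inptr, &inlen, &outptr, &outlen);
    if (n != (size_t) -1)
        n = iconv(cd, NULL, NULL, &outptr, &outlen);   /* flush any pending state */

    saved_errno = errno;
    iconv_close(cd);
    errno = saved_errno;

    *outptr = '\0';
    return n == (size_t) -1 ? -1 : 0;
}
```

On a glibc that contains this commit, passing Cyrillic text such as the string from bug-iconv-trans-cyr.c should yield its GOST 7.79-2000 System B ASCII form; plain ASCII input passes through unchanged on any version.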