Hello, I tried to convert some text from Cyrillic (UTF-8) to ASCII using the //translit flag. However, it fails badly: all characters are just replaced with ?. It seems to be independent of my current locale; I can set en_US.UTF-8, de_DE.UTF-8, or ru_RU.UTF-8, and it still fails. Transliteration of Latin seems to work, though:
echo Müßte асдфасфд | LANG=de_DE.UTF-8 iconv -f UTF-8 -t ASCII//translit
Muesste ????????
echo Müßte асдфасфд | LANG=fr_FR.UTF-8 iconv -f UTF-8 -t ASCII//translit
Musste ????????
I had a "discussion" with a Debian maintainer of glibc, who indicated that the problem is in the locale, which controls the transliterations. But from my POV there should be a default fallback when there is no other transliteration scheme. And I remember that it was working with glibc some months or years ago.
Works for me with the sr_CS locale:
$ echo 'Müßte Данило' | LANG=sr_CS.UTF-8 iconv -t ASCII//translit
Musste Danilo
Transliteration is locale dependent, there is no way around it:
Russian/Cyrillic: Горбачёв
German transliteration: Gorbaschow
English transliteration: Gorbatsov or Gorbatsev
If you want Cyrillic transliteration for the locale you use, provide the data.
"WORKSFORME" implies that you cannot reproduce the problem, but... does transliterating Горбачов work with English or not (see below)? If not, how can it be "resolved"? Or what are you trying to say with "provide the data"? That there is no data yet? I have seen it working some years ago, with correct US-style transliterations. If it is broken now or data has been lost, then maybe TRANSLIT should be disabled and throw an error immediately. Currently it produces crap, and AFAICS there is no proper documentation explaining why. The crap it creates is not even consistent within the same language/country pair; without the .UTF-8 suffix it produces even funnier nonsense.
echo Горбачов | LANG=de_DE.UTF-8 iconv -t ASCII//TRANSLIT
????????
echo Горбачов | LANG=de_DE.UTF-8 iconv -t ASCII//TRANSLIT
????????
echo Горбачов | LANG=de_DE iconv -t ASCII//TRANSLIT
??? 3/4 N?????N?? 3/4 ??
echo Горбачов | LANG=en_US iconv -t ASCII//TRANSLIT
iconv: illegal input sequence at position 0
It all works as designed given the data provided. If you want change, provide the data. Otherwise go away.
I would like to try to supply the data you need to make the Cyrillic transliteration work for the ru_RU locale. Could you point me to an example of the data you would need? Here is what I have tried, just to see what works.
$ echo Лаковый |LANG=ru_RU.UTF-8 iconv -t ASCII//TRANSLIT
iconv: (stdin):1:0: cannot convert
$ echo Лаковый |LANG=sr_CS.UTF-8 iconv -t ASCII//TRANSLIT
iconv: (стдул):1:0: не може претворити
$ echo Лаковый |LANG=de_DE.UTF-8 iconv -t ASCII//TRANSLIT
iconv: (Standard-Eingabe):1:0: Kann nicht umwandeln.
$ echo Müßte |LANG=de_DE.UTF-8 iconv -t ASCII//TRANSLIT
M"usste
$ echo Лаковый |LANG=en_US.UTF-8 iconv -t ASCII//TRANSLIT
iconv: (stdin):1:0: cannot convert
$ echo Müßte |LANG=en_US.UTF-8 iconv -t ASCII//TRANSLIT
M"usste
Make sure you are not using any local modifications.
$ echo Лаковый |LANG=ru_RU.UTF-8 iconv -t ASCII//TRANSLIT
???????
$ echo Лаковый |LANG=sr_CS.UTF-8 iconv -t ASCII//TRANSLIT
iconv: illegal input sequence at position 0
$ echo Лаковый |LANG=de_DE.UTF-8 iconv -t ASCII//TRANSLIT
???????
$ echo Müßte |LANG=de_DE.UTF-8 iconv -t ASCII//TRANSLIT
Muesste
$ echo Лаковый |LANG=en_US.UTF-8 iconv -t ASCII//TRANSLIT
???????
$ echo Müßte |LANG=en_US.UTF-8 iconv -t ASCII//TRANSLIT
Musste
Andreas, in your example the Cyrillic transliteration does not work either. My understanding is that the tool lacks a Cyrillic transliteration table for TRANSLIT, for example in the ru_RU locale. This is what Ulrich Drepper asks for in comment 2 here: https://sourceware.org/bugzilla/show_bug.cgi?id=2872#c2
I would like to know in which form this data should be provided. I am only concerned with Cyrillic for now; German serves as an example that the functionality works at all in at least one case.
My first submission was from Cygwin on Windows 7. While that may indeed be some effect of Cygwin (the name similarity to Cyrillic is coincidental), I have just tried the same on Ubuntu 12 with essentially the same effect.
$ echo Лаковый |LANG=ru_RU.UTF-8 iconv -t ASCII//TRANSLIT
iconv: illegal input sequence at position 0
$ echo Лаковый |LANG=sr_CS.UTF-8 iconv -t ASCII//TRANSLIT
iconv: illegal input sequence at position 0
$ echo Лаковый |LANG=de_DE.UTF-8 iconv -t ASCII//TRANSLIT
iconv: illegal input sequence at position 0
$ echo Müßte |LANG=de_DE.UTF-8 iconv -t ASCII//TRANSLIT
Miconv: illegal input sequence at position 1
$ echo Лаковый |LANG=en_US.UTF-8 iconv -t ASCII//TRANSLIT
???????
$ echo Müßte |LANG=en_US.UTF-8 iconv -t ASCII//TRANSLIT
Musste
You'll need to set up a testing environment where you can see how your changes affect iconv, and then try to come up with proper rules. The following links should help get you started (and are pretty much all the documentation there is):
https://sourceware.org/glibc/wiki/Locales
http://man7.org/linux/man-pages/man1/iconv.1.html
https://sourceware.org/bugzilla/show_bug.cgi?id=16061
https://sourceware.org/ml/libc-alpha/2015-07/msg00836.html
Created attachment 8585 [details] the LibreOffice Calc spreadsheet used to create the translit_cyrillic file
Created attachment 8586 [details] translation table for transliteration of cyrillic to ascii Single character version. Up to three characters are required to do a reversible transliteration. The table for a reversible transliteration can be created through the same spreadsheet included here.
I have read the documents linked by Marko Myllynen in comment 8. My understanding so far is that, apart from possibly required code parts that are not yet clear to me, there should be a translation table for the transliteration. Based on the man page http://man7.org/linux/man-pages/man5/locale.5.html, the official Russian GOST 7.79-2000 transliteration table http://transliteration.ru/gost-7-79-2000/ and the Unicode file http://www.unicode.org/Public/UNIDATA/UnicodeData.txt, I have created a single-character transliteration table in the form of the following list:
% CYRILLIC CAPITAL LETTER IO
<U0401> <U0059>
% CYRILLIC CAPITAL LETTER A
<U0410> <U0041>
% CYRILLIC CAPITAL LETTER BE
<U0411> <U0042>
% CYRILLIC CAPITAL LETTER VE
<U0412> <U0056>
etc.
The first Unicode value is the Cyrillic letter and the second is the corresponding ASCII symbol. The file is attached as translit_cyrillic. I wonder if it could already be useful for inclusion into the Latin-based locale files via the "include" keyword. Please let me know what you think. Specifically, my understanding is that this is the list that Ulrich Drepper was requesting. I would be grateful if somebody familiar with the logic behind the transliteration file structure could outline the missing parts in case the above is not sufficient to bootstrap the Cyrillic-to-ASCII transliteration.
I don't read Cyrillic, but technically the table looks like what would be expected. I'm CC'ing Mike Fabian, who has done the heavy lifting for bug 16061. Mike, how does this look to you? If you haven't done so already, please test your changes; the earlier mentioned wiki page and the following man pages should provide all the needed information.
http://man7.org/linux/man-pages/man1/locale.1.html
http://man7.org/linux/man-pages/man1/localedef.1.html
http://man7.org/linux/man-pages/man7/locale.7.html
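The entry format described above (a comment line carrying the Unicode character name, followed by source and target in <Uxxxx> notation) can be generated mechanically. The following is only a minimal Python sketch of that idea, not the spreadsheet actually used for the attachment; the function names `uni` and `translit_entry` are hypothetical, and it covers only the single-character case discussed in this comment.

```python
import unicodedata


def uni(text: str) -> str:
    """Render each code point as <Uxxxx>, the notation used in the table."""
    return "".join(f"<U{ord(ch):04X}>" for ch in text)


def translit_entry(source: str, target: str) -> str:
    """Return a '% NAME' comment line followed by the mapping line.

    Assumption: `source` is a single code point, as in the
    single-character table described in the comment above.
    """
    name = unicodedata.name(source)
    return f"% {name}\n{uni(source)} {uni(target)}"


if __name__ == "__main__":
    # The four sample mappings quoted in the comment above.
    samples = [("\u0401", "Y"), ("\u0410", "A"), ("\u0411", "B"), ("\u0412", "V")]
    for cyrillic, ascii_char in samples:
        print(translit_entry(cyrillic, ascii_char))
```

Taking the names from `unicodedata` rather than typing them by hand avoids transcription slips in the comment lines.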
Thank you for the feedback, Marko! I will do the testing as suggested and will supply the multi-character transliteration as well. While a single-character table would do for my purposes, it should be more practical to have the multi-character one in place.
Created attachment 8588 [details] a version that works for localedef
I have tested it with the en_GB locale by including it in the following section:
LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";"translit_cyrillic";""
translit_end
END LC_CTYPE
Let's copy en_GB to en_TR for testing purposes and generate the new locale en_TR.UTF-8 while in glibc/localedata/locales:
I18NPATH=./ localedef -f UTF-8 -i en_TR en_TR.UTF-8
Now we can test the transliteration:
$ echo Съешь ещё этих мягких французских булок, да выпей же чаю |LOCPATH=. LC_ALL=en_TR.UTF-8 LANG=en_TR.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT
S`es` esy etix mygkix francuzskix bulok, da vypej ze cay
Created attachment 8589 [details] multi-character transliteration table cyrillic->ascii with fallback to single-character
I have not tested the fallback to single-character but have included it in case somebody needs it. I am not sure how to test it. The file should ideally be included in all Latin-based locales via include in this section, as follows (example):
LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";"translit_cyrillic";""
translit_end
END LC_CTYPE
Created attachment 8590 [details] the LibreOffice Calc spreadsheet used to create the translit_cyrillic file with multi-character transliteration
Created attachment 8591 [details] multi-character transliteration table cyrillic->ascii with fallback to single-character
Correction: updated the comment in the file to reflect the new spreadsheet and the multi-character feature. I have not tested the fallback to single-character but have included it in case somebody needs it. I am not sure how to test it. The file should ideally be included in all Latin-based locales via include in this section, as follows (example):
LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";"translit_cyrillic";""
translit_end
END LC_CTYPE
(In reply to Ulrich Drepper from comment #2)
> Transliteration is locale dependent, there is no way around it:
>
> Russian/Cyrillic: Горбачёв
>
> German transliteration: Gorbaschow
>
> English transliteration: Gorbatsov or Gorbatsev
>
> If you want cyrillic transliteration for the locale you use, provide the
> data.
I want to comment on this to clarify my starting point and to ask for suggestions in case somebody decides to take on further development. For now I believe the issue is solved well, though in the most basic way. From a Russian speaker's point of view, various transliterations of Cyrillic are possible depending on the purpose. A good example of the multiplicity of such transliterations is listed here: http://transliteration.ru/
Although they use different characters to represent the Cyrillic letters, they all carry the same phonetic meaning for a Russian-speaking person, so any of them could be used for all the Latin locales. This is what I propose as a first approximation to solve this issue. My submission above takes this approach, with the GOST 7.79-2000 transliteration chosen as a basis. For a non-Russian speaker, a different transliteration may make more sense to represent their own phonetic rules. This is what Ulrich is referring to in his comment above. One could take my table as a basis and create separate transliteration tables for specific locales. The one I have proposed could then still serve as an ASCII//TRANSLIT target or be replaced by a more suitable one.
*** Bug 12031 has been marked as a duplicate of this bug. ***
*** Bug 89 has been marked as a duplicate of this bug. ***
Tested including the Greeklish transliteration https://sourceware.org/bugzilla/attachment.cgi?id=6380 from the duplicate of this bug along with the Cyrillic transliteration proposed in this bug. Copied the file translit_greeklish to glibc/localedata/locales. In en_TR2, a copy of the en_GB locale, changed this section:
LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";""
include "translit_cyrillic";""
include "translit_greeklish";""
translit_end
END LC_CTYPE
Generated the en_TR2 locale:
I18NPATH=./ localedef -f UTF-8 -i en_TR2 ../../../en_TR/en_TR2.UTF-8
echo CYRILLIC Съешь ещё этих мягких французских булок, да выпей же чаю GREEK Ελληνικό Ίδρυμα Ευρωπαϊκής και Εξωτερικής |LOCPATH=.../en_TR/ LC_ALL=en_TR2.UTF-8 LANG=en_TR2.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT
CYRILLIC S``esh` eshhyo e`tix myagkix franczuzskix bulok, da vy'pej zhe chayu GREEK Ellhniko Idryma Eyrwpaikhs kai Ekswterikhs
Test successful.
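The Cyrillic output above is consistent with a GOST 7.79-2000 System B style mapping. As a rough illustration, here is a Python sketch whose table is inferred purely from that test output, not copied from the attached translit_cyrillic file. Two points are assumptions: "ц" always maps to "cz" regardless of context, and capital letters get a fully upper-cased mapping. The names `SYSTEM_B` and `to_ascii` are hypothetical.

```python
import unicodedata

# Lowercase Russian letters -> ASCII, inferred from the test output above.
# Assumption: "ц" is always "cz"; a real table may depend on the next letter.
SYSTEM_B = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "yo",
    "ж": "zh", "з": "z", "и": "i", "й": "j", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "x", "ц": "cz", "ч": "ch", "ш": "sh", "щ": "shh",
    "ъ": "``", "ы": "y'", "ь": "`", "э": "e`", "ю": "yu", "я": "ya",
}


def to_ascii(text: str) -> str:
    """Transliterate Russian text letter by letter; other characters pass through."""
    out = []
    # NFC so that "й" and "ё" are single code points even if typed decomposed.
    for ch in unicodedata.normalize("NFC", text):
        low = ch.lower()
        mapped = SYSTEM_B.get(low)
        if mapped is None:
            out.append(ch)            # not a Russian letter: keep as is
        elif ch != low:
            out.append(mapped.upper())  # assumption: capitals map to upper case
        else:
            out.append(mapped)
    return "".join(out)


if __name__ == "__main__":
    print(to_ascii("Съешь ещё этих мягких французских булок, да выпей же чаю"))
```

Unlike iconv, this sketch ignores the target character set and the locale entirely; it only shows the letter-to-string mapping.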
Reset the bug status to "NEW", to signify it's ready for review by maintainers of the library
Looks like you're making very good progress here. According to https://sourceware.org/glibc/wiki/Locales the next step would be to check the Contribution checklist at https://sourceware.org/glibc/wiki/Contribution%20checklist and post your patches to libc-alpha + libc-locales for formal review. However, please be aware that for example Mike's translit update patch has been pending for a review for many months already [1] so having the patches included might take a while. But the first step anyway is to post them to the lists. 1) https://sourceware.org/ml/libc-alpha/2015-09/msg00190.html Thanks.
Marko, thanks for reviewing, I will proceed as you propose. In case you know, it would be great to have your advice: what is the venue for discussing getting the translit included into the default C.UTF8 locale? It is the default Cygwin locale, and AFAIK there is no way to generate one's own locale in the Cygwin environment. But neither could I regenerate C.UTF8 from the original POSIX file on my Ubuntu system to test. I get the same error messages as listed here: http://ask.debian.net/questions/how-to-generate-a-c-utf-8-locale-in-debian-squeeze
So this appears to be a blocker for generating a patch. It seems the POSIX source file for C.UTF8 is somehow broken on Ubuntu. Do I need to file another bug for that, or is that by design?
Pangrams in five languages to test the transliteration:
echo CYRILLIC Съешь ещё этих мягких французских булок, да выпей же чаю GREEK Ελληνικό Ίδρυμα Ευρωπαϊκής και Εξωτερικής GERMAN Zwölf Boxkämpfer jagen Victor quer über den großen Sylter Deich FRENCH Dès Noël où un zéphyr haï me vêt de glaçons würmiens je dîne d’exquis rôtis de bœuf au kir à l’aÿ d’âge mûr \& cætera SPANISH El veloz murciélago hindú comía feliz cardillo y kiwi, la cigüeña tocaba el saxofón detrás del palenque de paja|LOCPATH=./ LC_ALL=en_TR2.UTF-8 LANG=en_TR2.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT
And the result, so you can compare:
CYRILLIC S``esh` eshhyo e`tix myagkix franczuzskix bulok, da vy'pej zhe chayu
GREEK Ellhniko Idryma Eyrwpaikhs kai Ekswterikhs
GERMAN Zwolf Boxkampfer jagen Victor quer uber den grossen Sylter Deich
FRENCH Des Noel ou un zephyr hai me vet de glacons wurmiens je dine d'exquis rotis de boeuf au kir a l'ay d'age mur & caetera
SPANISH El veloz murcielago hindu comia feliz cardillo y kiwi, la ciguena tocaba el saxofon detras del palenque de paja
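When comparing such results by eye, two failure modes from earlier in this thread are easy to miss: characters that stayed non-ASCII and characters that were silently replaced with a question mark. The Python sketch below checks for both. It is only an illustration; `check_translit` is a hypothetical helper that works on strings you have already captured and does not call iconv itself.

```python
def check_translit(source: str, output: str) -> list:
    """Return a list of human-readable problems found in `output`.

    `source` is the text fed to the transliteration and `output`
    is what came back; an empty list means nothing suspicious.
    """
    problems = []

    # Anything above 0x7F means the output is not plain ASCII.
    non_ascii = sorted({ch for ch in output if ord(ch) > 127})
    if non_ascii:
        problems.append("non-ASCII characters left: " + "".join(non_ascii))

    # Question marks that were not in the input are likely failed conversions.
    extra = output.count("?") - source.count("?")
    if extra > 0:
        problems.append(f"{extra} character(s) replaced with '?'")

    return problems


if __name__ == "__main__":
    print(check_translit("Горбачов", "????????"))
    print(check_translit("Müßte", "Muesste"))
```

Counting question marks is a heuristic: a multi-character replacement that happens to contain "?" would be reported as well.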
Created attachment 8618 [details] test file for https://sourceware.org/glibc/wiki/Locales#Testing_Locales
(In reply to Egor Kobylkin from comment #24)
>
> Just in case you know it would be great to have your advice:
> In order to get the translit included into the default C.UTF8 locale what is
> the venue to discuss that?
I think it's best to proceed one step at a time. As said, it might take a while to have even the rules included, and then C.UTF-8 would need to be implemented upstream (see bug 17318).
> It is a default Cygwin locale and there is no way to generate an own locale
> in Cygwin environment AFAIK. But neither could I re-generate the C.UTF8 from
> the original POSIX file on my Ubuntu system to test.
>
> I get the the same error messages as listed here.
> http://ask.debian.net/questions/how-to-generate-a-c-utf-8-locale-in-debian-squeeze
>
> So this appears to be a blocker to generate a patch. It seems the POSIX
> source file for C.UTF8 is somehow broken for Ubuntu. Do I need to file
> another bug for that or is that by design?
These all sound like distribution/downstream issues which should be handled there, not in glibc upstream. Thanks.
I have submitted the patch to libc-alpha and libc-locales https://sourceware.org/ml/libc-alpha/2018-07/msg00503.html and was asked to re-submit in August 2018 to be reviewed for 2.29 inclusion https://sourceware.org/ml/libc-alpha/2018-07/msg00506.html
Created attachment 11144 [details] the patch adding translit_cyrillic and including it into locales
From this patch I have excluded locales that already mention Cyrillic or have a transliteration table for it: az_AZ, iso14651_t1_common, ky_KG, mn_MN, sr_RS, tg_TJ, tk_TM, tt_RU, uk_UA, uz_UZ, uz_UZ@cyrillic. Their maintainers are requested to make an explicit decision on whether and how to include this patch.
[BZ #2872]
* locales/translit_cyrillic: add Russian GOST 7.79-2000 transliteration table from Cyrillic to Latin.
* locales/C: add include "translit_cyrillic";"" to LC_CTYPE translit section.
* locales/aa_DJ: likewise
* locales/af_ZA: likewise
* locales/ak_GH: likewise
* locales/am_ET: likewise
* locales/ar_EG: likewise
* locales/be_BY: likewise
* locales/bem_ZM: likewise
* locales/ber_DZ: likewise
* locales/ber_MA: likewise
* locales/bg_BG: likewise
* locales/bi_VU: likewise
* locales/bn_BD: likewise
* locales/bo_CN: likewise
* locales/ca_ES: likewise
* locales/ce_RU: likewise
* locales/cs_CZ: likewise
* locales/cv_RU: likewise
* locales/cy_GB: likewise
* locales/da_DK: likewise
* locales/de_DE: likewise
* locales/dv_MV: likewise
* locales/dz_BT: likewise
* locales/el_GR: likewise
* locales/en_GB: likewise
* locales/en_NG: likewise
* locales/en_ZM: likewise
* locales/es_CU: likewise
* locales/es_ES: likewise
* locales/et_EE: likewise
* locales/fa_IR: likewise
* locales/ff_SN: likewise
* locales/fi_FI: likewise
* locales/fr_FR: likewise
* locales/ga_IE: likewise
* locales/gd_GB: likewise
* locales/gu_IN: likewise
* locales/gv_GB: likewise
* locales/he_IL: likewise
* locales/hi_IN: likewise
* locales/hif_FJ: likewise
* locales/hr_HR: likewise
* locales/ht_HT: likewise
* locales/hu_HU: likewise
* locales/hy_AM: likewise
* locales/id_ID: likewise
* locales/is_IS: likewise
* locales/it_IT: likewise
* locales/ja_JP: likewise
* locales/kk_KZ: likewise
* locales/km_KH: likewise
* locales/kn_IN: likewise
* locales/ko_KR: likewise
* locales/ks_IN: likewise
* locales/kw_GB: likewise
* locales/lb_LU: likewise
* locales/lg_UG: likewise
* locales/lij_IT: likewise
* locales/ln_CD: likewise
* locales/lo_LA: likewise
* locales/lt_LT: likewise
* locales/lv_LV: likewise
* locales/mg_MG: likewise
* locales/mhr_RU: likewise
* locales/mk_MK: likewise
* locales/ml_IN: likewise
* locales/ms_MY: likewise
* locales/mt_MT: likewise
* locales/nan_TW@latin: likewise
* locales/nb_NO: likewise
* locales/ne_NP: likewise
* locales/nhn_MX: likewise
* locales/niu_NU: likewise
* locales/niu_NZ: likewise
* locales/nl_NL: likewise
* locales/nr_ZA: likewise
* locales/oc_FR: likewise
* locales/om_KE: likewise
* locales/or_IN: likewise
* locales/os_RU: likewise
* locales/pa_IN: likewise
* locales/pa_PK: likewise
* locales/pl_PL: likewise
* locales/pt_PT: likewise
* locales/quz_PE: likewise
* locales/ro_RO: likewise
* locales/ru_RU: likewise
* locales/rw_RW: likewise
* locales/sa_IN: likewise
* locales/sd_IN: likewise
* locales/sd_IN@devanagari: likewise
* locales/sd_PK: likewise
* locales/se_NO: likewise
* locales/sgs_LT: likewise
* locales/si_LK: likewise
* locales/sk_SK: likewise
* locales/sl_SI: likewise
* locales/sm_WS: likewise
* locales/so_SO: likewise
* locales/sq_AL: likewise
* locales/ss_ZA: likewise
* locales/st_ZA: likewise
* locales/sv_SE: likewise
* locales/sw_KE: likewise
* locales/ta_IN: likewise
* locales/te_IN: likewise
* locales/th_TH: likewise
* locales/ti_ET: likewise
* locales/tn_ZA: likewise
* locales/to_TO: likewise
* locales/tpi_PG: likewise
* locales/tr_TR: likewise
* locales/ts_ZA: likewise
* locales/unm_US: likewise
* locales/ur_IN: likewise
* locales/ur_PK: likewise
* locales/ve_ZA: likewise
* locales/vi_VN: likewise
* locales/wa_BE: likewise
* locales/wo_SN: likewise
* locales/xh_ZA: likewise
* locales/yi_US: likewise
* locales/zh_CN: likewise
* locales/zu_ZA: likewise
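Applying the same one-line change to well over a hundred locale sources is the kind of edit that is easier to script than to do by hand. The Python sketch below shows one way such an edit could look on the text of a single locale file. It is an illustration under stated assumptions, not the procedure actually used for the patch: `add_translit_include` is a hypothetical helper, it only handles files that already contain a translit_start ... translit_end section, and it places the new directive just before translit_end.

```python
def add_translit_include(locale_source: str, table: str = "translit_cyrillic") -> str:
    """Insert an include directive for `table` before translit_end.

    Assumption: the locale source already has a translit section.
    The text is returned unchanged if the directive is already there.
    """
    directive = f'include "{table}";""'
    if directive in locale_source:
        return locale_source  # already included, keep the edit idempotent

    out = []
    for line in locale_source.splitlines():
        if line.strip() == "translit_end":
            out.append(directive)  # goes last, after the existing includes
        out.append(line)

    trailing_newline = "\n" if locale_source.endswith("\n") else ""
    return "\n".join(out) + trailing_newline


if __name__ == "__main__":
    sample = (
        'LC_CTYPE\ncopy "i18n"\ntranslit_start\n'
        'include "translit_combining";""\ntranslit_end\nEND LC_CTYPE\n'
    )
    print(add_translit_include(sample), end="")
```

Files without a translit section are returned unchanged, so they would still need a manual edit.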
Setting the status to New again - there is now a patch to review.
Created attachment 11289 [details] screenshot of the ISO 9:1995/GOST_7.79_System_B cyrillic transliteration table (Ru) for a quick look up. source: http://transliteration.ru/gost-7-79-2000/
Created attachment 11290 [details] the LibreOffice Calc spreadsheet used to create the translit_cyrillic file with the transliteration This version implements the ISO 9:1995/GOST_7.79 System A and System B Cyrillic transliteration table. The System B is extended from GOST_7.79 using open sources of transliteration mapping.
Created attachment 11291 [details] Transliteration table Cyrillic->Latin with fallback to ASCII Now it has System A (Latin Script) as a first option and System B (ASCII) as a second option for each entry.
Created attachment 11292 [details] Transliteration table Cyrillic->Latin with fallback to ASCII Now it has System A (Latin Script) as a first option and System B (ASCII) as a second option for each entry. Added some explanation and reference to ISO 9.1995 as a comment.
Created attachment 11293 [details] test file for https://sourceware.org/glibc/wiki/Locales#Testing_Locales Now with the characters from ISO 9.1995 GOST 7.79_System_A that go beyond current Russian alphabet and hopefully cover all relevant Cyrillic letters for transliteration.
https://sourceware.org/ml/libc-locales/2018-q4/msg00013.html
After some kind help from Marko in an offline discussion, I realized the multi/single-character approach I originally took was against the iconv(1) logic anyway. So there is no harm in dropping it and adopting Marko's suggestion instead. I will do so and will resubmit the patch with ISO 9:1995/GOST 7.79 System A plus a fallback to GOST 7.79 System B (for ASCII). However, this doesn't resolve the issue of the ASCII part being different for various locales. Again, I invite the locale maintainers to let me know whether they want to 1) adopt the one I am supplying, 2) write their own, or 3) ignore the patch altogether. Your feedback is appreciated! This is the relevant part that helped:
> The first part (ISO-8859-15 or ASCII) defines the target encoding for
> iconv(1). //TRANSLIT is described in the iconv(1) man page as:
>
> If the string //TRANSLIT is appended to to-encoding, characters
> being converted are transliterated when needed and possible. This
> means that when a character cannot be represented in the target
> character set, it can be approximated through one or several
> similar looking characters. Characters that are outside of the
> target character set and cannot be transliterated are replaced
> with a question mark (?) in the output.
>
> So in the above examples, iconv(1) encounters the character U+0428
> which is not part of either of the target encoding and since
> //TRANSLIT is specified, iconv(1) tries transliteration according to
> the rules defined above, in case of ASCII U+0160 is not part of the
> target encoding so the next alternative is used.
Created attachment 11298 [details] LibreOffice Calc spreadsheet used to create the translit_cyrillic file with the transliteration
Now more visual, with the Cyrillic, ISO 9 transliteration, and ASCII transcription on the same page and color-coded.
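The selection rule quoted above (try each alternative in order, take the first one the target encoding can represent, otherwise emit a question mark) can be mimicked in a few lines of Python. This is only a sketch of the described behaviour, not glibc's implementation: `transliterate`, `can_encode`, and `TABLE` are hypothetical names, the "SH" fallback for U+0428 is an illustrative assumption, and multi-code-point source sequences are not handled.

```python
def can_encode(text: str, encoding: str) -> bool:
    """True if every character of `text` exists in the target encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def transliterate(text: str, table: dict, encoding: str) -> str:
    """Pick, per character, the first alternative the target can represent."""
    out = []
    for ch in text:
        if can_encode(ch, encoding):
            out.append(ch)            # representable as is, nothing to do
            continue
        for alternative in table.get(ch, ()):
            if can_encode(alternative, encoding):
                out.append(alternative)
                break
        else:
            out.append("?")           # no rule, or no representable alternative
    return "".join(out)


# U+0428 -> U+0160 first, then an ASCII-only fallback (assumed to be "SH").
TABLE = {"\u0428": ["\u0160", "SH"]}

if __name__ == "__main__":
    print(transliterate("\u0428", TABLE, "iso8859-15"))  # first alternative fits
    print(transliterate("\u0428", TABLE, "ascii"))       # falls back to ASCII
```

This is why ordering the Latin-script alternative before the ASCII one serves both targets from a single entry.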
Created attachment 11299 [details] LibreOffice Calc spreadsheet used to generate the translit_cyrillic file with the transliteration Fixed issues identified here https://sourceware.org/ml/libc-locales/2018-q4/msg00019.html
Created attachment 11300 [details] LibreOffice Calc spreadsheet used to generate the translit_cyrillic file with the transliteration
Now with the "Result Export to CSV" tab, which can be exported to txt and copy-pasted into translit_cyrillic after removing trailing spaces. If we could code the rules implemented in the formulas, this could become the generator script. The columns of the worksheet "ISO 9.1995 System A GOST 9.97 System B" contain the actual mapping; the rest is just the logic to get to the glibc format for translit_cyrillic in "Result Export to CSV".
Created attachment 11301 [details] LibreOffice Calc spreadsheet used to generate the translit_cyrillic file with the transliteration
Removed the "" around "<U0423><U0301>" (<U00DA>) and "<U0443><U0301>" (<U00FA>), as they were breaking locale compilation. It works now with:
% CYRILLIC UNDEFINED
<U0423><U0301> <U00DA>;"<U0055><U0060>"
% CYRILLIC UNDEFINED
<U0443><U0301> <U00FA>;"<U0075><U0060>"
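To sanity-check entries like the two above before running localedef, it can help to decode them back into readable characters. Below is a small Python sketch that parses one mapping line into its source sequence and its list of alternatives. `parse_entry` and `decode` are hypothetical helpers written for this illustration; they assume the simple "source, space, alternatives separated by semicolons" shape shown in this comment and do not cover the full locale file grammar.

```python
import re

# Matches one <Uxxxx> code point token as used in the table entries.
_TOKEN = re.compile(r"<U([0-9A-Fa-f]{4,8})>")


def decode(sequence: str) -> str:
    """Turn '<U0055><U0060>' (quoted or not) into the characters it names."""
    return "".join(chr(int(hex_digits, 16)) for hex_digits in _TOKEN.findall(sequence))


def parse_entry(line: str):
    """Split a mapping line into (source string, [alternative strings])."""
    source, _, rest = line.strip().partition(" ")
    alternatives = [decode(alt) for alt in rest.split(";")]
    return decode(source), alternatives


if __name__ == "__main__":
    src, alts = parse_entry('<U0423><U0301> <U00DA>;"<U0055><U0060>"')
    print(repr(src), alts)
```

Printing the decoded alternatives next to the source makes a wrong code point much easier to spot than reading the hex values.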
Created attachment 11302 [details] Transliteration table Cyrillic->Latin with fallback to ASCII Final version in preparation for the patch.
Created attachment 11303 [details] the patch adding translit_cyrillic and including it into locales
Created attachment 11304 [details] test file for https://sourceware.org/glibc/wiki/Locales#Testing_Locales
Added, as CYRILLIC COMPLETE, the Cyrillic characters covering the Unicode range https://www.unicode.org/charts/PDF/U0400.pdf, i.e. [U0401-U04F9, U2019], but only the letters covered by ISO 9.1995. Renamed CYRILLIC to CYRILLIC RUSSIAN and added all capital letters to the text there.
Created attachment 11316 [details] the patch adding translit_cyrillic and including it into locales
> > "cyrillic" -> "Cyrillic"; "latin" -> "Latin"; "ascii" -> "ASCII".
>
>> +% Inspired by ISO 9.1995 / GOST 7.79-2000.
>> +% Covers Unicode Range https://www.unicode.org/charts/PDF/U0400.pdf
>> +% i.e [U4001-U4F9, U2019] but only the letters covered by ISO 9.1995
>
> Typos:
>
> "i.e" -> "i.e.," (somebody please fix me if I'm wrong here)
> "U4001" - I guess you meant "U0401"
> "U4F9" -> "U04F9". I think that "U4F9" is not definitely bad but
> let's be consistent.
These are all good catches. I will fix them and resubmit. [FIXED]
Created attachment 11317 [details] Transliteration table Cyrillic->Latin with fallback to ASCII typos
Created attachment 11334 [details] the patch adding translit_cyrillic and including it into locales now against the glibc 2.28 source
Created attachment 11335 [details] copy of localedata/bug-iconv-trans.c for cyrillic
saved as UTF-8 as opposed to the original ISO-8859-15
--- bug-iconv-trans.c 2018-10-15 11:53:51.509030034 +0000
+++ bug-iconv-trans-cyr.c 2018-10-15 11:54:02.385071250 +0000
@@ -7,8 +7,8 @@ main (void)
 {
   iconv_t cd;
-  const char str[] = "ÄäÖöÜüß";
-  const char expected[] = "AEaeOEoeUEuess";
+  const char str[] = "CyrillicLetters_ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуу́фхцчшщъыьэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’";
+  const char expected[] = "CyrillicLetters_YODJG`YEZ`IYIJL`N`TSHK`U`DHABVGDEZHZIJKLMNOPRSTUUFHCCHSHSHHA`Y`E`YUYAabvgdezhzijklmnoprstuufhcchshshh``y`e`yuyayodjg`yez`iyijl`n`tshk`u`dhO`o`FHfhYHyhE`e`G`g`GHghGHghZH`zh`K`k`K`k`N`n`NGngP`p`O`o`C`C`T`t`UuH`h`TCZtczSH`SH`CH`ch`CH`ch`iZH`zh`CH`ch`A`a`A`a`E`e`A`a`ZH`zh`Z`z`Z`z`I`i`O`o`O`o`U`u`U`u`CH`ch`Y`y`'";
   char *inptr = (char *) str;
   size_t inlen = strlen (str) + 1;
   char outbuf[500];
@@ -23,7 +23,7 @@
       return 1;
     }
-  cd = iconv_open ("ANSI_X3.4-1968//TRANSLIT", "ISO-8859-1");
+  cd = iconv_open ("ANSI_X3.4-1968//TRANSLIT", "UTF-8");
   if (cd == (iconv_t) -1)
     {
       puts ("iconv_open failed");
@@ -31,7 +31,7 @@
     }
   n = iconv (cd, &inptr, &inlen, &outptr, &outlen);
-  if (n != 7)
+  if (n != 174)
     {
       if (n == (size_t) -1)
Created attachment 11340 [details] Transliteration table Cyrillic->Latin with fallback to ASCII fixed capitalisation for the historic letters
Created attachment 11341 [details] the patch adding translit_cyrillic and including it into locales updating timestamps
Created attachment 11396 [details] the patch adding translit_cyrillic and including it into locales * Fixed formatting (trailing spaces etc.) * Put commit summary in the patch file, now it is generated completely by git format-patch
Created attachment 11402 [details] The patch adding translit_cyrillic and including it into locales
Created attachment 11403 [details] Transliteration table Cyrillic-> ASCII Stripped System A. File now only has ISO 9:1995/GOST_7.79 System B
Created attachment 11404 [details] LibreOffice Calc spreadsheet used to generate the translit_cyrillic file with the transliteration with capitalisation of the transcription of capital letters and System B table for export
Created attachment 11442 [details] The patch adding Cyrillic translit to locale/C-translit.h.in
* Re-targeted the patch against locale/C-translit.h.in as the proper file for the ASCII translit table.
* Correspondingly, the patch now only contains the additional Cyrillic-ASCII strings in the format of the locale/C-translit.h.in table. The 'include "translit_cyrillic";""' directives are not necessary in the locale files, which are now all left intact.
* Also, the file translit_cyrillic is no longer needed and is omitted.
Created attachment 11443 [details] LibreOffice Calc spreadsheet used to generate the Cyrillic transliteration table now with the rows in the format of locale/C-translit.h.in
Comment on attachment 11403 [details] Transliteration table Cyrillic-> ASCII a separate file not needed anymore
Changing the "component" parameter of this bug to "locale" because ASCII is the target character set for the C locale, and its transliteration table resides in locale/C-translit.h.in.
Created attachment 11505 [details] The patch adding Cyrillic translit to locale/C-translit.h.in
Added as release blocker for 2.30 on suggestion of Siddhesh Poyarekar https://sourceware.org/ml/libc-alpha/2019-04/msg00566.html
The master branch has been updated by Rafal Luzynski <rl@sourceware.org>:
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c7e4b684e77323d1ef85dcdde8a41411ebe3b581
commit c7e4b684e77323d1ef85dcdde8a41411ebe3b581
Author: Egor Kobylkin <egor@kobylkin.com>
Date: Wed Jan 2 05:50:13 2019 +0100
locale/C-translit.h.in: Cyrillic -> ASCII transliteration [BZ #2872]
This patch adds Cyrillic to plain ASCII transliteration table according to GOST 7.79-2000 System B standard to the C locale.
[BZ #2872]
* locale/C-translit.h.in: Add Cyrillic transliteration.
Cyrillic to plain ASCII added by commit c7e4b684e77323d1ef85dcdde8a41411ebe3b581.