This is the mail archive of the
mailing list for the glibc project.
[PING^6][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
- From: Marko Myllynen <myllynen at redhat dot com>
- To: libc-alpha at sourceware dot org, libc-locales at sourceware dot org, Carlos O'Donell <carlos at redhat dot com>, Siddhesh Poyarekar <siddhesh at gotplt dot org>, Rafal Luzynski <digitalfreak at lingonborough dot com>
- Cc: Mike Fabian <mfabian at redhat dot com>, Egor Kobylkin <egor at kobylkin dot com>
- Date: Tue, 16 Apr 2019 10:15:33 +0300
- Subject: [PING^6][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
- References: <email@example.com> <20180412224352.GB2911@altlinux.org> <firstname.lastname@example.org>
- Reply-to: Marko Myllynen <myllynen at redhat dot com>
On 19/03/2019 12.39, Egor Kobylkin wrote:
> Changelog v12:
> * Adjusted to the new comment style suddenly appearing in the target
> file locale/C-translit.h.in (the original file changed on the master
> branch from /* style to # style since v11)
> * Fixed a typo for <U04BB> CYRILLIC SMALL LETTER SHHA to be mapped to
> "sh`" instead of erroneous "SH`" in v11
> Changelog v11:
> * Re-targeted the patch against locale/C-translit.h.in as the proper
> file for the ASCII translit table.
> * Correspondingly the patch now only contains the additional
> Cyrillic-ASCII strings in the format of locale/C-translit.h.in table.
> The 'include "translit_cyrillic";""' directives are not necessary in the
> locale files and they are now all left intact.
> * Also the file translit_cyrillic is no longer needed and is omitted.
> * Edited below email, commit message.
> Changelog v10:
> * Removed ISO 9:1995 GOST 7.79-2000 System A (transliteration to Latin
> with diacritics) as conflicting with System B within glibc mechanics and
> not solving BZ #2872
> * Edited below email, commit message, comment in translit_cyrillic to
> reflect System A removal
> * Removed <U0423><U0301> and <U0443><U0301> (Cyrillic U with acute,
> using composition) as composing is not covered by current glibc
> conversion mechanics
> Changelog v9:
> * Fixed formatting (trailing spaces etc.)
> * Put commit summary in the patch file, now it is generated completely
> by git format-patch
> Changelog v8:
> * Re-added missing translit_cyrillic in patch v7 (due to missing "git
> add" in the script).
> Changelog v7:
> * Generated against git://sourceware.org/git/glibc.git master with git
> * The 'include "translit_cyrillic";""' now immediately follows last
> 'include "translit_XXX";""' string (was inserted just before
> translit_end previously.)
> * Only the locales already having 'include .*translit.*;""' are patched
> (see the list for manual exclusions below, full list of included locales
> at the end of the email in the commit section.)
> * Excluded az_AZ completely to avoid circular reference from tr_TR via
> “copy "tr_TR"”.
> Changelog v6:
> * Locales removed from the patch: C and sd_PK.
> * Added locales: az_AZ and ky_KG.
> * Consistently transliterate single uppercase Cyrillic letters
> to sequences of all uppercase Latin letters in all languages (whenever
> a Cyrillic letter is transliterated to more than one Latin letter),
> for example "Ї" is now transliterated as "YI" rather than "Yi".
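
The v6 upper-case rule above can be sketched as follows. This is only an
illustration in Python, not the format of locale/C-translit.h.in: the
names LOWER_SAMPLE, build_table and translit_sample are hypothetical, and
the table holds just a handful of multi-letter entries taken from this
email.

```python
# Hypothetical sample: a few Cyrillic letters whose Latin form is longer
# than one character. Values are the ones quoted in this email.
LOWER_SAMPLE = {
    "ж": "zh",
    "ш": "sh",
    "щ": "shh",
    "я": "ya",
    "ї": "yi",
}


def build_table(lower):
    """Add upper-case entries that map to ALL upper-case Latin sequences."""
    table = dict(lower)
    for cyr, lat in lower.items():
        # "Ї" becomes "YI", not "Yi": the whole sequence is upper-cased.
        table[cyr.upper()] = lat.upper()
    return table


SAMPLE_TABLE = build_table(LOWER_SAMPLE)


def translit_sample(text):
    """Replace every character found in the table; keep the rest as-is."""
    return "".join(SAMPLE_TABLE.get(ch, ch) for ch in text)


print(translit_sample("Ї"))  # YI
```

Deriving the upper-case rows from the lower-case ones keeps the two sets
consistent by construction, which is the property the v6 change asks for.
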
> Dear locale maintainers,
> to fix glibc bug 2872 "Transliteration Cyrillic -> ASCII fails"
> (https://sourceware.org/bugzilla/show_bug.cgi?id=2872), this patch
> adds the Cyrillic transliteration rows to locale/C-translit.h.in.
> The patch is attached.
> Current bug effect:
> The glibc wiki explicitly lists this use case as a test example, and
> it currently fails on Cyrillic text:
> iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt |grep CYRILLIC
> CYRILLIC ????? ??? ???? ?????? ??????????? ?????, ?? ????? ?? ???.
> That is, it produces a string of question marks and spaces.
> This is what it should produce, and does produce once the patch is applied:
> CYRILLIC S``esh` eshhyo e`tix myagkix franczuzskix bulok, da vy'pej zhe
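
For readers without a patched glibc, the expected line above can be
reproduced with a small table-driven sketch. Assumptions: the Cyrillic
input sentence is not quoted in this email (it lives in
translit-test-input.txt), so the one used here is a reconstruction from
the expected output; the table is only the subset of letters that
sentence needs, inferred from that output; SYSTEM_B_SUBSET and
to_ascii are hypothetical names. This is not the patch itself and does
not call iconv.

```python
# Subset of a GOST 7.79-2000 System B style mapping, inferred from the
# expected output quoted in this email. Not the full table from the patch.
SYSTEM_B_SUBSET = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ё": "yo", "ж": "zh", "з": "z", "и": "i", "й": "j", "к": "k",
    "л": "l", "м": "m", "н": "n", "о": "o", "п": "p", "р": "r",
    "с": "s", "т": "t", "у": "u", "ф": "f", "х": "x", "ц": "cz",
    "ш": "sh", "щ": "shh", "ъ": "``", "ы": "y'", "ь": "`",
    "э": "e`", "я": "ya",
}

# Upper-case letters map to upper-cased Latin sequences (v6 rule).
FULL_SUBSET = dict(SYSTEM_B_SUBSET)
for cyr, lat in SYSTEM_B_SUBSET.items():
    FULL_SUBSET[cyr.upper()] = lat.upper()


def to_ascii(text):
    """Map known Cyrillic letters; pass ASCII through; '?' for the rest."""
    out = []
    for ch in text:
        if ch in FULL_SUBSET:
            out.append(FULL_SUBSET[ch])
        elif ord(ch) < 128:
            out.append(ch)
        else:
            out.append("?")  # mimics the unpatched fallback for unknowns
    return "".join(out)


# Assumed input, reconstructed from the expected output in this email.
sample = "Съешь ещё этих мягких французских булок, да выпей же"
print(to_ascii(sample))
```

With an empty table every Cyrillic letter falls through to "?", which
mirrors the row of question marks shown for the unpatched case.
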
> The root problem and the fix:
> The root problem is the missing transliteration table that I am
> supplying here.
> COMMIT MESSAGE:
> This translit_cyrillic table enables conversion (e.g. with iconv) from
> UTF-8 encoded text in the Cyrillic alphabet to ASCII//TRANSLIT text.
> Example: iconv -f UTF-8 -t ASCII//TRANSLIT will produce an
> ASCII-compatible transcription.
> While UTF-8 encoded Cyrillic text requires Cyrillic fonts, the result of
> a transliteration/transcription contains only Latin/ASCII codes yet can
> still be read by a native speaker. Among other things, it is useful for
> processing Cyrillic texts and filenames with programs or on systems
> that are not specifically prepared to work with Cyrillic, do not have
> the corresponding fonts installed, or cannot handle UTF-8.
> The patch content (mapping) is based on the ISO 9:1995 standard and its
> derivative, GOST 7.79-2000 System B, as published by the official source
> (Federal Agency on Technical Regulating and Metrology of the Russian
> Federation). Technically, an independent but mostly identical source was
> used and prepared in a spreadsheet.
> Converting Cyrillic to ASCII according to GOST 7.79-2000 System B is,
> strictly speaking, transcription (preserving phonemes), while System A
> is transliteration (preserving graphemes). There is no meaningful way
> to preserve graphemes when converting Cyrillic to ASCII, so System B
> was chosen. To be clear, System A has nothing to do with this bug, even
> though it is a transliteration. Those interested in implementing
> System A (transliteration of Cyrillic to Latin with diacritics) as a new
> feature are welcome to use the spreadsheet listed in the references
> below as a starting point.
>  This bug entry https://sourceware.org/bugzilla/show_bug.cgi?id=2872
>  GOST 7.79-2000 official source:
> http://protect.gost.ru/document.aspx?control=7&id=130715 (only
> available as a low-quality GIF)
>  http://transliteration.ru/gost-7-79-2000/ and
>  Wikipedia article on Cyrillic transliteration with Latin alphabet
>  http://man7.org/linux/man-pages/man5/locale.5.html
>  Spreadsheet for generating translit_cyrillic
>  https://sourceware.org/glibc/wiki/Locales#Testing_Locales
>  translit-test-input.txt
>  https://en.wikipedia.org/wiki/ISO_9#GOST_7.79_System_B
> Best regards,
> Egor Kobylkin