This is the mail archive of the
mailing list for the glibc project.
[PING^5][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
- From: Egor Kobylkin <egor at kobylkin dot com>
- To: Marko Myllynen <myllynen at redhat dot com>, libc-alpha at sourceware dot org, libc-locales at sourceware dot org, Carlos O'Donell <carlos at redhat dot com>, Siddhesh Poyarekar <siddhesh at gotplt dot org>, Rafal Luzynski <digitalfreak at lingonborough dot com>
- Cc: Mike Fabian <mfabian at redhat dot com>
- Date: Thu, 4 Apr 2019 21:44:00 +0200
- Subject: [PING^5][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
- References: <firstname.lastname@example.org> <20180412224352.GB2911@altlinux.org> <email@example.com>
- Reply-to: Egor Kobylkin <egor at kobylkin dot com>
On 19/03/2019 12.39, Egor Kobylkin wrote:
* Adjusted to the new comment style suddenly appearing in the target
file locale/C-translit.h.in (the original file changed on the master
branch from /* style to # style since v11)
* Fixed a typo for <U04BB> CYRILLIC SMALL LETTER SHHA to be mapped to
"sh`" instead of erroneous "SH`" in v11
* Re-targeted the patch against locale/C-translit.h.in as the proper
file for the ASCII translit table.
* Correspondingly the patch now only contains the additional
Cyrillic-ASCII strings in the format of locale/C-translit.h.in table.
The 'include "translit_cyrillic";""' directives are not necessary in the
locale files and they are now all left intact.
* Also the file translit_cyrillic is no longer needed and is omitted.
* Edited below email, commit message.
* Removed ISO 9:1995 / GOST 7.79-2000 System A (transliteration to Latin
with diacritics), as it conflicts with System B within glibc mechanics
and does not solve BZ #2872
* Edited below email, commit message, comment in translit_cyrillic to
reflect System A removal
* Removed <U0423><U0301> and <U0443><U0301> (Cyrillic U with acute,
using composition) as composing is not covered by current glibc
* Fixed formatting (trailing spaces etc.)
* Put commit summary in the patch file, now it is generated completely
by git format-patch
* Re-added missing translit_cyrillic in patch v7 (due to missing "git
add" in the script).
* Generated against git://sourceware.org/git/glibc.git master with git
* The 'include "translit_cyrillic";""' now immediately follows last
'include "translit_XXX";""' string (was inserted just before
* Only the locales already having 'include .*translit.*;""' are patched
(see the list for manual exclusions below, full list of included locales
at the end of the email in the commit section.)
* Excluded az_AZ completely to avoid circular reference from tr_TR via
* Locales removed from the patch: C and sd_PK.
* Added locales: az_AZ and ky_KG.
* Consistently transliterate single uppercase Cyrillic letters
to sequences of all uppercase Latin letters in all languages (whenever
a Cyrillic letter is transliterated to more than one Latin letter),
for example "Ї" is now transliterated as "YI" rather than "Yi".
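
The uppercase rule in the last item can be illustrated with a minimal
Python sketch. This is not the glibc implementation; the table LOWER and
the helper translit_char are hypothetical names, and the table holds only
a handful of entries for illustration.

```python
# Hypothetical excerpt of a lowercase Cyrillic -> ASCII table (illustration only,
# not the table from the patch).
LOWER = {"ї": "yi", "ж": "zh", "ш": "sh", "а": "a"}


def translit_char(ch: str) -> str:
    """Transliterate one character.

    An uppercase Cyrillic letter yields an all-uppercase Latin sequence,
    so a multi-letter result is "YI", never "Yi".
    """
    if ch in LOWER:
        return LOWER[ch]
    low = ch.lower()
    if low in LOWER:
        return LOWER[low].upper()
    return ch  # anything not in the table passes through unchanged


print(translit_char("Ї"))  # YI
```

Uppercasing the whole sequence keeps an all-caps word all-caps after
conversion, at the cost of "ZHuk"-style output for a capitalised word.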
Dear locale maintainers,
to fix glibc bug 2872, "Transliteration Cyrillic -> ASCII fails", this
patch adds the Cyrillic transliteration rows to locale/C-translit.h.in.
The patch is attached.
Current bug effect:
The glibc wiki explicitly lists this use case as the test example, and
it currently fails on Cyrillic text:
iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt |grep CYRILLIC
CYRILLIC ????? ??? ???? ?????? ??????????? ?????, ?? ????? ?? ???.
That is, it produces a string of question marks and spaces.
This is what it should produce, and does produce once the patch is applied:
CYRILLIC S``esh` eshhyo e`tix myagkix franczuzskix bulok, da vy'pej zhe
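
The mapping behind that output can be sketched in Python. This is only an
illustration, not the glibc code path: the table below is inferred from
the sample output above and covers just the letters in that sentence, and
the names TABLE, translit and SAMPLE are hypothetical. The Cyrillic input
sentence is an assumption reconstructed from the output.

```python
# Lowercase Cyrillic -> ASCII pairs inferred from the sample output above.
# Only the letters that occur in the sample sentence are listed.
TABLE = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "yo",
    "ж": "zh", "з": "z", "и": "i", "й": "j", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "x", "ц": "cz", "ш": "sh", "щ": "shh", "ъ": "``",
    "ы": "y'", "ь": "`", "э": "e`", "я": "ya",
}


def translit(text: str) -> str:
    """Convert text to ASCII using TABLE; unknown non-ASCII becomes '?'."""
    out = []
    for ch in text:
        if ch in TABLE:
            out.append(TABLE[ch])
        elif ch.lower() in TABLE:
            # Uppercase letters map to all-uppercase Latin sequences.
            out.append(TABLE[ch.lower()].upper())
        elif ord(ch) < 128:
            out.append(ch)  # ASCII passes through unchanged
        else:
            out.append("?")  # same placeholder seen in the failing output
    return "".join(out)


# Assumed input: the Cyrillic sentence corresponding to the output above.
SAMPLE = "Съешь ещё этих мягких французских булок, да выпей же"
print(translit(SAMPLE))
```

A letter missing from the table falls back to "?", which is the same
symptom the unpatched conversion shows for every Cyrillic letter.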
The root problem and the fix:
The root problem is the missing transliteration table, which this patch adds.
This translit_cyrillic table enables conversion (e.g. with iconv) from a
UTF-8 encoded text based on the Cyrillic alphabet to an ASCII//TRANSLIT text.
Example: iconv -f UTF-8 -t ASCII//TRANSLIT will produce ASCII
A UTF-8 encoded Cyrillic text requires Cyrillic fonts, whereas the result
of a transliteration/transcription contains only Latin/ASCII characters
yet can still be read by a native speaker. Among other things, this is
useful for processing Cyrillic texts and filenames with programs or on
systems that are not specifically prepared to work with Cyrillic, lack
the corresponding fonts, or cannot handle UTF-8.
The patch content (mapping) is based on the ISO 9:1995 standard and its
derivative, GOST 7.79-2000 System B, from the official source (Federal
Agency on Technical Regulating and Metrology of the Russian Federation).
Technically, an independent but mostly identical source was used and
prepared in a spreadsheet.
The conversion of Cyrillic to ASCII according to GOST 7.79-2000 System B
is, strictly speaking, transcription (preserving phonemes), while
System A is transliteration (preserving graphemes). There is no
meaningful way to preserve graphemes when converting Cyrillic to ASCII,
so System B was chosen. To be clear, System A has nothing to do with
this bug, even though it is a transliteration.
Those interested in implementing System A, for transliteration of
Cyrillic to Latin with diacritics, as a new feature are welcome to use
the spreadsheet listed in the references below as a starting point.
 This bug entry https://sourceware.org/bugzilla/show_bug.cgi?id=2872
 GOST 7.79-2000 official source
http://protect.gost.ru/document.aspx?control=7&id=130715 (is only
available in low quality gif format)
 http://transliteration.ru/gost-7-79-2000/ and
 Wikipedia article on Cyrillic transliteration with Latin alphabet
 Spreadsheet for generating translit_cyrillic