This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Help needed reviewing Cyrillic -> ASCII transliteration [BZ #2872]
- From: Siddhesh Poyarekar <siddhesh at gotplt dot org>
- To: Rafal Luzynski <digitalfreak at lingonborough dot com>, libc-alpha at sourceware dot org
- Date: Wed, 2 Jan 2019 23:35:13 +0530
- Subject: Re: Help needed reviewing Cyrillic -> ASCII transliteration [BZ #2872]
- References: <1042796605.674608.1545262199252@poczta.nazwa.pl>
I've tried to overcome my general lack of confidence in commenting on
locale-related issues to provide some opinions. Take them with an
appropriate dose of salt since, as I've said before, I have little
experience in this area.
On 20/12/18 4:59 AM, Rafal Luzynski wrote:
* Should we take the title of the bug literally and provide the
transliteration exclusively to plain ASCII or should we support the
transliteration to extended Latin (with some diacritic characters,
as per ISO 9 [2]) and support plain ASCII only as a fallback?
We currently have a patch with ASCII fallback implemented and I reckon
implementing Latin fallback would just be additional work that could be
done in a second phase if really necessary. In that sense, I'd think
ASCII is sufficient as a first pass.
* Should we agree to Cyrillic -> extended Latin -> ASCII even if the
ASCII fallback does not fully conform to any existing standard?
I have no idea what standards govern this, so I have no opinion on it.
* Should we implement Cyrillic -> plain ASCII as per GOST System B [3]
and skip extended Latin if it is impossible to handle both for standards
technical reasons?
Sounds reasonable.
* Is the C builtin locale the correct place to put this transliteration?
If yes, should we think about including the support of other alphabets
as well (like extended Latin -> plain ASCII, Greek -> Latin, and so on)
ever in future?
Yes on both counts, although this could result in bloating of the C
locale. If we are to provide additional transliteration of this sort,
we probably need to provide some way to trim it.
* Should the Cyrillic transliteration work in every locale (possibly with
a few exceptions) or should we require that a locale actually using
Cyrillic script be in use? (E.g., should it work when ru_RU is not
installed? Should it work if en_US is the only locale installed? Should
it work when no locale is installed, not even en_US?)
Would it matter if it was in the C builtin locale?
* Is it required that transliteration produces unambiguous output which
means that two different original strings never produce the same result?
(As a consequence, the reverse transliteration could be possible).
I don't think so. Transliterations are approximations in the end and
striving for such guarantees might be overreach.
Additionally, we have a disagreement about how we should handle the case
when a single original uppercase character transliterates into a digraph
in ASCII. Should both ASCII characters be uppercase (which is good for
all-uppercase strings and also emphasizes that the original was a single
character rather than two separate characters which accidentally
transliterate into two characters making a digraph) or should only the first
ASCII character be uppercase (which is good for titlecase words, which are
common in natural texts)? An example is "Ш" - should it be "SH" or "Sh"?
Note that "Сх" may also produce "Sh" ("S" + "h" -> "Sh").
Is that important?
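For reference, the two casing policies can be compared with a minimal
sketch. The three-entry table and the translit() helper below are
hypothetical, made up purely for illustration; they are not taken from
the proposed patch, ISO 9, or GOST.

```python
# Hypothetical three-letter table, for illustration only.
TABLE = {
    "\u0441": "s",   # CYRILLIC SMALL LETTER ES
    "\u0445": "h",   # CYRILLIC SMALL LETTER HA
    "\u0448": "sh",  # CYRILLIC SMALL LETTER SHA
}


def translit(text, digraph_upper):
    """Transliterate text using TABLE.

    digraph_upper=True  -> an uppercase source letter yields "SH"
    digraph_upper=False -> an uppercase source letter yields "Sh"
    """
    out = []
    for ch in text:
        latin = TABLE.get(ch.lower())
        if latin is None:
            out.append(ch)  # pass through anything outside the table
        elif ch.isupper():
            out.append(latin.upper() if digraph_upper else latin.capitalize())
        else:
            out.append(latin)
    return "".join(out)


# SHA alone, ES + small HA, ES + capital HA
for word in ("\u0428", "\u0421\u0445", "\u0421\u0425"):
    print(word, translit(word, True), translit(word, False))
```

With this toy table, the "Sh" policy maps both "Ш" and "Сх" to "Sh",
while the "SH" policy maps both "Ш" and "СХ" to "SH", so neither policy
gives unambiguous output by itself.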
We are lucky that some of the existing glibc locales already handle
transliteration from Cyrillic to Latin, for example sr_RS and uk_UA.
Unfortunately, they follow their national standards rather than ISO or
GOST, so they cannot be copied directly to ru_RU or applied universally
to all locales.
Also, taking Egor's work into account, can we include this bug in the
list of bugs desirable to fix in 2.29?
It's late for 2.29 (sorry, it's partly my fault for not being decisive
enough about it) but please continue reviewing so that it lands first
thing in 2.30. It also looks like something that can be safely
backported assuming that it does not affect translations, so you could
well do that for 2.29 or as far back as you like.
Siddhesh