This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: Help needed reviewing Cyrillic -> ASCII transliteration [BZ #2872]


I've tried to overcome my general lack of confidence in commenting on locale-related issues to provide some opinions. Take them with an appropriate dose of salt since, as I've said before, I have little experience in this area.

On 20/12/18 4:59 AM, Rafal Luzynski wrote:
* Should we take the title of the bug literally and provide the
   transliteration exclusively to plain ASCII or should we support the
   transliteration to extended Latin (with some diacritic characters,
   as per ISO 9 [2]) and support plain ASCII only as a fallback?

We currently have a patch with the ASCII fallback implemented, and I reckon implementing a Latin fallback would just be additional work that could be done in a second phase if really necessary. In that sense, I think ASCII is sufficient as a first pass.

* Should we agree for Cyrillic -> extended Latin -> ASCII even if the
   ASCII fallback does not fully conform with any existing standard?

I have no idea what standards govern this, so I have no opinion on it.

* Should we implement Cyrillic -> plain ASCII as per GOST System B [3]
   and skip extended Latin if it is impossible to handle both for standards
   technical reasons?

Sounds reasonable.

* Is the C builtin locale the correct place to put this transliteration?
   If yes, should we think about including the support of other alphabets
   as well (like extended Latin -> plain ASCII, Greek -> Latin, and so on)
   ever in future?

Yes on both counts, although this could bloat the C locale. If we are to provide additional transliterations of this sort, we probably need to provide some way to trim them.

* Should the Cyrillic transliteration work in every locale (possibly with
   few exceptions) or should we require that a locale actually using
   Cyrillic script must be used? (E.g., should it work when ru_RU is not
   installed? should it work if en_US is the only locale installed? Should
   it work when no locale is installed, even en_US?)

Would it matter if it was in the C builtin locale?

* Is it required that transliteration produces unambiguous output which
   means that two different original strings never produce the same result?
   (As a consequence, the reverse transliteration could be possible).

I don't think so. Transliterations are approximations in the end and striving for such guarantees might be overreach.
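
To make the ambiguity question concrete, here is a minimal sketch. The mapping table is a hypothetical toy, not taken from ISO 9, GOST, or the patch under review, and `translit` and `find_collisions` are illustrative names only. It shows how two different Cyrillic strings can collapse into the same ASCII output, which is what rules out a reliable reverse transliteration.

```python
# Hypothetical toy table for illustration only; it follows no particular
# standard and is not the table from the proposed glibc patch.
TABLE = {"с": "s", "х": "h", "ш": "sh", "а": "a"}


def translit(text):
    """Replace each character found in TABLE; pass everything else through."""
    return "".join(TABLE.get(ch, ch) for ch in text)


def find_collisions(strings):
    """Return (first, second, output) for inputs that share one output."""
    seen = {}
    collisions = []
    for s in strings:
        out = translit(s)
        if out in seen and seen[out] != s:
            collisions.append((seen[out], s, out))
        else:
            seen.setdefault(out, s)
    return collisions


# "сха" (three letters) and "ша" (two letters) both come out as "sha".
collisions = find_collisions(["сха", "ша"])
```

Whether such collisions matter depends on whether anyone needs to map the ASCII text back to Cyrillic; a table that avoids them has to spend distinct ASCII sequences on every source letter.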

Additionally we have a disagreement about how we should handle the case
when a single original uppercase character transliterates into a digraph
in ASCII.  Should both ASCII characters be uppercase (which is good for
all uppercase strings and also good to emphasize that the original character
was single rather than two separate characters which accidentally
transliterate into two characters making a digraph) or should only the first
ASCII character be uppercase (which is good for the titlecase words which is
common in natural texts)?  An example is "Ш" - should it be "SH" or "Sh"?
Note that "Сх" may also produce "Sh" ("S" + "h" -> "Sh").

Is that important?
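
For reference, the two casing options can be sketched as below. This is only an illustration under stated assumptions: the tables are a hypothetical subset, and the `digraph_case` parameter is an invented name, not anything in glibc or in the patch.

```python
# Hypothetical subset of a transliteration table, for illustration only.
DIGRAPHS = {"ш": "sh", "ж": "zh", "ч": "ch"}
SINGLE = {"и": "i", "к": "k", "а": "a", "с": "s", "х": "h"}


def translit(text, digraph_case="upper"):
    """Transliterate text; digraph_case selects "SH" ("upper") or "Sh" ("title")."""
    out = []
    for ch in text:
        latin = DIGRAPHS.get(ch.lower()) or SINGLE.get(ch.lower())
        if latin is None:
            out.append(ch)  # not in the table: pass through unchanged
        elif ch.isupper():
            if len(latin) > 1 and digraph_case == "title":
                out.append(latin.capitalize())  # only the first letter upper
            else:
                out.append(latin.upper())  # every letter upper
        else:
            out.append(latin)
    return "".join(out)


all_caps_upper = translit("ШИШКА", "upper")   # "SHISHKA"
all_caps_title = translit("ШИШКА", "title")   # "ShIShKA"
titlecase_upper = translit("Шишка", "upper")  # "SHishka"
titlecase_title = translit("Шишка", "title")  # "Shishka"
```

Neither fixed policy handles both inputs well: "upper" suits all-uppercase text but spoils titlecase words, and "title" does the opposite. With "upper", a lone "Ш" gives "SH" while "Сх" gives "Sh", so the two stay distinguishable; with "title" both give "Sh".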

We are lucky that some of existing glibc locales already handle
transliteration from Cyrillic to Latin, for example sr_RS and uk_UA.
Unfortunately, they follow their national standards rather than ISO or
GOST so they cannot be copied directly to ru_RU or applied universally to
all locales.

Also, taking Egor's work into account, can we add this bug to the
list of bugs that are desirable to fix in 2.29?

It's late for 2.29 (sorry, it's partly my fault for not being decisive enough about it) but please continue reviewing so that it lands first thing in 2.30. It also looks like something that can be safely backported assuming that it does not affect translations, so you could well do that for 2.29 or as far back as you like.

Siddhesh

