This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]
7.06.2019 02:57 Carlos O'Donell <codonell@redhat.com> wrote:
>
> On 6/6/19 5:31 PM, Rafal Luzynski wrote:
> >>> Possible answers (Cyrillic -> Latin Extended -> ASCII):
> >>>
> >>> 1. "Ш" -> "Š" -> "SH"
> >>>
> >>> e.g.: "Шема" -> "Šema" -> "SHema"
> >>> "Схема" ----------> "Shema"
> >>>
> >>> 2. "Ш" -> "Š" -> "Sh"
> >>>
> >>> e.g.: "Шема" -> "Šema" -> "Shema"
> >>> "Схема" ----------> "Shema"
> >>>
> >>> Personally I don't like the answer 1. because "SHema" looks weird
> >>> to me. Egor in turn does not like the answer 2. because the output
> >>> string becomes ambiguous.
> >>>
> >>> Should we maybe have a smart algorithm which would select the title
> >>> case or the upper case of the output characters depending on the
> >>> context in the word? Note that it would not resolve the problem of
> >>> the output text being ambiguous.
> >>
> >> It seems clear that there is no one right/wrong answer but it's a matter
> >> of preference, especially the way this currently works. It might be an
> >> improvement to output (for instance) SH instead of Sh if all the other
> >> letters of a word are upper-case as well but not sure what would help
> >> with the result being unambiguous.
> >
> > I think you refer to the idea of implementing a smart algorithm which would
> > adapt the lower/upper case depending on the context but indeed it would
> > not resolve the problem of ambiguity.
> >
> > So, the smart algorithm aside, what should be the preferred transliteration
> > rule?
>
> I have a weak preference for 1. However, I would change my preference if
> someone showed me existing prior implementations that did 1 or 2.
uconv implements a smart algorithm to adjust the upper/lower case:
==================================================================
$ echo "Схема" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
Skhema
$ echo "Шема" | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
Shema
$ echo "ШЕМА" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
SHEMA
$ echo "ШЕма" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
SHEma
$ echo "Ш Ема" | uconv -f UTF-8 -t ASCII -x ru-ru_Latn/BGN
SH Yema
==================================================================
It is also easier for them because they decided that "Х" should be
transliterated to "KH" (I think this is the common convention when
transliterating to English), while ISO 9 says it should be transliterated
to "H" and GOST says it should be "X". With "KH", "Схема" becomes
"Skhema" and no longer collides with "Shema" from "Шема". We can't
implement this kind of context-dependent fallback in glibc because the
glibc transliteration algorithm is very simple.
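To make the case adjustment concrete, here is a rough sketch in Python of
a rule that would be consistent with the uconv output above. This is only
my guess inferred from those five examples, not ICU's actual
implementation: an upper-case Cyrillic letter becomes title case ("Sh")
when the next character is lower case, and all caps ("SH") otherwise. The
names TABLE and translit are made up for illustration, and the table
covers only the letters used in the examples.

```python
# Hypothetical mapping table, limited to the letters in the examples.
# "х" -> "kh" follows the BGN-style choice seen in the uconv output.
TABLE = {
    "с": "s", "х": "kh", "ш": "sh", "е": "e", "м": "m", "а": "a",
}


def translit(text: str) -> str:
    """Transliterate text, adjusting the case of multi-letter outputs."""
    out = []
    for i, ch in enumerate(text):
        latin = TABLE.get(ch.lower())
        if latin is None:
            out.append(ch)  # pass through anything not in the table
            continue
        if ch.isupper():
            nxt = text[i + 1] if i + 1 < len(text) else ""
            # Assumed rule: title case before a lower-case letter,
            # all caps in every other context (upper case, space, end).
            latin = latin.capitalize() if nxt.islower() else latin.upper()
        out.append(latin)
    return "".join(out)


for word in ("Схема", "Шема", "ШЕМА", "ШЕма"):
    print(word, "->", translit(word))
```

Note that the rule needs to look at the character following the one
being converted, which a plain one-to-one replacement table cannot do.
The sketch also leaves out the BGN rule that turns a word-initial "Е"
into "Ye", so it would not reproduce "SH Yema" for the last uconv example.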
Regards,
Rafal