This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872] ping for 2.30
- From: Rafal Luzynski <digitalfreak at lingonborough dot com>
- To: Marko Myllynen <myllynen at redhat dot com>, Egor Kobylkin <egor at kobylkin dot com>, libc-alpha at sourceware dot org, libc-locales at sourceware dot org, Carlos O'Donell <carlos at redhat dot com>
- Cc: Siddhesh Poyarekar <siddhesh at gotplt dot org>, Mike Fabian <mfabian at redhat dot com>
- Date: Sat, 20 Apr 2019 00:24:21 +0200 (CEST)
- Subject: Re: [PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872] ping for 2.30
- References: <firstname.lastname@example.org> <20180412224352.GB2911@altlinux.org> <email@example.com> <firstname.lastname@example.org> <email@example.com> <firstname.lastname@example.org> <email@example.com> <firstname.lastname@example.org> <email@example.com> <firstname.lastname@example.org>
Thank you, Siddhesh and Carlos, for your involvement in testing this
patch. I apologize to Egor, Marko, and everyone else who needs this
patch pushed for my own poor involvement. I'd like to reply to this
email from Marko because it summarizes all the issues. I also hope to
explain the problems that got me stuck.
14.02.2019 17:48 Marko Myllynen <email@example.com> wrote:
> 1) Built-in C locale doesn't read/use any translit_* files and it can't
> have any fallback mechanisms and it only supports ASCII so using GOST
> 7.79 System B in locale/C-translit.h.in (as per patch v12) would seem to
> be the appropriate way to implement Cyrillic transliteration for the
> built-in C locale (it adds some 8KB to the binary).
This sounds like a good idea.
Also, the C locale is probably a good way to enforce plain ASCII
transliteration without any fallback.
> 2) Other locales read/use translit_* files and with them fallbacks and
> non-ASCII are possible so it would seem preferable to first try ISO 9 /
> GOST 7.79 System A
OK, we agree here.
> and only if that fails then use GOST 7.79 System B
> (in which case the end result should match with the built-in C locale).
This is impossible because of the following case. System A
transliterates the Cyrillic "Х" to Latin "H", while System B
transliterates it to Latin "X". Transliteration as implemented in glibc
supports only a simple per-character fallback (using "X", "Q", "YY",
"ZZ" and "RR" as placeholders): transliterate the letter "X" to "YY",
but if "YY" is not available, then to "ZZ". It cannot support the more
complex rule which we need here: transliterate "X" to "YY", but if "Q"
cannot be transliterated to "RR", then transliterate "X" to "ZZ".
In our case we would like to transliterate Cyrillic "Х" to Latin "X" if
"Ш" cannot be transliterated to "Š". The only thing we can implement is
a fallback transliteration which is similar to System B but not 100%
compatible with it.
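
The per-character fallback described above can be sketched roughly as
follows. This is only an illustration in Python, not glibc's locale
syntax or code; the CANDIDATES table and the function names are made up
for the example, and the table covers just a few lowercase letters.

```python
# Illustration only: CANDIDATES and transliterate() are hypothetical,
# not glibc's data or API. Each letter lists its candidates in order
# of preference, roughly the way a translit_* entry lists fallbacks.
CANDIDATES = {
    "х": ["h"],          # System A; "h" is always encodable
    "ш": ["š", "sh"],    # System A first, then a System B-like fallback
    "с": ["s"],
    "е": ["e"],
    "м": ["m"],
    "а": ["a"],
}


def encodable(text, charset):
    """Return True if text can be represented in the target charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def transliterate(text, charset):
    """Pick, for each character on its own, the first encodable candidate."""
    out = []
    for ch in text:
        options = CANDIDATES.get(ch, [ch])
        out.append(next((c for c in options if encodable(c, charset)), "?"))
    return "".join(out)


# "ш" falls back to "sh", but "х" stays "h" because its own first
# candidate is encodable; System B would need "xsh" here.
print(transliterate("хш", "ascii"))   # prints: hsh
```

Each letter is decided in isolation, so nothing in the lookup for "х"
can see that "ш" had to fall back. With the same table, "схема" and
"шема" both come out as "shema" in ASCII.
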
This problem does not arise if we implement only System B in the C
locale, because we already know that "Š" is unavailable, so we always
have to transliterate "Х" to "X".
> For this the translit_cyrillic file should be added (as per patch v9 +
> changes mentioned in patches v10 and v12).
> 3) Individual locale files can then be updated to use translit_cyrillic
> as appropriate (see patch v9) and language/national specific conventions
> (e.g., SFS 4900 for fi_FI) can be applied on per-locale basis.
Sometimes I wonder whether any locale other than one for a language
which uses the Cyrillic script would really want a Cyrillic
transliteration, but on the other hand, why not?
I'd also like to reiterate the other disagreements which we have here:
1. How to handle upper/lower case in System B? Should we transliterate
"Ш" to "SH" or "Sh"? Should we maybe first implement a smart,
context-based casing algorithm? I mean an algorithm which would detect
whether an uppercase letter is the first letter of an otherwise
lowercase word, and so should be transliterated as "Sh", or is part of
a fully uppercase word, and so should be transliterated as "SH".
I think that uconv implements this algorithm.
2. How to handle ambiguous transliterations like "Схема" -> "Shema"
vs. "Шема" -> "Shema"? "SHema"?
3. How to handle characters which are proper letters in Cyrillic and
have upper and lower case, like the hard and soft signs, but are
transliterated to punctuation characters (grave accent "`")?
Should we transliterate upper and lower case to the same character,
or should we mark them somehow? uconv adds a Unicode combining low
line to the grave accent (so the output is "`̲") if the original
Cyrillic character was uppercase. But this is unavailable if
our target charset is ASCII.
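
To make point 1 more concrete, here is a minimal sketch of what such a
context-based casing rule could look like. It is written in Python for
illustration only; the LOWER table and translit_cased() are hypothetical
names, the table is a tiny excerpt of System B-style mappings, and this
is not a description of what uconv actually does.

```python
# Illustration only: LOWER and translit_cased() are hypothetical names,
# and the table is a small excerpt of System B-style mappings.
LOWER = {"ш": "sh", "а": "a", "р": "r", "м": "m"}


def translit_cased(text):
    """Transliterate, choosing "Sh" or "SH" from the neighbouring letters."""
    out = []
    for i, ch in enumerate(text):
        low = LOWER.get(ch.lower())
        if low is None:
            out.append(ch)            # not in the table: pass through
        elif not ch.isupper():
            out.append(low)
        else:
            nxt = text[i + 1] if i + 1 < len(text) else ""
            prv = text[i - 1] if i > 0 else ""
            # All-caps context: the next character is uppercase, or there
            # is no following letter and the previous one is uppercase.
            if nxt.isupper() or (not nxt.isalpha() and prv.isupper()):
                out.append(low.upper())
            else:
                out.append(low.capitalize())
    return "".join(out)


print(translit_cased("Шар"), translit_cased("ШАР"), translit_cased("МАШ"))
# prints: Shar SHAR MASH
```

A lone uppercase "Ш" has no context at all; this sketch arbitrarily
picks "Sh" for it, which is exactly the kind of decision we would have
to agree on.
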
Regarding the test cases which I mentioned the other day: I discussed
this with Dmitry, and he convinced me that requiring them sets the bar
too high, so I agree that we no longer need to require them.