This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
- From: Rafal Luzynski <digitalfreak at lingonborough dot com>
- To: Egor Kobylkin <egor at kobylkin dot com>, libc-alpha at sourceware dot org, libc-locales at sourceware dot org
- Date: Fri, 16 Nov 2018 23:17:27 +0100 (CET)
- Subject: Re: [PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
- References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com> <20180412224352.GB2911@altlinux.org> <b82fe65b-b880-a2b5-c97d-2a6aae9c1165@kobylkin.com>
Thank you for working on this, Egor.
Before I start reviewing, I would like to summarize the issues that
I think block this patch.
1. I think we need tests for transliteration. Currently there is only
one test program which is similar to what we need,
localedata/bug-iconv-trans.c. It is old and it is not quite clear
what bug it is trying to test. Therefore I think we need a new
framework to test transliteration. Is it a good idea to base the
test on the iconv(1) command line utility which is part of glibc?
2. I ran a few tests on the command line, and it seems that the
transliteration from "З" to "Z" (and the lowercase equivalent) in uk_UA
does not work. It has not worked for some time: I checked some older
systems as well and the result is always the same. I think the reason is
that uk_UA defines multiple transliteration rules for "З" depending on
which letter follows it, and that does not seem to work. AFAIK the syntax
of transliteration rules says that a single non-Latin character may map to
one or more Latin strings, each consisting of one or more characters.
There cannot be a rule transliterating multiple source characters into
one or more destination characters. Is this a bug in the transliteration
implementation? Or maybe in the specification, including the POSIX standard?
The definition of transliteration says that it is a one-to-one mapping
of graphemes, where a grapheme may be one or more characters.
It does not always have to be a one-to-one mapping of characters. Should we
fix this bug first, make uk_UA transliteration work, and only then
add a generic Cyrillic transliteration? Egor's patch already contains
a transliteration of "У" + combining acute accent to "Ú", which most
probably will not work.
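
To make point 1 more concrete, here is a rough sketch of what an
iconv(1)-based check could look like. It is written in Python only for
brevity and is not a proposal for the actual glibc test framework. All
function names and the CASES table are hypothetical, and the check needs
the named locale (here uk_UA.UTF-8) to be installed. The code_points()
helper is there so that a failure report shows when a source is more than
one character, as with "У" + combining acute accent (U+0423 U+0301).

```python
import os
import subprocess


def iconv_command(to_charset="ASCII"):
    """Build the iconv(1) argument list for a transliterating conversion."""
    return ["iconv", "-f", "UTF-8", "-t", f"{to_charset}//TRANSLIT"]


def code_points(text):
    """List the code points of text, e.g. to show multi-character sources."""
    return [f"U+{ord(ch):04X}" for ch in text]


def transliterate(text, locale_name, to_charset="ASCII"):
    """Run iconv(1) under locale_name and return whatever it wrote to stdout."""
    env = dict(os.environ, LC_ALL=locale_name)
    result = subprocess.run(
        iconv_command(to_charset),
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        env=env,
        check=False,  # a failed conversion should be reported, not raised
    )
    return result.stdout.decode(to_charset, errors="replace")


# Hypothetical cases: (locale, source, expected). The only entry is the
# uk_UA mapping from point 2; more would be added per locale.
CASES = [
    ("uk_UA.UTF-8", "\u0417", "Z"),  # "З" -> "Z"
]


def run_cases(cases=CASES):
    """Return a list of failure descriptions; an empty list means success."""
    failures = []
    for locale_name, source, expected in cases:
        actual = transliterate(source, locale_name)
        if actual != expected:
            failures.append(
                f"{locale_name}: {code_points(source)} gave {actual!r}, "
                f"expected {expected!r}"
            )
    return failures
```

Calling run_cases() on a system with the locale installed would return an
empty list if the mapping works and one description per failing case
otherwise.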
I still think that in the longer term all existing custom transliterations
of Cyrillic alphabets should be ported to a modification of your patch.
Egor, while at it, I was thinking about your idea to transliterate letters
like "Ш" (uppercase) to "SH" (always uppercase) in order to distinguish
between "Шема" (-> "SHema") and "Схема" (-> "Shema" or "Sxema"). You also
include a rule to transliterate "Х" to "H" or "X" depending on which
destination characters are available. As I told you already, that will
not work: both "H" and "X" are always available, so only
the first rule will ever be used. I still don't like the idea of
putting two uppercase letters at the beginning of a titlecase word only to
indicate that there was originally a single letter. What if we:
* drop the rule transliterating "Х" to "H" and always transliterate it to
"X",
* transliterate uppercase "Ш" to "Sh" (so it will work fine for titlecase
words)?
As a result the Latin letter "h" will only appear as part of a digraph and
never as a transliteration of "Х" and therefore will never cause a conflict.
Examples:
* "Шема" -> "Shema",
* "Схема" -> "Sxema".
Will this solve the problem?
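
To illustrate, here is a toy model in Python. It is not how glibc
implements transliteration; the tables contain only the letters needed for
the two example words, and every name in it is made up for this sketch.
first_encodable() models the behaviour I described for the "H" or "X" rule
(the first candidate that fits the target charset wins), and the two
tables compare the proposal with a variant that keeps "Х" -> "H".

```python
def first_encodable(candidates, charset="ascii"):
    """Return the first candidate that the target charset can encode."""
    for candidate in candidates:
        try:
            candidate.encode(charset)
        except UnicodeEncodeError:
            continue
        return candidate
    return None


# Hypothetical excerpt of the proposed table: "Ш" -> "Sh", "Х" -> "X".
PROPOSED = {
    "Ш": "Sh", "ш": "sh",
    "Х": "X", "х": "x",
    "С": "S", "с": "s",
    "е": "e", "м": "m", "а": "a",
}

# Same excerpt, but with "Х" mapped to "H" instead of "X".
WITH_H = dict(PROPOSED, **{"Х": "H", "х": "h"})


def transliterate(word, table):
    """Map each character through the table; keep unknown characters."""
    return "".join(table.get(ch, ch) for ch in word)


if __name__ == "__main__":
    # Both "H" and "X" are ASCII, so the second candidate is never reached.
    print(first_encodable(["H", "X"]))
    for word in ("Шема", "Схема"):
        print(word, "->", transliterate(word, PROPOSED),
              "| with H:", transliterate(word, WITH_H))
```

With the PROPOSED table the two words stay distinct ("Shema" and "Sxema"),
while with WITH_H both come out as "Shema".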
Regards,
Rafal