This is the mail archive of the mailing list for the glibc project.


Re: [PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]

Thank you for working on this, Egor.

Before I start reviewing, I would like to summarize the issues which
I think block this patch.

1. I think we need tests for transliteration.  Currently there is only
   one test program which is similar to what we need,
   localedata/bug-iconv-trans.c.  It is old and it is not quite clear
   what bug it is trying to test.  Therefore I think we need a new
   framework to test transliteration.  Is it a good idea to base the
   test on the iconv(1) command line utility which is part of glibc?
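
To make the idea concrete, here is a rough sketch of a table-driven check
built around iconv(1), written in Python only for brevity.  The helper names
(iconv_translit, run_cases) and the case table are hypothetical and not part
of glibc.  It assumes a glibc iconv binary on PATH that accepts the
//TRANSLIT suffix and that the named locale is installed.  A real test in
localedata/ would more likely be a shell script or a C program.

```python
import os
import subprocess


def iconv_translit(text, locale, to_charset="ASCII"):
    """Transliterate text by running iconv(1) under the given locale.

    Assumption: glibc iconv, which understands the //TRANSLIT suffix.
    The exit status is deliberately ignored so that a failed conversion
    shows up as an output mismatch in run_cases().
    """
    env = dict(os.environ, LC_ALL=locale)
    result = subprocess.run(
        ["iconv", "-f", "UTF-8", "-t", to_charset + "//TRANSLIT"],
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        env=env,
    )
    return result.stdout.decode(to_charset, errors="replace")


def run_cases(cases, convert=iconv_translit):
    """Return (locale, source, expected, actual) for every failing case."""
    failures = []
    for locale, source, expected in cases:
        actual = convert(source, locale)
        if actual != expected:
            failures.append((locale, source, expected, actual))
    return failures


# Hypothetical case table: the expectations are what the review hopes for,
# not verified output.
CASES = [
    ("uk_UA.UTF-8", "З", "Z"),
    ("uk_UA.UTF-8", "з", "z"),
]
```

Calling run_cases(CASES) would return an empty list if every case passes.
The convert parameter lets the table logic be exercised without a real
iconv binary.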

2. I ran a few tests on the command line, and it seems to me that the
   transliteration from "З" to "Z" (and the lowercase pair) in uk_UA does
   not work.  It has not been working for some time already: I checked
   some older systems as well and the result is always the same.  I think
   the reason is that uk_UA defines multiple transliteration rules for
   "З", depending on the letter that follows it, and these rules do not
   seem to work.  AFAIK the syntax of transliteration rules says that a
   single non-Latin character may map to one or more Latin strings, each
   consisting of one or more characters.  There cannot be a rule
   transliterating multiple source characters into one or more
   destination characters.  Is this a bug in the transliteration
   implementation?  Or maybe in the specification, including the POSIX
   standard?  The definition of transliteration says that it is a
   one-to-one mapping of graphemes, and a grapheme may consist of one or
   more characters, so it does not always have to be a one-to-one mapping
   of characters.  Should we fix this bug first, make the uk_UA
   transliteration work, and only then add a generic Cyrillic
   transliteration?  Egor's patch already contains a transliteration of
   "У" + combining acute accent to "Ú", which most likely will not work.
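
The difference between the two readings of the rule syntax can be shown with
a toy model.  This is only an illustration, not how glibc implements
transliteration, and the rule table is hypothetical (the actual uk_UA rules
may differ).  A per-character lookup can never reach a rule whose source is
a multi-character sequence, while a longest-match lookup can.

```python
# Hypothetical rule table with one multi-character source sequence.
RULES = {
    "зг": "zgh",  # two source characters -> three destination characters
    "з": "z",
    "г": "h",
}


def translit_per_char(text, rules):
    """Look up each character on its own; multi-character keys never match."""
    return "".join(rules.get(ch, ch) for ch in text)


def translit_longest_match(text, rules):
    """At each position, try the longest source sequence first."""
    max_len = max(len(key) for key in rules)
    out = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in rules:
                out.append(rules[chunk])
                i += length
                break
        else:
            # No rule applies: copy the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(translit_per_char("зг", RULES))       # zh
print(translit_longest_match("зг", RULES))  # zgh
```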

I still think that in the longer term all existing custom transliterations
of Cyrillic alphabets should be ported to a modification of your patch.

Egor, while at it, I was thinking about your idea to transliterate letters
like "Ш" (uppercase) to "SH" (always uppercase) in order to distinguish
between "Шема" (-> "SHema") and "Схема" (-> "Shema" or "Sxema").  You also
include a rule to transliterate "Х" to "H" or "X" depending on which
destination characters are available.  As I told you already, that will
not work: both "H" and "X" are always available, so only the first rule
will ever be used.  I still don't like the idea of putting two uppercase
letters at the beginning of a titlecase word only to indicate that there
was originally a single letter.  What if we:

* drop the rule of transliterating "Х" to "H" and transliterate always to
  "X",
* transliterate uppercase "Ш" to "Sh" (so it will work fine for titlecase)

As a result the Latin letter "h" will only appear as part of a digraph and
never as a transliteration of "Х" and therefore will never cause a conflict.

* "Шема" -> "Shema",
* "Схема" -> "Sxema".

Will this solve the problem?
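
A small sketch of both points, again as a toy model in Python rather than
actual locale syntax.  The helpers pick_candidate and transliterate and the
PROPOSED table are hypothetical and cover only the letters of the two
example words.  The first function models "take the first candidate that is
representable in the target charset", the second applies the proposed
mapping.

```python
def pick_candidate(candidates, charset="ascii"):
    """Return the first candidate that can be encoded in the target charset."""
    for candidate in candidates:
        try:
            candidate.encode(charset)
        except UnicodeEncodeError:
            continue
        return candidate
    return None


# Proposed mapping, restricted to the letters needed for the examples.
PROPOSED = {
    "Ш": "Sh", "ш": "sh",
    "Х": "X", "х": "x",
    "С": "S", "с": "s",
    "е": "e", "м": "m", "а": "a",
}


def transliterate(word, table):
    """Map each character through the table, copying unknown ones through."""
    return "".join(table.get(ch, ch) for ch in word)


# "H" and "X" are both ASCII, so the second candidate is never reached.
print(pick_candidate(["H", "X"]))        # H

print(transliterate("Шема", PROPOSED))   # Shema
print(transliterate("Схема", PROPOSED))  # Sxema

# With "х" -> "h" instead, the two words collide.
WITH_H = {**PROPOSED, "Х": "H", "х": "h"}
print(transliterate("Схема", WITH_H))    # Shema
```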


