[PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]

Rafal Luzynski digitalfreak@lingonborough.com
Fri Jun 7 11:11:00 GMT 2019


7.06.2019 11:46 "Diego (Egor) Kobylkin" <egor@kobylkin.com> wrote:
> 
> Hi Carlos et al. 
> 
> On Friday, June 7, 2019 2:57 AM, Carlos O'Donell <codonell@redhat.com>
> wrote:
> > I have a weak preference for 1. However, I would change my preference if
> > someone showed me existing prior implementations that did 1 or 2.
> 
> 1. glibc already translits letters and ligatures capitalized in
> locale/C-translit.h.in:
> "\x00c6"	"AE"	# <U00C6> LATIN CAPITAL LETTER AE
> "\x0132"	"IJ"	# <U0132> LATIN CAPITAL LIGATURE IJ

I now lean toward thinking that this is wrong, because we don't have
a smart algorithm which would adjust the upper/lower case of
the transliterated letters.  I am not criticizing this particular
transliteration rule; any rule here would be wrong and incomplete
(e.g., "\x00c6" -> "Ae" could be good in some cases but also wrong
in many others).

As a real-life example, please correct me if I'm wrong, but AFAIK
in German the umlaut letters like "Ö" are sometimes written
(transliterated) as "OE", but when they appear as the first letter
of a titlecased word they are transliterated as "Oe", not as "OE"
(e.g., "Österreich" -> "Oesterreich" but not "OEsterreich").
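
To make the case problem concrete, here is a minimal Python sketch of
a context-sensitive expansion.  The table EXPANSIONS and the function
translit_case_aware are hypothetical names made up for illustration;
this is not how glibc handles its translit tables, and the heuristic
(look at the next character) is only an assumption about what a
"smart" rule might do.

```python
# Hypothetical table for illustration only; not glibc's actual data.
EXPANSIONS = {"Ä": "AE", "Ö": "OE", "Ü": "UE", "Æ": "AE"}

def translit_case_aware(word):
    """Expand capital letters, titlecasing the expansion when the
    following character is lowercase (an assumed heuristic)."""
    out = []
    for i, ch in enumerate(word):
        exp = EXPANSIONS.get(ch)
        if exp is None:
            # Not in the table: copy the character through unchanged.
            out.append(ch)
            continue
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if nxt.islower():
            # "Ö" followed by a lowercase letter -> "Oe", not "OE".
            exp = exp[0] + exp[1:].lower()
        out.append(exp)
    return "".join(out)

print(translit_case_aware("Österreich"))  # Oesterreich
print(translit_case_aware("ÖSTERREICH"))  # OESTERREICH
```

Note that a one-to-one table lookup cannot do this: the rule needs to
see the neighbouring character, and for an isolated "Ö" there is no
context at all, so the sketch falls back to "OE".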

> 2. I would just like to quote myself from 2018: 
> 
> 
> "collisions due to "one symbol capitalization" would cause irreversible
> damage to data. 
> 
> For a library like glibc this seems to be a very relevant issue to
> consider..."
> [...]

Could you please elaborate on why it is so important to ensure that
the output data is never ambiguous, and what damage to data such
ambiguity would cause?  OK, you mentioned the case of renaming files.
I believe that a perfect collision-free algorithm is impossible.
A simple example where it would never work is when you have two files
in the same directory: one with a name written in Cyrillic and another
with a name written in Latin which is exactly the output of the
transliteration algorithm for the first one.
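
A minimal Python sketch of that situation follows.  The three-letter
TABLE and the helpers translit and find_collisions are hypothetical and
deliberately tiny; they are not taken from the patch under review.

```python
# Hypothetical, deliberately tiny Cyrillic -> ASCII table.
TABLE = {"м": "m", "и": "i", "р": "r"}

def translit(name):
    """Map each character through TABLE; leave everything else as is."""
    return "".join(TABLE.get(ch, ch) for ch in name)

def find_collisions(names):
    """Return (first_name, later_name, shared_output) for every name
    whose transliteration was already produced by an earlier name."""
    seen = {}
    collisions = []
    for name in names:
        key = translit(name)
        if key in seen:
            collisions.append((seen[key], name, key))
        else:
            seen[key] = name
    return collisions

# One Cyrillic name and one Latin name in the same directory.
print(find_collisions(["мир.txt", "mir.txt"]))
```

The Latin name passes through unchanged, so it always collides with
the transliterated Cyrillic one, no matter how carefully the table
itself avoids many-to-one mappings.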

Another question: why do you need to transliterate the file names
at all?  Wouldn't it work perfectly for you if they were left
untouched?  My guess is that it might be useful when
using files on some older systems which do not support Unicode.

Perhaps we should consider who should use transliteration at all,
and why.  What comes to my mind is:

1. Countries (languages) which use two writing systems and want
   to have an automatic transliteration of the text. Examples:
   Serbian, Kazakh.
2. Countries (languages) which use non-Latin script but want to
   automatically provide some readable content for foreign visitors.
3. Backward compatibility with some older computer devices which
   are unable to handle Unicode.

Now we can think about what the requirements of these target
groups are and whether we can provide a solution which would work for
all of them.

Regards,

Rafal


