This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH v8] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]


Moving everybody from To: and CC: to BCC. At this stage it seems to be
just Rafal and me. This is still going to libc-alpha and libc-locales.
If you would like to be put back on CC, please let me know.

On 02.11.18 23:22, Rafal Luzynski wrote:
>> * Consistently transliterate single uppercase Cyrillic letters to 
>> sequences of all uppercase Latin letters in all languages
>> (whenever a Cyrillic letter is transliterated to more than one
>> Latin letter), for example "Ї" is now transliterated as "YI" rather
>> than "Yi".
> 
> I think you have not yet explained whether this is required by any
> existing standard (please provide links) or whether this is your
> genuine idea to distinguish between the cases like "Ш" transliterated
> to "Sh" and "Сх" also transliterated to "Sh".

I remember seeing this form of capitalization in actual transliterated
texts a long time ago, but I can't find a formal description of it right
now. I just don't want to claim this as my original idea.

>> The choice for YO, SH, YA, ZH etc. is to avoid naming collisions for
>> example for "Сх" and "Ш" that would both transliterate to Sh:
>> With SH:"Схема"->"Shema" but "Шема"->"SHema"
>> With Sh:"Схема"->"Shema" and "Шема"->"Shema". Collision!
>> This is important e.g. for renaming files, grouping as in using uniq
>> etc.

As for the users - I am a user and I have demonstrated the use cases
where the collisions due to "one symbol capitalization" would cause
irreversible damage to data. For a library like glibc this seems like a
relevant issue to consider.
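
A minimal sketch of the collision described above. It assumes a toy
mapping that covers only the letters in the two sample words; the
translit function and its sed rules are hypothetical and are not the
glibc locale table.

```shell
# Toy transliteration for the letters in "Схема" and "Шема" only.
# $1 is how "Ш" is rendered ("SH" or "Sh"), $2 is the word.
translit() {
    printf '%s\n' "$2" |
        sed -e "s/Ш/$1/g" \
            -e 's/С/S/g; s/х/h/g; s/е/e/g; s/м/m/g; s/а/a/g'
}

for form in SH Sh; do
    # Count how many distinct names survive transliteration.
    distinct=$(for w in Схема Шема; do translit "$form" "$w"; done |
        LC_ALL=C sort -u | wc -l)
    echo "Ш -> $form: $((distinct)) distinct name(s) from 2 inputs"
done
```

With "SH" both inputs stay distinct ("Shema" and "SHema"); with "Sh"
they collapse into a single "Shema", which is the case where a rename or
a uniq pass would silently lose one of them.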

The "two symbol capitalization", on the other hand, would prevent
collisions and can easily be corrected in userspace if needed,
with something like

foo="SHema"
# keep the first character, lowercase the rest (bash syntax)
foo="${foo:0:1}$(tr '[:upper:]' '[:lower:]' <<<"${foo:1}")"
echo "$foo"
Shema
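
The same correction could be applied to every word of a line with awk.
This is only a sketch: recap_words is a hypothetical helper name, and
the output is assumed to be plain ASCII. It also lowercases words that
were all uppercase on purpose, and it collapses runs of whitespace
between words to single spaces.

```shell
# Keep the first character of each whitespace-separated word and
# lowercase the remainder of that word.
recap_words() {
    awk '{
        for (i = 1; i <= NF; i++)
            $i = substr($i, 1, 1) tolower(substr($i, 2))
        print
    }'
}

printf '%s\n' "SHema SHkola" | recap_words   # prints: Shema Shkola
```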

It looks like everyone who really uses transliteration for something
sensitive has already been doing it in userspace, at least since 2006
when this bug was first logged. So we won't break the official use cases
where the capitalization has to be done in a certain way. But we will
prevent new bugs due to collisions if we use "two symbol capitalization".

Happy to hear arguments to the contrary.

Bests,
Egor Kobylkin

