[PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]

Carlos O'Donell codonell@redhat.com
Fri Jun 7 21:17:00 GMT 2019


On 6/7/19 8:59 AM, Diego (Egor) Kobylkin wrote:
> On Friday, June 7, 2019 2:35 PM, Carlos O'Donell
> <codonell@redhat.com> wrote:
>> I'd like to hear what Egor has to say about the data loss aspects.
> 
> It's quite simple really - suppose you have a list of pages in a
> wikipedia.
> 
> For example there are these two entries in Russian:
>
> 1. Шема
>    https://ru.wikipedia.org/w/index.php?title=%D0%A8%D0%B5%D0%BC%D0%B0&redirect=no
>
> 2. Схема
>    https://ru.wikipedia.org/wiki/%D0%A1%D1%85%D0%B5%D0%BC%D0%B0
> 
> 
> So you want to scrape wikipedia and write them out to files: Шема.txt and
> Схема.txt. But the target system doesn't support a Russian locale and so
> you must transliterate the filenames.
> 
> 
> If "Ш"->"Sh" and "Сх"->"Sh", both of them will be written into the
> same file "Shema.txt". With no other special handling the first file
> will be overwritten and its data lost.
> 
> If "Ш"->"SH" and "Сх"->"Sh" - there will be two separate files:
> 1. SHema.txt 2. Shema.txt. No data loss in this case.
  
Agreed.
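
To make the collision concrete, here is a minimal Python sketch of the
two mappings. The tables are hypothetical and cover only the letters in
the example above; they are not the glibc locale data, and the helper
names translit() and collisions() are made up for illustration.

```python
# Hypothetical per-character tables covering only the letters used in
# the example; these are NOT the actual glibc transliteration tables.
COMMON = {"С": "S", "х": "h", "е": "e", "м": "m", "а": "a"}
MIXED_CASE = {**COMMON, "Ш": "Sh"}  # the "prettier" digraph
UPPER_CASE = {**COMMON, "Ш": "SH"}  # the conservative digraph


def translit(text, table):
    """Replace each character via table; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


def collisions(names, table):
    """Group names by transliterated form; keep forms with several sources."""
    groups = {}
    for name in names:
        groups.setdefault(translit(name, table), []).append(name)
    return {out: srcs for out, srcs in groups.items() if len(srcs) > 1}


titles = ["Шема", "Схема"]
print(collisions(titles, MIXED_CASE))  # both titles land on "Shema"
print(collisions(titles, UPPER_CASE))  # no collision: "SHema" vs "Shema"
```

With "Ш"->"Sh" both titles end up under the single key "Shema"; with
"Ш"->"SH" the result is empty because the two outputs differ.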
  
> We can't exclude all data loss scenarios but at least shouldn't
> knowingly let the most basic ones happen just because of how SHema
> looks. Translit is mostly a technical field at least in glibc so the
> aesthetics would be the last thing I would care about here.
> 
> 
> Anyway I'm all for committing the patch this way or another and
> opening a new bug should anyone complain about Sh/SH. Until now we
> had a hard time getting any input from any outsider on this issue. I
> guess de-facto I am the only end-user that has an opinion on this
> :-)

I appreciate your input.

I expected this example; it's a classic problem with transliteration
that the conversion can result in non-unique representations.

I also think your point about "technical" is relevant here. Nobody
really wants to read the transliterated results; they want to read
the original, and providing any hint about the original form has
value.

In glibc we don't have any framework for an intelligent conversion.
We would have to write specific code to handle this case and add
it to the translit code.

I think we should today leave "Ш"->"SH" and "Сх"->"Sh", since it's
the most conservative position that avoids ambiguity, and then we
can discuss the aesthetics of this and the other impacts and solutions.

I appreciate Rafal's position, but I think being conservative here,
even if it's not as pretty as uconv, is a good guiding idea.

-- 
Cheers,
Carlos.


