This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PING^8][PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]


On 6/7/19 8:59 AM, Diego (Egor) Kobylkin wrote:
On Friday, June 7, 2019 2:35 PM, Carlos O'Donell
<codonell@redhat.com> wrote:
I'd like to hear what Egor has to say about the data loss aspects.

It's quite simple really - suppose you have a list of pages on
Wikipedia.

For example, there are these two entries in Russian:

1. Шема
https://ru.wikipedia.org/w/index.php?title=%D0%A8%D0%B5%D0%BC%D0%B0&redirect=no

2. Схема
https://ru.wikipedia.org/wiki/%D0%A1%D1%85%D0%B5%D0%BC%D0%B0


So you want to scrape Wikipedia and write them out to files: Шема.txt
and Схема.txt. But the target system doesn't support a Russian locale,
so you must transliterate the filenames.


If "Ш"->"Sh" and "Сх"->"Sh", both of them will be written into the
same file "Shema.txt". With no other special handling, the first file
will be overwritten and its data lost.

If "Ш"->"SH" and "Сх"->"Sh", there will be two separate files:
1. SHema.txt 2. Shema.txt. No data loss in this case.
Agreed.
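
To make the collision above concrete, here is a minimal sketch in Python.
The two tables are hypothetical and heavily reduced; they are not the glibc
locale data from the patch, and translit() and filenames() are illustrative
helper names only. The tables differ solely in how "Ш" is mapped.

```python
# Hypothetical, heavily reduced transliteration tables for illustration.
# They are NOT the glibc locale data; only the letters needed here appear.
COMMON = {"С": "S", "с": "s", "х": "h", "е": "e", "м": "m", "а": "a"}
TITLE_CASE = {**COMMON, "Ш": "Sh"}  # "Ш" -> "Sh"
UPPER_CASE = {**COMMON, "Ш": "SH"}  # "Ш" -> "SH"


def translit(text, table):
    """Map each character through the table; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


def filenames(titles, table):
    """Build the target filename for each page title."""
    return [translit(title, table) + ".txt" for title in titles]


titles = ["Шема", "Схема"]

print(filenames(titles, TITLE_CASE))  # ['Shema.txt', 'Shema.txt']
print(filenames(titles, UPPER_CASE))  # ['SHema.txt', 'Shema.txt']
```

With the "Sh" table both titles land on the same name, so the second write
would replace the first; with the "SH" table the two names stay distinct.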
We can't exclude all data loss scenarios, but at least we shouldn't
knowingly let the most basic ones happen just because of how SHema
looks. Translit is mostly a technical field, at least in glibc, so the
aesthetics would be the last thing I would care about here.


Anyway, I'm all for committing the patch one way or another and
opening a new bug should anyone complain about Sh/SH. Until now we
have had a hard time getting any input from any outsider on this issue.
I guess de facto I am the only end user that has an opinion on this
:-)

I appreciate your input.

I expected this example; it's a classic problem with transliteration
that the conversion can result in non-unique representations.

I also think your point about "technical" is relevant here: nobody
really wants to read the transliterated results; they want to read
the original, and providing any hint about the original form has
value.

In glibc we don't have any framework for an intelligent conversion.
We would have to write specific code for this case and add it to
the translit code as special handling.

I think we should today leave "Ш"->"SH" and "Сх"->"Sh", since it's
the most conservative position that avoids ambiguity, and then we
can discuss the aesthetics of this and the other impacts and solutions.

I appreciate Rafal's position, but I think being conservative here,
even if it's not as pretty as uconv, is a good guiding idea.

--
Cheers,
Carlos.

