This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29

From: Egor Kobylkin <egor at kobylkin dot com>
To: Marko Myllynen <myllynen at redhat dot com>, Rafal Luzynski <digitalfreak at lingonborough dot com>, Keld Simonsen <keld at keldix dot com>
Cc: libc-alpha at sourceware dot org, libc-locales at sourceware dot org, "Dmitry V. Levin" <ldv at altlinux dot org>, Volodymyr Lisivka <vlisivka at gmail dot com>, Carlos O'Donell <carlos at redhat dot com>, Max Kutny <mkutny at gmail dot com>, danilo at gnome dot org
Date: Fri, 5 Oct 2018 14:00:02 +0200
Subject: Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com> <20180412224352.GB2911@altlinux.org> <16e785f3-2e9f-ceb2-698f-dc33c91a5d5e@kobylkin.com> <ac4c9b3e-aeae-30de-23ef-24d8f53d7bc4@kobylkin.com> <20181003091949.GA21486@rap.rap.dk> <21d872b2-613e-d1f5-26c0-baa4b5721df9@kobylkin.com> <1485772360.805333.1538731225156@poczta.nazwa.pl> <deacdf31-d0bb-a92d-1de3-934d6b4cb158@kobylkin.com> <bda2ca60-18f1-3b19-91e5-c9ad144bc834@redhat.com>

Hi Marko,

I have chosen the System B because it is ASCII compartible. System A is
not ASCII compartible (diacritics in target).

https://en.wikipedia.org/wiki/ISO_9#ISO_9:1995,_or_GOST_7.79_System_A
"GOST 7.79 contains two transliteration tables.

System A
    one Cyrillic character to one Latin character, some with diacritics
– identical to ISO 9:1995

System B
    one Cyrillic character to one or many Latin characters without
diacritics
"
Hope this helps,
Egor

On 05.10.2018 13:54, Marko Myllynen wrote:
> Hi,
> 
> Would it make sense to first use ISO 9:1995/GOST 7.79 System A if
> possible and if not, then fall back to GOST 7.79 System B?
> 
> Implementation-wise current translit_* files have few examples where a
> non-ASCII transliteration is tried first before an ASCII fallback. These
> examples are from translit_neutral:
> 
> % NARROW NO-BREAK SPACE
> <U202F> <U00A0>;<U0020>
> % REVERSED TRIPLE PRIME
> <U2037> "<U2035><U2035><U2035>";"<U0060><U0060><U0060>"
> 
> Thanks,
> 
> On 2018-10-05 13:29, Egor Kobylkin wrote:
>> Keld,Marko,Rafal, other locale maintainers,
>>
>> this all is written with having in mind a minimal viable fix for this
>> bug asap. I want to avoid wasting maintainers time getting into
>> fundamental discussions here (although for perfectly good reasons).
>>
>> I see three options:
>> 1. those locale maintainers that are fine with using ISO
>> 9:1995/GOST_7.79_System_B cyrillic transliteration table (Ru) include it
>> in their locales (see attached screenshot of the table).
>> 2. those that that want to have a differing table can create their own
>> variety based on the spreadsheet I have prepared
>> https://sourceware.org/bugzilla/attachment.cgi?id=8590 and include it in
>> this patch.
>> 3. those that want to omit a cyrillic transliteration altogether for now
>> state so and just carry over the bug #2872 from the year 2006.
>>
>> Does this make sense to you?
>>
>> Just to be super clear on this: the patch is a stopgap _ASCII_
>> transliteration table. ASCII being AMERICAN Standard Code for
>> Information Interchange, that is obviously orthogonal to any
>> transliteration rule of other countries. As such it is not explicitly
>> targeting transliteration standards of any country.
>>
>> The fact that the patch is reflecting Russian variety of ISO
>> 9:1995/GOST_7.79_System_B is because a) ISO 9:1995/GOST_7.79_System_B is
>> available and can be helpful to a majority of cyrillic users b) I have
>> access to it including via being proficient in Russian.
>>
>> It is offered to all the respective locale maintainers as a stopgap
>> solution. Stopgap in the sense that it is better to have some
>> transliteration than not to have any at all and carry over the bug from
>> 2006. That it may be a somewhat officially correct transliteration for
>> ru_RU is a bonus. In that sense I would dub the discussion on the
>> correctness for other languages "offtopic". Let me know if this is not OK.
>>
>> You are all are correctly mentioning the deficiencies of this approach.
>> However, I couldn't find a better straightforward approach as of yet.
>> Happy to hear from you as on how this could be handled.
>>
>> There is a danger of being caught in the web of language/country
>> differences. I propose just pruning the locales that are not comfortable
>> including this current table. We can address possible solutions in the
>> second wave of patching.
>>
>> I am vary of getting into discussions on specific country variants just
>> because of the sheer complexity of this topic. It is probably better
>> addressed by respective maintainers of their locales. I do not see a
>> "one fits all" solution in this first wave possible.
>>
>> I would like to have this "three options plan of action" vetted first
>> and then we could go to the specific detail. (Like, for instance, what
>> characters should be included in to the table, and in which
>> transliteration form.)
>>
>> I am looking forward to your reply,
>> Egor Kobylkin
>>
>> P.S. specifically as to how address languages other than Ru included in
>> GOST_7.79_System_B: we can take the first option left to right from that
>> table (Ru,By,Uk,Bg,Mk). Then it will technically work for all those
>> locales/languages but with errors where Ru supersedes their own variants.
>>
>>
>> On 05.10.2018 11:20, Rafal Luzynski wrote:
>>> 3.10.2018 11:32 Egor Kobylkin <egor@kobylkin.com> wrote:
>>>>
>>>> On 03.10.2018 11:19, Keld Simonsen wrote:
>>>>> Hi
>>>>>
>>>>> Please note that translitteration of Cyrillic to latin is not universal.
>>>>> There are different schemes for for example German, English and Danish, and
>>>>> there is also an ISO standard for it.
>>>>
>>>> Thanks for your feedback, Keld!
>>>>
>>>> Could the locale maintainers that wouldn't like to include this patch
>>>> explicitly state so here?
>>>
>>> I think it is about me so I must reply.  I am sorry about that and the sole
>>> reason is my lack of time.  I'm just a volunteer here, that means it's not
>>> my regular job to work on locale data nor anything in glibc nor in any other
>>> open source project.  I do these things only in my free time which I don't
>>> have much.  Of course you will see my contributions here and there but they
>>> are either trivial or take me months to complete.  Your patches are on my
>>> radar but I can't tell any ETA for them.  Of course, there are other people
>>> around here and they are all welcome to come and join.
>>>
>>>> That is:
>>>> - In the case that there is a different preferred cyrillic
>>>> transliteration table for any specific locale their maintainers may want
>>>> to point me to it so I can supply a separate table/patch.
>>>> - Or they could state explicitly that for some reason they would like to
>>>> exclude their locale from the patch for a default cyrillic
>>>> transliteration altogether.
>>>
>>> As Keld wrote, there are probably separate rules for every language so
>>> I don't think you should treat your rules as universal and include them
>>> in every locale.  At first sight, it seems to me they work only for English
>>> (as a destination locale).  Also, although it is called "transliteration
>>> from Cyrillic" it seems that it covers only Russian alphabet.  What about
>>> other languages which use Cyrillic alphabet but add their own diacritic
>>> characters?  Think about Belarusian, Ukrainian, Serbian, Chechen, Chuvash,
>>> Mari, Ossetian, Yakut, Tatar, and more.  What about languages which use
>>> Cyrillic alphabet but transliterate their respective letters in a different
>>> way than Russian?  For example, Russian "Ъ" is (I think) usually skipped
>>> in transliteration, I think you propose "``", but when transliterating from
>>> Bulgarian they usually transliterate this as "ă".
>>>
>>> Few remarks:
>>>
>>> * I think you transliterate "щ" as "shh", wouldn't "shch" be better?
>>> * You transliterate "ц" as "cz", wouldn't "ts" be better?  By the way,
>>>   in Polish language "cz" is a correct transliteration of "ч".
>>> * You transliterate "й" as "j", this is fine in many languages but wouldn't
>>>   "y" be better in English?
>>> * In case of "е": how will you know if it is correct to transliterate it
>>>   to "e" or "ie" or "je" or "ye"?
>>>
>>> These remarks are obviously incomplete, your patch deserves much more
>>> attention to review.
>>>
>>> Best regards,
>>>
>>> Rafal
>>>
>>
> 
>

Follow-Ups:
- Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29
  - From: Marko Myllynen

References:
- Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29
  - From: Egor Kobylkin
- Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29
  - From: Keld Simonsen
- Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29
  - From: Egor Kobylkin
- Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29
  - From: Rafal Luzynski
- Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29
  - From: Marko Myllynen

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]