This is the mail archive of the
libc-locales@sourceware.org
mailing list for the GNU libc locales project.
Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29
After some kind help from Marko in an offline discussion,
I realized that the multi/single character approach I originally took
went against the iconv(1) logic anyway. So there is no harm in
dropping it and adopting Marko's suggestion instead. I will do so and
will resubmit the patch with ISO 9:1995/GOST 7.79 System A + fallback to
GOST 7.79 System B (for ASCII).
However, this doesn't resolve the issue of the ASCII part being different
for various locales. Again, I am asking the locale maintainers to let
me know whether they want to 1) adopt the one I am supplying, 2) write
their own or 3) ignore the patch altogether. Your feedback is appreciated!
This is the relevant part that helped:
> The first part (ISO-8859-15 or ASCII) defines the target encoding for
> iconv(1). //TRANSLIT is described in the iconv(1) man page as:
>
> If the string //TRANSLIT is appended to to-encoding, characters
> being converted are transliterated when needed and possible. This
> means that when a character cannot be represented in the target
> character set, it can be approximated through one or several
> similar looking characters. Characters that are outside of the
> target character set and cannot be transliterated are replaced
> with a question mark (?) in the output.
>
> So in the above examples, iconv(1) encounters the character U+0428
> which is not part of either of the target encodings and since
> //TRANSLIT is specified, iconv(1) tries transliteration according to
> the rules defined above, in case of ASCII U+0160 is not part of the
> target encoding so the next alternative is used.
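
To make the fallback order concrete, here is a minimal Python sketch of the
selection logic described above. It is only a model under stated assumptions,
not how glibc implements //TRANSLIT: the TRANSLIT_RULES table and the
transliterate() helper are hypothetical names, the table holds only the single
SHA rule from Marko's example, and Python's codecs stand in for the iconv
target encodings.

```python
# Hypothetical model of "first alternative that fits the target wins".
# This is NOT glibc code; names and table contents are illustrative only.

# One rule, mirroring: <U0428> "<U0160>";"<U0053><U0068>"
TRANSLIT_RULES = {
    "\u0428": ["\u0160", "Sh"],  # SHA: System A candidate, then System B
}


def can_encode(text, encoding):
    """Return True if every character of text exists in the target encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def transliterate(text, encoding):
    """Encode text, walking each character's alternatives left to right."""
    out = []
    for ch in text:
        if can_encode(ch, encoding):
            out.append(ch)  # representable as-is, no transliteration needed
            continue
        for alt in TRANSLIT_RULES.get(ch, []):
            if can_encode(alt, encoding):
                out.append(alt)  # first alternative that fits the target
                break
        else:
            out.append("?")  # no usable alternative, as the man page says
    return "".join(out).encode(encoding)


print(transliterate("\u0428", "iso8859-15"))  # System A candidate fits
print(transliterate("\u0428", "ascii"))       # falls through to System B
```

With the ISO-8859-15 codec the first candidate (U+0160) is representable and
is used; with ASCII it is not, so the second candidate "Sh" is chosen.
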
Bests,
Egor Kobylkin
On 05.10.2018 14:21, Marko Myllynen wrote:
> Hi,
>
> The scheme I proposed would also be ASCII compatible; consider this
> example:
>
> % CYRILLIC CAPITAL LETTER SHA
> <U0428> "<U0160>";"<U0053><U0068>"
>
> "printf \\u0428\\n | iconv -f UTF-8 -t ISO-8859-15//TRANSLIT | iconv
> -f ISO-8859-15 -t UTF-8" would produce Š as per System A and "printf
> \\u0428\\n | iconv -f UTF-8 -t ASCII//TRANSLIT" would produce Sh as
> per System B.
>
> Thanks,
>
> On 2018-10-05 15:00, Egor Kobylkin wrote:
>> Hi Marko,
>>
>> I have chosen System B because it is ASCII compatible. System
>> A is not ASCII compatible (diacritics in target).
>>
>> https://en.wikipedia.org/wiki/ISO_9#ISO_9:1995,_or_GOST_7.79_System_A
>>
>> "GOST 7.79 contains two transliteration tables.
>>
>> System A one Cyrillic character to one Latin character, some with
>> diacritics – identical to ISO 9:1995
>>
>> System B one Cyrillic character to one or many Latin characters
>> without diacritics"
>>
>> Hope this helps, Egor
>>
>> On 05.10.2018 13:54, Marko Myllynen wrote:
>>> Hi,
>>>
>>> Would it make sense to first use ISO 9:1995/GOST 7.79 System A if
>>> possible and if not, then fall back to GOST 7.79 System B?
>>>
>>> Implementation-wise, the current translit_* files have a few examples
>>> where a non-ASCII transliteration is tried first before an ASCII
>>> fallback. These examples are from translit_neutral:
>>>
>>> % NARROW NO-BREAK SPACE
>>> <U202F> <U00A0>;<U0020>
>>> % REVERSED TRIPLE PRIME
>>> <U2037> "<U2035><U2035><U2035>";"<U0060><U0060><U0060>"
>>>
>>> Thanks,
>>>
>>> On 2018-10-05 13:29, Egor Kobylkin wrote:
>>>> Keld, Marko, Rafal, other locale maintainers,
>>>>
>>>> all of this is written with a minimal viable fix for this bug
>>>> in mind, asap. I want to avoid wasting maintainers' time
>>>> getting into fundamental discussions here (although for
>>>> perfectly good reasons).
>>>>
>>>> I see three options:
>>>>
>>>> 1. Those locale maintainers that are fine with using the ISO
>>>>    9:1995/GOST_7.79_System_B cyrillic transliteration table (Ru)
>>>>    include it in their locales (see attached screenshot of the
>>>>    table).
>>>> 2. Those that want to have a differing table can create their
>>>>    own variety based on the spreadsheet I have prepared
>>>>    https://sourceware.org/bugzilla/attachment.cgi?id=8590 and
>>>>    include it in this patch.
>>>> 3. Those that want to omit a cyrillic transliteration
>>>>    altogether for now state so and just carry over the bug
>>>>    #2872 from the year 2006.
>>>>
>>>> Does this make sense to you?
>>>>
>>>> Just to be super clear on this: the patch is a stopgap _ASCII_
>>>> transliteration table. ASCII being AMERICAN Standard Code for
>>>> Information Interchange, that is obviously orthogonal to any
>>>> transliteration rule of other countries. As such it is not
>>>> explicitly targeting transliteration standards of any country.
>>>>
>>>> The patch reflects the Russian variety of ISO
>>>> 9:1995/GOST_7.79_System_B because a) ISO
>>>> 9:1995/GOST_7.79_System_B is available and can be helpful to a
>>>> majority of cyrillic users and b) I have access to it,
>>>> including by being proficient in Russian.
>>>>
>>>> It is offered to all the respective locale maintainers as a
>>>> stopgap solution. Stopgap in the sense that it is better to
>>>> have some transliteration than not to have any at all and
>>>> carry over the bug from 2006. That it may be a somewhat
>>>> officially correct transliteration for ru_RU is a bonus. In
>>>> that sense I would dub the discussion on the correctness for
>>>> other languages "offtopic". Let me know if this is not OK.
>>>>
>>>> You are all correctly mentioning the deficiencies of this
>>>> approach. However, I couldn't find a better straightforward
>>>> approach as of yet. Happy to hear from you as to how this
>>>> could be handled.
>>>>
>>>> There is a danger of being caught in the web of
>>>> language/country differences. I propose just pruning the
>>>> locales that are not comfortable including this current table.
>>>> We can address possible solutions in the second wave of
>>>> patching.
>>>>
>>>> I am wary of getting into discussions on specific country
>>>> variants just because of the sheer complexity of this topic.
>>>> It is probably better addressed by the respective maintainers
>>>> of their locales. I do not see a "one size fits all" solution
>>>> as possible in this first wave.
>>>>
>>>> I would like to have this "three options plan of action"
>>>> vetted first and then we could go to the specific detail.
>>>> (Like, for instance, what characters should be included in
>>>> the table, and in which transliteration form.)
>>>>
>>>> I am looking forward to your reply, Egor Kobylkin
>>>>
>>>> P.S. specifically as to how to address languages other than Ru
>>>> included in GOST_7.79_System_B: we can take the first option
>>>> left to right from that table (Ru,By,Uk,Bg,Mk). Then it will
>>>> technically work for all those locales/languages but with
>>>> errors where Ru supersedes their own variants.
>>>>
>>>>
>>>> On 05.10.2018 11:20, Rafal Luzynski wrote:
>>>>> 3.10.2018 11:32 Egor Kobylkin <egor@kobylkin.com> wrote:
>>>>>>
>>>>>> On 03.10.2018 11:19, Keld Simonsen wrote:
>>>>>>> Hi
>>>>>>>
>>>>>>> Please note that transliteration of Cyrillic to Latin
>>>>>>> is not universal. There are different schemes for, for
>>>>>>> example, German, English and Danish, and there is also an
>>>>>>> ISO standard for it.
>>>>>>
>>>>>> Thanks for your feedback, Keld!
>>>>>>
>>>>>> Could the locale maintainers that wouldn't like to include
>>>>>> this patch explicitly state so here?
>>>>>
>>>>> I think it is about me so I must reply. I am sorry about
>>>>> that and the sole reason is my lack of time. I'm just a
>>>>> volunteer here, that means it's not my regular job to work
>>>>> on locale data nor anything in glibc nor in any other open
>>>>> source project. I do these things only in my free time
>>>>> of which I don't have much. Of course you will see my
>>>>> contributions here and there but they are either trivial or
>>>>> take me months to complete. Your patches are on my radar but
>>>>> I can't give any ETA for them. Of course, there are other
>>>>> people around here and they are all welcome to come and
>>>>> join.
>>>>>
>>>>>> That is:
>>>>>> - In the case that there is a different preferred cyrillic
>>>>>>   transliteration table for any specific locale their
>>>>>>   maintainers may want to point me to it so I can supply a
>>>>>>   separate table/patch.
>>>>>> - Or they could state explicitly that for some reason they
>>>>>>   would like to exclude their locale from the patch for a
>>>>>>   default cyrillic transliteration altogether.
>>>>>
>>>>> As Keld wrote, there are probably separate rules for every
>>>>> language so I don't think you should treat your rules as
>>>>> universal and include them in every locale. At first sight,
>>>>> it seems to me they work only for English (as a destination
>>>>> locale). Also, although it is called "transliteration from
>>>>> Cyrillic" it seems that it covers only Russian alphabet. What
>>>>> about other languages which use Cyrillic alphabet but add
>>>>> their own diacritic characters? Think about Belarusian,
>>>>> Ukrainian, Serbian, Chechen, Chuvash, Mari, Ossetian, Yakut,
>>>>> Tatar, and more. What about languages which use Cyrillic
>>>>> alphabet but transliterate their respective letters in a
>>>>> different way than Russian? For example, Russian "Ъ" is (I
>>>>> think) usually skipped in transliteration, I think you
>>>>> propose "``", but when transliterating from Bulgarian they
>>>>> usually transliterate this as "ă".
>>>>>
>>>>> Few remarks:
>>>>>
>>>>> * I think you transliterate "щ" as "shh", wouldn't "shch" be
>>>>>   better?
>>>>> * You transliterate "ц" as "cz", wouldn't "ts" be better? By
>>>>>   the way, in Polish language "cz" is a correct
>>>>>   transliteration of "ч".
>>>>> * You transliterate "й" as "j", this is fine in many
>>>>>   languages but wouldn't "y" be better in English?
>>>>> * In case of "е": how will you know if it is correct to
>>>>>   transliterate it to "e" or "ie" or "je" or "ye"?
>>>>>
>>>>> These remarks are obviously incomplete, your patch deserves
>>>>> much more attention to review.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Rafal
>>>>>
>>>>
>>>
>>>
>>
>
>