This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29


Hi,

Thanks for the update. I have a few mostly cosmetic comments below;
hopefully we'll hear from others whether they agree with this direction.

- Please add the standard glibc locale header (see the existing
translit_* files for reference)
- Consider wrapping the header lines at or around column 70-72
- Consider describing which characters, character ranges, or blocks are
supported (perhaps also describe why some of those are not included, see
e.g. https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode)
- Please remove trailing whitespace and spaces after ;
- No duplicates:

% CYRILLIC SMALL LETTER IE
<U0435> <U0065>; <U0065>

should become:

% CYRILLIC SMALL LETTER IE
<U0435> <U0065>

- There are a few issues with the definitions:

% CYRILLIC CAPITAL LETTER U
<U0423> <U0055>; <U0055>
% CYRILLIC UNDEFINED
<U0423><U0423> <U00DA>; "<U0055><U0060>"

% CYRILLIC SMALL LETTER U
<U0443> <U0075>; <U0075>
% CYRILLIC UNDEFINED
<U0443><U0443> <U00FA>; "<U0075><U0060>"
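
The mechanical points in the list (trailing whitespace, spaces after ;,
duplicate alternatives) could be caught with a small check before
submission. The following is only a minimal sketch in Python; the
function name lint_translit_line is hypothetical and not part of glibc,
and it assumes one definition per line with the source character first
and the alternatives separated by ; as in the examples above.

```python
import re

def lint_translit_line(line):
    """Return a list of cosmetic problems found in one table line."""
    problems = []
    if line != line.rstrip():
        problems.append("trailing whitespace")
    # Comment lines (starting with %) and blank lines need no further checks.
    if line.lstrip().startswith("%") or not line.strip():
        return problems
    if re.search(r";\s", line):
        problems.append("space after ;")
    # First field is the source character, the rest are the alternatives.
    parts = line.split(None, 1)
    rest = parts[1] if len(parts) > 1 else ""
    alternatives = [alt.strip() for alt in rest.split(";")]
    if len(alternatives) != len(set(alternatives)):
        problems.append("duplicate alternative")
    return problems

if __name__ == "__main__":
    for sample in ("<U0435> <U0065>; <U0065>", "<U0435> <U0065>"):
        print(repr(sample), lint_translit_line(sample))
```

It does not try to judge whether a definition is correct, only whether
it is formatted consistently.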

I wonder whether it would be possible to automate generation of this
file so that issues like the above could be avoided. But perhaps that
could be the next step once this initial patch lands.
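
To illustrate what such automation might look like, here is a rough
sketch in Python. It is not an existing glibc tool: the helper names
ucs and entry are made up for this example, the comment line is taken
from Python's unicodedata module, and I am assuming the convention seen
in translit_neutral of quoting only multi-character alternatives.

```python
import unicodedata

def ucs(text):
    """Format each character of text in <UXXXX> notation."""
    return "".join("<U%04X>" % ord(ch) for ch in text)

def entry(source, alternatives):
    """Build one table entry: a % comment line plus the definition."""
    # Drop repeated alternatives while keeping their order.
    unique = []
    for alt in alternatives:
        if alt not in unique:
            unique.append(alt)
    # Quote only alternatives that consist of several characters.
    formatted = []
    for alt in unique:
        code = ucs(alt)
        formatted.append('"%s"' % code if len(alt) > 1 else code)
    name = unicodedata.name(source)
    return "%% %s\n%s %s" % (name, ucs(source), ";".join(formatted))

if __name__ == "__main__":
    # System A first, then the System B (ASCII) fallback.
    print(entry("\u0428", ["\u0160", "Sh"]))
    # A repeated alternative collapses into a single one.
    print(entry("\u0435", ["e", "e"]))
```

Feeding it a mapping exported from the spreadsheet would then produce
entries without duplicates or stray spaces by construction.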

Thanks,

On 2018-10-05 23:47, Egor Kobylkin wrote:
> After some kind help from Marko in the offline discussion
> I realized the multi/single character approach I originally took was
> against the  of the iconv(1) logic anyway. So there is no harm in
> dropping it and adopting Marko's suggestion instead. I will do so and
> will resubmit the patch with ISO 9:1995/GOST 7.79 System A + fallback to
> GOST 7.79 System B (for ASCII).
> 
> However this doesn't resolve the issue for ASCII part being different
> for various locales. Again, I am offering the locale maintainers to let
> me know if they want to 1) adopt the one I am supplying, 2) write their
> own or 3) ignore the patch altogether. Your feedback is appreciated!
> 
> This is the relevant part that helped:
>> The first part (ISO-8859-15 or ASCII) defines the target encoding for
>> iconv(1). //TRANSLIT is described in the iconv(1) man page as:
>>
>> If the string //TRANSLIT is appended to to-encoding, characters
>> being converted are transliterated when needed and possible. This
>> means that when a character cannot be represented in the target
>> character set, it can be approximated through one or several
>> similar looking characters. Characters that are outside of the
>> target character set and cannot be transliterated are replaced
>> with a question mark (?) in the output.
>>
>> So in the above examples, iconv(1) encounters the character U+0428,
>> which is not part of either target encoding. Since //TRANSLIT is
>> specified, iconv(1) tries transliteration according to the rules
>> defined above; in the case of ASCII, U+0160 is not part of the
>> target encoding, so the next alternative is used.
> 
> Bests,
> Egor Kobylkin
> 
> On 05.10.2018 14:21, Marko Myllynen wrote:
>> Hi,
>>
>> The scheme I proposed would also be ASCII compatible; consider this 
>> example:
>>
>> % CYRILLIC CAPITAL LETTER SHA
>> <U0428> "<U0160>";"<U0053><U0068>"
>>
>> "printf \\u0428\\n | iconv -f UTF-8 -t ISO-8859-15//TRANSLIT | iconv 
>> -f ISO-8859-15 -t UTF-8" would produce Š as per System A and "printf
>>  \\u0428\\n | iconv -f UTF-8 -t ASCII//TRANSLIT" would produce Sh as 
>> per System B.
>>
>> Thanks,
>>
>> On 2018-10-05 15:00, Egor Kobylkin wrote:
>>> Hi Marko,
>>>
>>> I have chosen System B because it is ASCII compatible. System
>>> A is not ASCII compatible (diacritics in target).
>>>
>>> https://en.wikipedia.org/wiki/ISO_9#ISO_9:1995,_or_GOST_7.79_System_A
>>>
>>>
>>> "GOST 7.79 contains two transliteration tables.
>>>
>>> System A: one Cyrillic character to one Latin character, some with
>>> diacritics – identical to ISO 9:1995
>>>
>>> System B: one Cyrillic character to one or many Latin characters
>>> without diacritics"
>>>
>>> Hope this helps,
>>> Egor
>>>
>>> On 05.10.2018 13:54, Marko Myllynen wrote:
>>>> Hi,
>>>>
>>>> Would it make sense to first use ISO 9:1995/GOST 7.79 System A if
>>>> possible and if not, then fall back to GOST 7.79 System B?
>>>>
>>>> Implementation-wise, current translit_* files have a few examples 
>>>> where a non-ASCII transliteration is tried first before an ASCII 
>>>> fallback. These examples are from translit_neutral:
>>>>
>>>> % NARROW NO-BREAK SPACE
>>>> <U202F> <U00A0>;<U0020>
>>>> % REVERSED TRIPLE PRIME
>>>> <U2037> "<U2035><U2035><U2035>";"<U0060><U0060><U0060>"
>>>>
>>>> Thanks,
>>>>
>>>> On 2018-10-05 13:29, Egor Kobylkin wrote:
>>>>> Keld, Marko, Rafal, other locale maintainers,
>>>>>
>>>>> this is all written with a minimal viable fix for this bug
>>>>> in mind, asap. I want to avoid wasting maintainers' time
>>>>> getting into fundamental discussions here (although for
>>>>> perfectly good reasons).
>>>>>
>>>>> I see three options:
>>>>> 1. Those locale maintainers that are fine with using the ISO
>>>>> 9:1995/GOST_7.79_System_B Cyrillic transliteration table (Ru)
>>>>> include it in their locales (see attached screenshot of the
>>>>> table).
>>>>> 2. Those that want to have a differing table can create their
>>>>> own variety based on the spreadsheet I have prepared
>>>>> https://sourceware.org/bugzilla/attachment.cgi?id=8590 and
>>>>> include it in this patch.
>>>>> 3. Those that want to omit a Cyrillic transliteration altogether
>>>>> for now state so and just carry over the bug #2872 from the
>>>>> year 2006.
>>>>>
>>>>> Does this make sense to you?
>>>>>
>>>>> Just to be super clear on this: the patch is a stopgap _ASCII_
>>>>>  transliteration table. ASCII being AMERICAN Standard Code for
>>>>>  Information Interchange, that is obviously orthogonal to any 
>>>>> transliteration rule of other countries. As such it is not 
>>>>> explicitly targeting transliteration standards of any country.
>>>>>
>>>>> The fact that the patch is reflecting Russian variety of ISO 
>>>>> 9:1995/GOST_7.79_System_B is because a) ISO 
>>>>> 9:1995/GOST_7.79_System_B is available and can be helpful to a 
>>>>> majority of cyrillic users b) I have access to it including
>>>>> via being proficient in Russian.
>>>>>
>>>>> It is offered to all the respective locale maintainers as a 
>>>>> stopgap solution. Stopgap in the sense that it is better to 
>>>>> have some transliteration than not to have any at all and
>>>>> carry over the bug from 2006. That it may be a somewhat
>>>>> officially correct transliteration for ru_RU is a bonus. In
>>>>> that sense I would dub the discussion on the correctness for
>>>>> other languages "offtopic". Let me know if this is not OK.
>>>>>
>>>>> You are all correctly mentioning the deficiencies of this
>>>>> approach. However, I couldn't find a better straightforward
>>>>> approach as of yet. Happy to hear from you as to how this
>>>>> could be handled.
>>>>>
>>>>> There is a danger of being caught in the web of 
>>>>> language/country differences. I propose just pruning the 
>>>>> locales that are not comfortable including this current table. 
>>>>> We can address possible solutions in the second wave of 
>>>>> patching.
>>>>>
>>>>> I am wary of getting into discussions on specific country 
>>>>> variants just because of the sheer complexity of this topic.
>>>>> It is probably better addressed by respective maintainers of
>>>>> their locales. I do not see a "one fits all" solution in this
>>>>> first wave possible.
>>>>>
>>>>> I would like to have this "three options plan of action"
>>>>> vetted first and then we could go to the specific detail.
>>>>> (Like, for instance, what characters should be included in
>>>>> the table, and in which transliteration form.)
>>>>>
>>>>> I am looking forward to your reply, Egor Kobylkin
>>>>>
>>>>> P.S. specifically as to how to address languages other than Ru 
>>>>> included in GOST_7.79_System_B: we can take the first option 
>>>>> left to right from that table (Ru,By,Uk,Bg,Mk). Then it will 
>>>>> technically work for all those locales/languages but with 
>>>>> errors where Ru supersedes their own variants.
>>>>>
>>>>>
>>>>> On 05.10.2018 11:20, Rafal Luzynski wrote:
>>>>>> 3.10.2018 11:32 Egor Kobylkin <egor@kobylkin.com> wrote:
>>>>>>>
>>>>>>> On 03.10.2018 11:19, Keld Simonsen wrote:
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> Please note that transliteration of Cyrillic to Latin
>>>>>>>> is not universal. There are different schemes, for example
>>>>>>>> for German, English and Danish, and there is also an
>>>>>>>> ISO standard for it.
>>>>>>>
>>>>>>> Thanks for your feedback, Keld!
>>>>>>>
>>>>>>> Could the locale maintainers that wouldn't like to include 
>>>>>>> this patch explicitly state so here?
>>>>>>
>>>>>> I think it is about me so I must reply.  I am sorry about 
>>>>>> that and the sole reason is my lack of time.  I'm just a 
>>>>>> volunteer here, that means it's not my regular job to work
>>>>>> on locale data nor anything in glibc nor in any other open 
>>>>>> source project.  I do these things only in my free time,
>>>>>> of which I don't have much.  Of course you will see my
>>>>>> contributions here and there but they are either trivial or
>>>>>> take me months to complete.  Your patches are on my radar but
>>>>>> I can't give any ETA for them.  Of course, there are other
>>>>>> people around here and they are all welcome to come and
>>>>>> join.
>>>>>>
>>>>>>> That is:
>>>>>>> - In the case that there is a different preferred cyrillic
>>>>>>> transliteration table for any specific locale their
>>>>>>> maintainers may want to point me to it so I can supply a
>>>>>>> separate table/patch.
>>>>>>> - Or they could state explicitly that for some reason they
>>>>>>> would like to exclude their locale from the patch for a
>>>>>>> default cyrillic transliteration altogether.
>>>>>>
>>>>>> As Keld wrote, there are probably separate rules for every 
>>>>>> language so I don't think you should treat your rules as 
>>>>>> universal and include them in every locale.  At first sight, 
>>>>>> it seems to me they work only for English (as a destination 
>>>>>> locale).  Also, although it is called "transliteration from 
>>>>>> Cyrillic" it seems that it covers only Russian alphabet. What
>>>>>> about other languages which use Cyrillic alphabet but add
>>>>>> their own diacritic characters?  Think about Belarusian, 
>>>>>> Ukrainian, Serbian, Chechen, Chuvash, Mari, Ossetian, Yakut, 
>>>>>> Tatar, and more.  What about languages which use Cyrillic 
>>>>>> alphabet but transliterate their respective letters in a 
>>>>>> different way than Russian?  For example, Russian "Ъ" is (I 
>>>>>> think) usually skipped in transliteration, I think you 
>>>>>> propose "``", but when transliterating from Bulgarian they 
>>>>>> usually transliterate this as "ă".
>>>>>>
>>>>>> A few remarks:
>>>>>>
>>>>>> * I think you transliterate "щ" as "shh", wouldn't "shch" be
>>>>>> better?
>>>>>> * You transliterate "ц" as "cz", wouldn't "ts" be better?  By
>>>>>> the way, in the Polish language "cz" is a correct
>>>>>> transliteration of "ч".
>>>>>> * You transliterate "й" as "j", this is fine in many languages
>>>>>> but wouldn't "y" be better in English?
>>>>>> * In case of "е": how will you know if it is correct to
>>>>>> transliterate it to "e" or "ie" or "je" or "ye"?
>>>>>>
>>>>>> These remarks are obviously incomplete, your patch deserves 
>>>>>> much more attention to review.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Rafal
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
> 


-- 
Marko Myllynen

