This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] v2

From: Egor Kobylkin <egor at kobylkin dot com>
To: Rafal Luzynski <digitalfreak at lingonborough dot com>, libc-alpha at sourceware dot org, libc-locales at sourceware dot org, mfabian at redhat dot com, Marko Myllynen <myllynen at redhat dot com>
Cc: "Dmitry V. Levin" <ldv at altlinux dot org>, Volodymyr Lisivka <vlisivka at gmail dot com>, Max Kutny <mkutny at gmail dot com>, danilo at gnome dot org
Date: Thu, 11 Oct 2018 16:59:11 +0200
Subject: Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] v2
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com> <20180412224352.GB2911@altlinux.org> <bcb7fcd8-7f71-4f7e-6804-7c3f07d6d3ee@kobylkin.com> <180516689.458569.1539255868196@poczta.nazwa.pl>

Hi Rafal
On 11.10.2018 13:04, Rafal Luzynski wrote:
> Thank you, Egor.  I am looking at your patch and although I have
> not yet finished, here are some remarks:
> 
> First of all, I think that such a large patch should also include
> the tests.  Please see how automatic tests are performed in locale
> data and write your own.
Could you please point me to the existing automatic tests?
Locally I am using the test suggested in glibc locales wiki.
>From my commit message:
"The glibc wiki explicitly lists this use case as the test example
https://sourceware.org/glibc/wiki/Locales#Testing_Locales :
LC_ALL=$LOCALE.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT <
translit-test-input.txt
"
I am visually checking whether any iconv run fails for all those locales
but you must refer to some automated unit test with a boolean outcome,
right?

> 
> 11.10.2018 00:29 Egor Kobylkin <egor@kobylkin.com> wrote:
>> [...]
>> From this patch I have excluded locales that already mention cyrillic or
>> have a transliteration table for it:
>> az_AZ
>> iso14651_t1_common
>> ky_KG
>> mn_MN
>> sr_RS
>> tg_TJ
>> tk_TM
>> tt_RU
>> uk_UA
>> uz_UZ
>> uz_UZ@cyrillic
>> [...]
> 
> I think that eventually we would like to include your translit_cyrillic
> also in these locales because I assume that your rules should work good
> for them as well, also should include more characters than the individual
> language contributors took into account.  Similarly to Mike's work on
> collation: a common rules were created and all locales include them adding
> their own language specific modifications.

This is fine with me. Should anybody supply translit_xxxxxxxxxxxx for
any of the mentioned locales we can include them as well. Wouldn't it be
easier to coordinate those as separate patches though?

> 
>> [...]
>> COMMIT MESSAGE:
>> [...]
>> I am excluding these locales from this proposed patch. I have written
>> directly to locale maintainer emails listed in the files. Volodymyr
>> Lisivka <vlisivka@gmail.com>, Max Kutny <mkutny@gmail.com> (uk_UA),
>> Данило Шеган <danilo@gnome.org> (sr_YU, sr_CS) have confirmed the
> 
> I am not sure if we want Cyrillic text in the commit message.  Shouldn't
> it be, uhm, tranlisterated? :-)

I do not see any Cyrillic text in the commit message.
the ?????? you see are the actual "?" symbols coming out of iconv now.

> 
> "sr_CS" - I guess you meant "sr_RS".
> 
> "sr_YU" has been dropped, do we want to mention it?

The list of locales and the patch itself is generated from the actual
locales - I do not hand pick them, only exclude the ones in the
exclusion list above.

> 
>> [...]
>> [BZ #2872]
>> * localedata/locales/translit_cyrillic: add ISO 9.1995, GOST 7.79
> 
> Please start "Add" with an uppercase.  BTW, shouldn't it be "New file"
> instead?
> 
>> System A transliteration System B transcription table from Cyrillic to
>> Latin/ASCII.
>> * localedata/locales/C: add include "translit_cyrillic";"" to LC_CTYPE
>> translit section.
> 
> Same, "Add" here.
> 
>> * localedata/locales/aa_DJ: Likewise.
> 
> Good (here and everywhere below).
> 
>> [...]
>> diff -uNr a/localedata/locales/translit_cyrillic
>> b/localedata/locales/translit_cyrillic
>> --- a/localedata/locales/translit_cyrillic 1970-01-01 00:00:00.000000000
>> +0000
>> +++ b/localedata/locales/translit_cyrillic 2018-10-09 19:02:54.000000000
>> +0000
>> @@ -0,0 +1,383 @@
>> +escape_char /
>> +comment_char %
>> +
>> +% This file is part of the GNU C Library and contains locale data.
>> +% The Free Software Foundation does not claim any copyright interest
>> +% in the locale data contained in this file. The foregoing does not
>> +% affect the license of the GNU C Library as a whole. It does not
>> +% exempt you from the conditions of the license if your use would
>> +% otherwise be governed by that license.
>> +
>> +% Transliterations of cyrillic letters to latin and/or ascii symbols.
> 
> "cyrillic" -> "Cyrillic"; "latin" -> "Latin"; "ascii" -> "ASCII".
> 
>> +% Inspired by ISO 9.1995 / GOST 7.79-2000.
>> +% Covers Unicode Range https://www.unicode.org/charts/PDF/U0400.pdf
>> +% i.e [U4001-U4F9, U2019] but only the letters covered by ISO 9.1995
> 
> Typos:
> 
> "i.e" -> "i.e.," (somebody please fix me if I'm wrong here)
> "U4001" - I guess you meant "U0401"
> "U4F9" -> "U04F9".  I think that "U4F9" is not definitely bad but
> let's be consistent.

These are all good catches. I will fix them and resubmit.

> 
> Also I can see some gaps in the range.  Are you going to fill them
> or maybe for now just mention that they exist?
> 
No, were not going to fill them please see this:
On 10.10.2018 14:34, Marko Myllynen wrote:
> On 2018-10-10 15:19, Egor Kobylkin wrote:
>> On 10.10.2018 13:22, Marko Myllynen wrote:
>>>> correct link https://sourceware.org/bugzilla/attachment.cgi?id=11303
>>> Although I haven't checked every rule this in general looks very good
>>> (but see below).
>>> Not sure do we want to add the few missing characters
>>> mentioned at https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode,
>>> e.g., one instantly notices that U+0400 is missing. (I wouldn't add at
>>> least initially the more exotic characters, like the historic ones,
>>> though.) Perhaps filing a bug or two for these cases for separate
>>> consideration would be ok.
>> The question here is what should serve as their transliteration and
>> transcription?
> Not sure, so filing a separate bug about this once your patch is merged
> might be the most suitable action for now, I don't think we want to
> postpone merging your work further due to these non-ISO 9 cases.
>


>> +% It implements the GOST_7.79 System A (Latin Script) as a first
>> +% option and System B Cyrillic (ASCII) as a second option. Check
>> +% https://en.wikipedia.org/wiki/ISO_9 for reference.
>> +% The System B is extended from GOST_7.79-Russian using open sources
>> +% of the transliteration mappings and the "h/`" diacritics logic.
> 
> What is "h/`" diacritics logic?
Basically some Linguist mentioned that they have chosen "h" and '`" to
represent the diacritics for the transcription (i.e. GOST 7.79 System
B). This way there is some resemblance to the watertight transliteration
as per ISO 9 (Sysetem A) but it is still all in ASCII. We have decided
to extend GOST 7.79 to the all ISO 9 characters and so I have extended
it following that Linguist logic.


>> +% https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode.
>> +% Bugfix for https://sourceware.org/bugzilla/show_bug.cgi?id=2872.
>> +% Generated from UnicodeData.txt with
>> +% https://sourceware.org/bugzilla/attachment.cgi?id=11301.
> 
> 1. Is the file really generated with a script and not modified later?
> If yes then maybe you should contribute the script instead?  In that case,
> you should also not post this file to libc-locale, maintainers and
> developers should be able to regenerate it.
> 2. The link leads to a LibreOffice spreadsheet.
No, I do not have a script. The "generated" means it is a result of
formulas in that spreadsheet. People are welcome to write a script that
should be straightforward implementation of those rules in formulas.

> 
>> +LC_CTYPE
>> +
>> +translit_start
>> +
> 
> <U0400> is missing here.  Are you going to leave it for now?
Yes, it is to be left out, not in ISO 9. See the exchange with Marko above.

> 
>> +% CYRILLIC CAPITAL LETTER IO
>> +<U0401> <U00CB>;"<U0059><U004F>"
>> [...]
>> +% CYRILLIC CAPITAL LETTER KJE
>> +<U040C> <U1E30>;"<U004B><U0060>"
> 
> <U040D> is missing here.  Can we add it already?
Yes, it is to be left out, not in ISO 9. See the exchange with Marko above.

> 
>> +% CYRILLIC CAPITAL LETTER SHORT U
>> +<U040E> <U016C>;"<U0055><U0060>"
>> [...]
>> +% CYRILLIC CAPITAL LETTER U
>> +<U0423> <U0055>
>> +% CYRILLIC UNDEFINED
>> +<U0423><U0301> <U00DA>;"<U0055><U0060>"
> 
> This still makes me wonder.
> 
> Does it work at all?
> What if we remove this rule, won't it be transliterated as
> <U0423> => "U", <U0301> - left unchanged, so "U" + <U0301>"
> will eventually produce "Ú"?
> Why is it called "UNDEFINED"?
On 10.10.2018 14:34, Marko Myllynen wrote:
> On 2018-10-10 15:19, Egor Kobylkin wrote:
>> On 10.10.2018 13:22, Marko Myllynen wrote:
...
>>> I'm not sure this will work, no existing rule in translit_* files
>>> contain two characters, I'd assume that the rule for U+0423 is applied
>>> first and then the below rule is never used.
>>>
>>> % CYRILLIC UNDEFINED
>>> <U0423><U0301> <U00DA>;"<U0055><U0060>"
>>>
>>> Perhaps this should be commented out or removed altogether if it's not
>>> working as intended.
>>
>> So yes, they are not processed. I would drop them to not to have special
>> cases. But I am also fine with keeping them because all work is done
>> already.
> I'd probably drop them but I don't feel strongly about this either way.
>
> Thanks for your efforts, I don't have any further comments, I'll leave
> this now for Rafal and Mike to provide additional feedback and hopefully
> merge soon.

Could you also please check the discussion with Marko on UNDEFINED and
other related topics? You were on To: or CC: for those emails.
The same for the other characters below.

> Do we need similar rules for other characters?
> 
>> [...]
>> +% CYRILLIC SMALL LETTER U
>> +<U0443> <U0075>
>> +% CYRILLIC UNDEFINED
>> +<U0443><U0301> <U00FA>;"<U0075><U0060>"
> 
> Same here.
> 
>> [...]
>> +% CYRILLIC SMALL LETTER YA
>> +<U044F> <U00E2>;"<U0079><U0061>"
> 
> Again <U0450> missing (because it is lowercase variant of <U0400>).
> 
>> +% CYRILLIC SMALL LETTER IO
>> +<U0451> <U00EB>;"<U0079><U006F>"
>> [...]
>> +% CYRILLIC SMALL LETTER KJE
>> +<U045C> <U1E31>;"<U006B><U0060>"
> 
> <U045D> missing (same reason as <U040D>).
> 
>> +% CYRILLIC SMALL LETTER SHORT U
>> +<U045E> <U016D>;"<U0075><U0060>"
>> +% CYRILLIC SMALL LETTER DZHE
>> +<U045F> "<U0064><U0302>";"<U0064><U0068>"
> 
> More letters missing here.  Is this because they are historic so we
> don't want to include them now?  Well, but "YUS" is also historic.
> (Please, do not remove YUS for consistency).
> 
>> +% CYRILLIC CAPITAL LETTER BIG YUS
>> +<U046A> <U01CD>;"<U004F><U0060>"
>> +% CYRILLIC SMALL LETTER BIG YUS
>> +<U046B> <U01CE>;"<U006F><U0060>"
>> [...]
> 
> I will continue but, again, I don't give any ETA so other reviewers
> are welcome here.
> 
> Regards,
> 
> Rafal
> 

Bests,
Egor

--- Begin Message ---

From: Marko Myllynen <myllynen at redhat dot com>

To: Egor Kobylkin <egor at kobylkin dot com>, Rafal Luzynski <digitalfreak at lingonborough dot com>

Cc: Keld Simonsen <keld at keldix dot com>, libc-alpha at sourceware dot org, libc-locales at sourceware dot org, "Dmitry V. Levin" <ldv at altlinux dot org>, Volodymyr Lisivka <vlisivka at gmail dot com>, Carlos O'Donell <carlos at redhat dot com>, Max Kutny <mkutny at gmail dot com>, danilo at gnome dot org

Date: Wed, 10 Oct 2018 15:34:26 +0300

Subject: Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] re-submission for 2.29

Envelope-to: <egor at kobylkin dot com>

Organization: Red Hat

References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com> <20181003091949.GA21486@rap.rap.dk> <21d872b2-613e-d1f5-26c0-baa4b5721df9@kobylkin.com> <1485772360.805333.1538731225156@poczta.nazwa.pl> <deacdf31-d0bb-a92d-1de3-934d6b4cb158@kobylkin.com> <bda2ca60-18f1-3b19-91e5-c9ad144bc834@redhat.com> <bb4e1ba5-5fa5-2986-2573-7d27be226124@kobylkin.com> <69e26cab-810e-824b-3b16-b75ac44d8b0c@redhat.com> <b8f02fe9-f911-487f-b50b-9b0c43191cb6@kobylkin.com> <f51992ad-008b-03a4-8880-4c12edced53b@redhat.com> <246390048.827062.1539037422672@poczta.nazwa.pl> <4db1ce91-3184-cf45-01c5-80667fc4cf65@kobylkin.com> <f6b530b0-53b7-bd90-9bb9-864d0a477f50@kobylkin.com> <a9af47d8-bf3d-e607-38e1-a6e765a604d3@kobylkin.com> <1198370378.413479.1539123456488@poczta.nazwa.pl> <70c29e42-0fd3-4f10-fafb-44d67190d870@kobylkin.com> <c89f0ac3-6ccb-3e41-dc26-75ef03d9afa1@kobylkin.com> <9edcf6f2-607c-91ac-8eaf-ffbc973fe597@redhat.com> <3f50cc1f-9493-0611-3478-0394ecb6b37e@kobylkin.com>

Reply-to: Marko Myllynen <myllynen at redhat dot com>
Hi,

On 2018-10-10 15:19, Egor Kobylkin wrote:
> On 10.10.2018 13:22, Marko Myllynen wrote:
>>> correct link https://sourceware.org/bugzilla/attachment.cgi?id=11303
>>
>> Although I haven't checked every rule this in general looks very good
>> (but see below). 
> 
>> Not sure do we want to add the few missing characters
>> mentioned at https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode,
>> e.g., one instantly notices that U+0400 is missing. (I wouldn't add at
>> least initially the more exotic characters, like the historic ones,
>> though.) Perhaps filing a bug or two for these cases for separate
>> consideration would be ok.
> 
> The question here is what should serve as their transliteration and
> transcription?

Not sure, so filing a separate bug about this once your patch is merged
might be the most suitable action for now, I don't think we want to
postpone merging your work further due to these non-ISO 9 cases.

>> I'm not sure this will work, no existing rule in translit_* files
>> contain two characters, I'd assume that the rule for U+0423 is applied
>> first and then the below rule is never used.
>>
>> % CYRILLIC UNDEFINED
>> <U0423><U0301> <U00DA>;"<U0055><U0060>"
>>
>> Perhaps this should be commented out or removed altogether if it's not
>> working as intended.
> 
> So yes, they are not processed. I would drop them to not to have special
> cases. But I am also fine with keeping them because all work is done
> already.
I'd probably drop them but I don't feel strongly about this either way.

Thanks for your efforts, I don't have any further comments, I'll leave
this now for Rafal and Mike to provide additional feedback and hopefully
merge soon.

Thanks,

-- 
Marko Myllynen
--- End Message ---

Follow-Ups:
- Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] v2
  - From: Egor Kobylkin

References:
- [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] v2
  - From: Egor Kobylkin
- Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] v2
  - From: Rafal Luzynski

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]