This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [PATCH v5] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]

From: Rafal Luzynski <digitalfreak at lingonborough dot com>
To: Egor Kobylkin <egor at kobylkin dot com>, libc-alpha at sourceware dot org, libc-locales at sourceware dot org, mfabian at redhat dot com, Marko Myllynen <myllynen at redhat dot com>, "Dmitry V. Levin" <ldv at altlinux dot org>
Cc: Volodymyr Lisivka <vlisivka at gmail dot com>, Max Kutny <mkutny at gmail dot com>, danilo at gnome dot org
Date: Sat, 13 Oct 2018 02:59:17 +0200 (CEST)
Subject: Re: [PATCH v5] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com> <20180412224352.GB2911@altlinux.org> <d5582688-819b-90c2-3f4a-0d19c932d487@kobylkin.com>
Reply-to: Rafal Luzynski <digitalfreak at lingonborough dot com>

Egor,

Thank you for the update.  I took a closer look at your patch, so this
review is more thorough than the previous one, although still not complete.

As far as I understand, ISO 9 and its GOST variants are meant to be
universal rather than Russian-specific.  Therefore it is correct to place
them in a separate file, like translit_cyrillic, and then include this
file in other locales, adding locale-specific modifications if required.
For example, if there are any Russian-specific rules not included in this
file, they should go to ru_RU.

The text of the ISO 9 standard is not publicly available.  Do we have a
better reference than the Wikipedia article?

Regarding the format of your commit message, I hesitate to say anything
more because there are more experienced maintainers around here.  Please
take a look at the Contribution Checklist. [1]

While we are at it, what is your legal relationship with the glibc project?
Have you signed the FSF Copyright Assignment?  It is not necessary for the
locale data but it might be necessary if you are going to contribute test
code.

Regarding the tests, I think there is no complete transliteration test
suite at the moment.  Probably the only test is localedata/bug-iconv-trans.c.
You can also look at the collation tests in the same directory, which
use multiple *.UTF-8.in files.

You can skip the tests for now.

Technical issue:  Please either attach your patch to the email message or
paste it inline, not both.  The patch as it is now does not apply.
I had to edit it manually to make it apply.

12.10.2018 16:05 Egor Kobylkin <egor@kobylkin.com> wrote:
> [...]
> From this patch I have excluded locales that already mention cyrillic or
> have a transliteration table for it:
> az_AZ
> iso14651_t1_common
> ky_KG
> mn_MN
> sr_RS
> tg_TJ
> tk_TM
> tt_RU
> uk_UA
> uz_UZ
> uz_UZ@cyrillic

I confirm that these locales are excluded and there are no other missing
locales.

> [...]
>
> diff -uNr a/localedata/locales/C b/localedata/locales/C
> --- a/localedata/locales/C 2018-10-11 15:10:12.000000000 +0000
> +++ b/localedata/locales/C 2018-10-11 15:10:43.000000000 +0000

There is no such file.  Where did you get the source code from?  Are you
sure this is glibc? :-)

> [...]
> diff -uNr a/localedata/locales/am_ET b/localedata/locales/am_ET
> --- a/localedata/locales/am_ET 2018-10-11 15:10:11.000000000 +0000
> +++ b/localedata/locales/am_ET 2018-10-11 15:10:43.000000000 +0000
> @@ -1394,6 +1394,7 @@
> <U137A> <U0060><U0039><U0030>
> <U137B> <U0060><U0031><U0030><U0030>
> <U137C> <U0060><U0031><U0030><U0030><U0030><U0030>
> +include "translit_cyrillic";""
> translit_end
> %
> END LC_CTYPE

Shouldn't “include "translit_cyrillic";""” be placed before the custom rules,
together with the other includes?  The same applies to more files; I will
not list them all.

> [...]
> diff -uNr a/localedata/locales/sd_IN@devanagari
> b/localedata/locales/sd_IN@devanagari
> --- a/localedata/locales/sd_IN@devanagari 2018-10-11 15:10:18.000000000
> +0000
> +++ b/localedata/locales/sd_IN@devanagari 2018-10-11 15:10:49.000000000
> +0000

These 3 lines have been wrapped by the email client, so the patch does not
apply.
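
A wrapped header like the one quoted above can be caught before sending.
What follows is only a minimal Python sketch, not an existing glibc tool.
The function name find_wrapped_headers is hypothetical, and the sketch
assumes "diff -uNr" style headers with a timestamp, a numeric timezone
offset and no spaces in the file name.  It may also flag an ordinary
removed line that happens to start with "-- ".

```python
import re

# A complete "---"/"+++" header line as produced by `diff -uNr`:
# marker, file name, date, time, and a timezone offset such as "+0000".
HEADER = re.compile(
    r"^(---|\+\+\+) \S+\s+\d{4}-\d{2}-\d{2} [\d:.]+ [+-]\d{4}$"
)


def find_wrapped_headers(patch_text):
    """Return 1-based numbers of '---'/'+++' lines that look truncated."""
    suspicious = []
    for number, line in enumerate(patch_text.splitlines(), start=1):
        if line.startswith(("--- ", "+++ ")) and not HEADER.match(line):
            suspicious.append(number)
    return suspicious
```

An empty result does not prove that the patch applies; it only means that
no header line lost its timezone offset to line wrapping.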

> [...]
> diff -uNr a/localedata/locales/sd_PK b/localedata/locales/sd_PK
> --- a/localedata/locales/sd_PK 2018-10-11 15:10:18.000000000 +0000
> +++ b/localedata/locales/sd_PK 2018-10-11 15:10:49.000000000 +0000

There is no such file in glibc.

> [...]
> diff -uNr a/localedata/locales/translit_cyrillic
> b/localedata/locales/translit_cyrillic
> --- a/localedata/locales/translit_cyrillic 1970-01-01 00:00:00.000000000
> +0000
> +++ b/localedata/locales/translit_cyrillic 2018-10-11 15:10:52.000000000
> +0000

Again, 3 lines are broken, so the patch does not apply.

> [...]
> +% Contributions welcome for the rest of Cyrillic script in Unicode
> +% https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode.

I am still tempted to add more Cyrillic characters but I understand
that we must clearly separate the transliteration rules which come from
ISO 9 from those which are our own invention.  But that can wait.

> [...]
> +translit_start
> +
> +% CYRILLIC CAPITAL LETTER IO
> +<U0401> <U00CB>;"<U0059><U004F>"

This says that for ASCII (GOST 7.79 System B) you would like to transliterate
"Ё" as "YO" but the table in Wikipedia says "Yo".  I understand that either
may be correct depending on the context, but we should be consistent and
it is better to stick with the standard.

> +% CYRILLIC CAPITAL LETTER DJE
> +<U0402> <U0110>;"<U0044><U004A>"

This says "DJ" but System B does not mention it.  Where does it come from?
Also, I think it should be "Dj" rather than "DJ".

> +% CYRILLIC CAPITAL LETTER GJE
> +<U0403> <U01F4>;"<U0047><U0060>"

Correct, according to both systems.

> +% CYRILLIC CAPITAL LETTER UKRAINIAN IE
> +<U0404> <U00CA>;"<U0059><U0065>"

"Ye" - correct.

> +% CYRILLIC CAPITAL LETTER DZE
> +<U0405> <U1E90>;"<U005A><U0060>"

Correct.

> +% CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
> +<U0406> <U00CC>;<U0049>

Correct.  The table mentions an alternative transliteration "I`" but
says that it is "only before vowels for Old Russian and Old Bulgarian".
I think we can skip this other variant.

> +% CYRILLIC CAPITAL LETTER YI
> +<U0407> <U00CF>;"<U0059><U0069>"

"Yi" - correct.

> +% CYRILLIC CAPITAL LETTER JE
> +<U0408> "<U004A><U030C>";<U004A>

Correct.

> +% CYRILLIC CAPITAL LETTER LJE
> +<U0409> "<U004C><U0302>";"<U004C><U0060>"

Correct, according to the standard.  If the Serbian language requires "Lj"
then the overrides should go to the sr_RS file.

> +% CYRILLIC CAPITAL LETTER NJE
> +<U040A> "<U004E><U0302>";"<U004E><U0060>"

Correct, the same comment.

> +% CYRILLIC CAPITAL LETTER TSHE
> +<U040B> <U0106>;"<U0054><U0053><U0048>"

Where does "TSH" come from?  It is not mentioned by the System B table.
Also I am afraid this is not correct.

> +% CYRILLIC CAPITAL LETTER KJE
> +<U040C> <U1E30>;"<U004B><U0060>"

Correct.

> +% CYRILLIC CAPITAL LETTER SHORT U
> +<U040E> <U016C>;"<U0055><U0060>"

"U`" - correct.

> +% CYRILLIC CAPITAL LETTER DZHE
> +<U040F> "<U0044><U0302>";"<U0044><U0068>"

"Dh" - correct.

> [...]
> +% CYRILLIC CAPITAL LETTER ZHE
> +<U0416> <U017D>;"<U005A><U0048>"

"ZH" - shouldn't be "Zh"?

> [...]
> +% CYRILLIC UNDEFINED
> +<U0423><U0301> <U00DA>;"<U0055><U0060>"

1. I think it should be named "CYRILLIC CAPITAL LETTER U WITH ACUTE".
2. OK, the System A table mentions this letter but System B does not.
   Somehow we should handle it.  I think that "U`" is the best we can
   do for now.
3. It must be tested whether this actually works.

> [...]
> +% CYRILLIC CAPITAL LETTER HA
> +<U0425> <U0048>;<U0058>

I don't think there is any encoding in which "H" is unavailable, therefore
it will always be transliterated as "H" and never as "X".  We can't help
it and I don't think it is bad.
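
To illustrate why "X" is never reached, here is a minimal Python model of
the fallback order.  This is not glibc code.  It assumes that the
candidates separated by ";" are tried from left to right and that the
first one representable in the target charset wins.  The names CANDIDATES,
representable and transliterate are made up for this sketch, and the "?"
placeholder is an assumption as well.

```python
# Candidate lists copied from two entries of the quoted patch.
CANDIDATES = {
    "\u0401": ["\u00CB", "YO"],  # CYRILLIC CAPITAL LETTER IO
    "\u0425": ["H", "X"],        # CYRILLIC CAPITAL LETTER HA
}


def representable(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def transliterate(char, charset):
    """Pick the first candidate that the target charset can represent."""
    if representable(char, charset):
        return char  # no transliteration needed
    for candidate in CANDIDATES.get(char, []):
        if representable(candidate, charset):
            return candidate
    return "?"  # assumed placeholder when nothing fits
```

In this model transliterate("\u0425", "ascii") stops at "H" because "H"
is representable in ASCII, while transliterate("\u0401", "ascii") has to
skip U+00CB and falls through to "YO".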

> +% CYRILLIC CAPITAL LETTER TSE
> +<U0426> <U0043>;"<U0043><U005A>"

1. "CZ" - maybe should be "Cz"?
2. Are we able to implement the rule: "c before i, e, y, j"?
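
For reference, this is what the rule in point 2 means, as a minimal Python
sketch outside glibc.  It needs to look at the following letter, which a
table mapping one character at a time cannot express by itself.  SYSTEM_B
here is a tiny lower-case subset chosen for illustration, not the table
from the patch, and translit_tse is a hypothetical name.

```python
# Tiny illustrative subset of GOST 7.79 System B, lower case only.
SYSTEM_B = {
    "а": "a", "е": "e", "и": "i", "й": "j", "к": "k", "н": "n",
    "р": "r", "т": "t", "ы": "y`", "я": "ya",
}


def translit_tse(text):
    """Transliterate text, writing TSE as "c" before i, e, y, j, else "cz"."""
    out = []
    for index, char in enumerate(text):
        if char == "ц":
            # Look at the Latin form of the next letter, if there is one.
            next_char = text[index + 1] if index + 1 < len(text) else ""
            following = SYSTEM_B.get(next_char, "")
            out.append("c" if following[:1] in ("i", "e", "y", "j") else "cz")
        else:
            # Characters outside the subset are passed through unchanged.
            out.append(SYSTEM_B.get(char, char))
    return "".join(out)
```

With this subset "цирк" becomes "cirk" and "танец" becomes "tanecz".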

> +% CYRILLIC CAPITAL LETTER CHE
> +<U0427> <U010C>;"<U0043><U0048>"

"CH" -> "Ch"?

> +% CYRILLIC CAPITAL LETTER SHA
> +<U0428> <U0160>;"<U0053><U0048>"

"SH" -> "Sh"?

> +% CYRILLIC CAPITAL LETTER SHCHA
> +<U0429> <U015C>;"<U0053><U0048><U0048>"

"SHH" -> "Shh"?

> +% CYRILLIC CAPITAL LETTER HARD SIGN
> +<U042A> <U02BA>;"<U0041><U0060>"

"A`" is only for Bulgarian and should go to bg_BG.  How should
we transliterate an upper case hard sign to plain ASCII?  I think
just "``", the same as for lower case.

> +% CYRILLIC CAPITAL LETTER YERU
> +<U042B> <U0059>;"<U0059><U0060>"

Again, as "Y" is always available it will never be transliterated
as "Y`".

> +% CYRILLIC CAPITAL LETTER SOFT SIGN
> +<U042C> <U02B9>;<U0060>

OK, I like it to be transliterated to plain ASCII as "`".

> +% CYRILLIC CAPITAL LETTER E
> +<U042D> <U00C8>;"<U0045><U0060>"

OK

> +% CYRILLIC CAPITAL LETTER YU
> +<U042E> <U00DB>;"<U0059><U0055>"

"YU" -> "Yu"?

> +% CYRILLIC CAPITAL LETTER YA
> +<U042F> <U00C2>;"<U0059><U0041>"

"YA" -> "Ya"?

> [...]

I am sorry, this is of course incomplete but that's enough for tonight.

Regards,

Rafal

[1] https://sourceware.org/glibc/wiki/Contribution%20checklist

Follow-Ups:
- Re: [PATCH v5] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
  - From: Egor Kobylkin

References:
- [PATCH v5] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
  - From: Egor Kobylkin
