This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872] v2


Thank you, Egor.  I am looking at your patch and although I have
not yet finished, here are some remarks:

First of all, I think that such a large patch should also include
tests.  Please see how automatic tests are done for locale data
and write your own.

11.10.2018 00:29 Egor Kobylkin <egor@kobylkin.com> wrote:
> [...]
> From this patch I have excluded locales that already mention cyrillic or
> have a transliteration table for it:
> az_AZ
> iso14651_t1_common
> ky_KG
> mn_MN
> sr_RS
> tg_TJ
> tk_TM
> tt_RU
> uk_UA
> uz_UZ
> uz_UZ@cyrillic
> [...]

I think that eventually we would like to include your translit_cyrillic
in these locales as well, because I assume that your rules should work well
for them too and should cover more characters than the individual
language contributors took into account.  This is similar to Mike's work on
collation: common rules were created and all locales include them, adding
their own language-specific modifications.

> [...]
> COMMIT MESSAGE:
> [...]
> I am excluding these locales from this proposed patch. I have written
> directly to locale maintainer emails listed in the files. Volodymyr
> Lisivka <vlisivka@gmail.com>, Max Kutny <mkutny@gmail.com> (uk_UA),
> Данило Шеган <danilo@gnome.org> (sr_YU, sr_CS) have confirmed the

I am not sure if we want Cyrillic text in the commit message.  Shouldn't
it be, uhm, transliterated? :-)

"sr_CS" - I guess you meant "sr_RS".

"sr_YU" has been dropped, do we want to mention it?

> [...]
> [BZ #2872]
> * localedata/locales/translit_cyrillic: add ISO 9.1995, GOST 7.79

Please capitalize "Add" here.  BTW, shouldn't it be "New file"
instead?

> System A transliteration System B transcription table from Cyrillic to
> Latin/ASCII.
> * localedata/locales/C: add include "translit_cyrillic";"" to LC_CTYPE
> translit section.

Same, "Add" here.

> * localedata/locales/aa_DJ: Likewise.

Good (here and everywhere below).

> [...]
> diff -uNr a/localedata/locales/translit_cyrillic
> b/localedata/locales/translit_cyrillic
> --- a/localedata/locales/translit_cyrillic 1970-01-01 00:00:00.000000000
> +0000
> +++ b/localedata/locales/translit_cyrillic 2018-10-09 19:02:54.000000000
> +0000
> @@ -0,0 +1,383 @@
> +escape_char /
> +comment_char %
> +
> +% This file is part of the GNU C Library and contains locale data.
> +% The Free Software Foundation does not claim any copyright interest
> +% in the locale data contained in this file. The foregoing does not
> +% affect the license of the GNU C Library as a whole. It does not
> +% exempt you from the conditions of the license if your use would
> +% otherwise be governed by that license.
> +
> +% Transliterations of cyrillic letters to latin and/or ascii symbols.

"cyrillic" -> "Cyrillic"; "latin" -> "Latin"; "ascii" -> "ASCII".

> +% Inspired by ISO 9.1995 / GOST 7.79-2000.
> +% Covers Unicode Range https://www.unicode.org/charts/PDF/U0400.pdf
> +% i.e [U4001-U4F9, U2019] but only the letters covered by ISO 9.1995

Typos:

"i.e" -> "i.e.," (somebody please correct me if I'm wrong here)
"U4001" - I guess you meant "U0401"
"U4F9" -> "U04F9".  I think that "U4F9" is not strictly wrong, but
let's be consistent.

Also I can see some gaps in the range.  Are you going to fill them
or maybe for now just mention that they exist?
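
To make the gaps easy to spot, here is a minimal Python sketch (not part
of the patch, only an illustration).  It assumes each rule starts with
a <Uxxxx> token at the beginning of a line, as in the quoted file, and
it uses only two rules copied from the patch as sample input.  The
helper names are hypothetical.

```python
import re
import unicodedata

# Two rules copied from the quoted patch; the real file has many more.
SAMPLE = '''\
% CYRILLIC CAPITAL LETTER KJE
<U040C> <U1E30>;"<U004B><U0060>"
% CYRILLIC CAPITAL LETTER SHORT U
<U040E> <U016C>;"<U0055><U0060>"
'''


def covered_code_points(text):
    """Collect the first code point of every rule line.

    Comment lines start with '%' and therefore never match.
    """
    covered = set()
    for line in text.splitlines():
        match = re.match(r"<U([0-9A-Fa-f]{4,8})>", line)
        if match:
            covered.add(int(match.group(1), 16))
    return covered


def missing_in_range(text, first, last):
    """Return code points in [first, last] that have no rule in text."""
    covered = covered_code_points(text)
    return [cp for cp in range(first, last + 1) if cp not in covered]


for cp in missing_in_range(SAMPLE, 0x040C, 0x040E):
    # Prints the one gap in this sample range: <U040D>.
    print(f"<U{cp:04X}> {unicodedata.name(chr(cp), 'UNASSIGNED')}")
```

Running the same check over the whole file and the whole U0400 block
would give the list of gaps that could be mentioned in the header comment.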

> +% It implements the GOST_7.79 System A (Latin Script) as a first
> +% option and System B Cyrillic (ASCII) as a second option. Check
> +% https://en.wikipedia.org/wiki/ISO_9 for reference.
> +% The System B is extended from GOST_7.79-Russian using open sources
> +% of the transliteration mappings and the "h/`" diacritics logic.

What is the "h/`" diacritics logic?

> +
> +% Usage examples:
> +% iconv -f UTF-8 -t ISO-8859-15//TRANSLIT \
> +% | iconv -f ISO-8859-15 -t UTF-8 # System A
> +% iconv -f UTF-8 -t ASCII//TRANSLIT # System B.
> +
> +% Contributions welcome for the rest of Cyrillic script in Unicode

Sure, I'm not going to stop you from pushing these changes just because
there are missing characters.  I will consider adding them later.

> +% https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode.
> +% Bugfix for https://sourceware.org/bugzilla/show_bug.cgi?id=2872.
> +% Generated from UnicodeData.txt with
> +% https://sourceware.org/bugzilla/attachment.cgi?id=11301.

1. Is the file really generated with a script and not modified later?
If so, then maybe you should contribute the script instead?  In that case
you should also not post this file to libc-locale; maintainers and
developers should be able to regenerate it.
2. The link leads to a LibreOffice spreadsheet, not to a script.

> +LC_CTYPE
> +
> +translit_start
> +

<U0400> is missing here.  Are you going to leave it for now?

> +% CYRILLIC CAPITAL LETTER IO
> +<U0401> <U00CB>;"<U0059><U004F>"
> [...]
> +% CYRILLIC CAPITAL LETTER KJE
> +<U040C> <U1E30>;"<U004B><U0060>"

<U040D> is missing here.  Can we add it already?

> +% CYRILLIC CAPITAL LETTER SHORT U
> +<U040E> <U016C>;"<U0055><U0060>"
> [...]
> +% CYRILLIC CAPITAL LETTER U
> +<U0423> <U0055>
> +% CYRILLIC UNDEFINED
> +<U0423><U0301> <U00DA>;"<U0055><U0060>"

This still makes me wonder.

Does it work at all?
What if we remove this rule?  Won't the sequence be transliterated as
<U0423> => "U" with <U0301> left unchanged, so that "U" + <U0301>
will eventually produce "Ú"?
Why is it called "UNDEFINED"?
Do we need similar rules for other characters?
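
A small Python check of the composition I have in mind (only a sketch:
it shows what Unicode normalization does with "U" + <U0301>, it does not
show what iconv does with a leftover <U0301> when the target charset
cannot represent it):

```python
import unicodedata

# "U" followed by COMBINING ACUTE ACCENT, i.e. what would remain if only
# the <U0423> => "U" rule fired and <U0301> were passed through unchanged.
upper = unicodedata.normalize("NFC", "U\u0301")
lower = unicodedata.normalize("NFC", "u\u0301")

# NFC composes the pairs into the precomposed letters that the patch
# uses as the first alternative in its two-character rules.
print(upper == "\u00da")  # True: U+00DA, LATIN CAPITAL LETTER U WITH ACUTE
print(lower == "\u00fa")  # True: U+00FA, LATIN SMALL LETTER U WITH ACUTE
```

So the result would match the first alternative only if something
composes the sequence afterwards, which is exactly what I am unsure about.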

> [...]
> +% CYRILLIC SMALL LETTER U
> +<U0443> <U0075>
> +% CYRILLIC UNDEFINED
> +<U0443><U0301> <U00FA>;"<U0075><U0060>"

Same here.

> [...]
> +% CYRILLIC SMALL LETTER YA
> +<U044F> <U00E2>;"<U0079><U0061>"

Again, <U0450> is missing (it is the lowercase variant of <U0400>).

> +% CYRILLIC SMALL LETTER IO
> +<U0451> <U00EB>;"<U0079><U006F>"
> [...]
> +% CYRILLIC SMALL LETTER KJE
> +<U045C> <U1E31>;"<U006B><U0060>"

<U045D> is missing (same reason as <U040D>).
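
The case pairing I rely on above can be confirmed with Python's built-in
case mapping (a sketch for illustration only; the list name is mine):

```python
# (capital, small) pairs named in this review as missing from the table.
MISSING_PAIRS = [(0x0400, 0x0450), (0x040D, 0x045D)]

for capital, small in MISSING_PAIRS:
    # str.lower() applies the Unicode lowercase mapping.
    paired = chr(capital).lower() == chr(small)
    print(f"<U{capital:04X}> -> <U{small:04X}>: {paired}")
```

Whatever rule is added for a capital letter should therefore get a
matching rule for its small counterpart.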

> +% CYRILLIC SMALL LETTER SHORT U
> +<U045E> <U016D>;"<U0075><U0060>"
> +% CYRILLIC SMALL LETTER DZHE
> +<U045F> "<U0064><U0302>";"<U0064><U0068>"

More letters are missing here.  Is this because they are historic, so we
don't want to include them now?  But "YUS" is also historic.
(Please do not remove YUS just for the sake of consistency.)

> +% CYRILLIC CAPITAL LETTER BIG YUS
> +<U046A> <U01CD>;"<U004F><U0060>"
> +% CYRILLIC SMALL LETTER BIG YUS
> +<U046B> <U01CE>;"<U006F><U0060>"
> [...]

I will continue but, again, I can't give any ETA, so other reviewers
are welcome here.

Regards,

Rafal

