This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH v5] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]


Hi Egor,

Thank you for your updates and again I'm sorry for my delayed response.
A general remark about this: if you are in a hurry and you need the
corrected transliteration rules for yourself or for your users then
you don't have to wait for the patch to be reviewed and accepted here.
You can make your own locale and use it, you don't need to rebuild glibc,
you don't even need root privileges to do it.  The locale data subsystem
is designed to allow users create and use their own locales.

I have seen and tested locally your newer patch [1] but I will reply
in this thread because I think it is easier to reply in context.

I would like to summarize the differences between v5 [2] and v6 to make
sure that I noticed them all and that you have not introduced any changes
inadvertently.  (Yes, that means I have skipped another patch which you
sent between those two.)

* Locales removed from the patch: C and sd_PK.
* Added locales: az_AZ and ky_KG.
* You consequently transliterate single uppercase Cyrillic letters
  to sequences of all uppercase Latin letters in all languages (whenever
  a Cyrillic letter is transliterated to more than one Latin letter),
  for example "Ї" is now transliterated as "YI" rather than "Yi".

Again I must say that I experienced lots of technical difficulties to apply
the patch and I had to rework it manually because it is not applicable as
it is now.  Here I explain below how to make a technically correct patch:

13.10.2018 18:58 Egor Kobylkin <egor@kobylkin.com> wrote:
> 
> 
> Hi Rafal,
> 
> Thanks for the thorough checking, it really helps.
> 
> On 13.10.2018 02:59, Rafal Luzynski wrote:
> > Technical issue:  Please either attach your patch to the email 
> > message or paste it inline, not both.  The patch as it is now is not 
> > applicable. I had to edit it manually to apply.
> >> diff -uNr a/localedata/locales/C b/localedata/locales/C --- 
> >> a/localedata/locales/C 2018-10-11 15:10:12.000000000 +0000 +++ 
> >> b/localedata/locales/C 2018-10-11 15:10:43.000000000 +0000
> > 
> > There is no such file.  Where have you got the source code from?
> > Are you sure this is glibc? :-)
> 
> I was running my patch process against the Ubuntu 18.04 version of
> localedata/locales. Now I have checked out the GitHub glibc source v2.28
> and done the same. [...]

Remarks:

* Please use the repository at https://sourceware.org/git/?p=glibc.git
  rather than a copy at GitHub.
* Please use the master branch rather than 2.28.
* Commit your work locally.
* Use "git format-patch" (e.g., "git format-patch HEAD^..HEAD") to generate
  the patch, then you can email it to this list.
* You can email it inline or, if your email client breaks the lines and
inserts
  other unnecessary characters, send as an attachment.
* Use "git pull --rebase" to keep your work up to date.
* Read the Contribution Checklist [3] for more details.

> 
> >> [...] From this patch I have excluded locales that already mention 
> >> cyrillic or have a transliteration table for it: az_AZ 
> >> iso14651_t1_common ky_KG mn_MN sr_RS tg_TJ tk_TM tt_RU uk_UA uz_UZ 
> >> uz_UZ@cyrillic
> > 
> > I confirm that these locales are excluded and there are no other 
> > missing locales.
> 
> Because of the surprisingly different list of locales between Ubuntu and
> glibc there is now a different list of excluded ones as well.
> 
> mn_MN
> sr_RS
> tg_TJ
> tk_TM
> tt_RU
> uk_UA
> uz_UZ
> uz_UZ@cyrillic
> uk_UA
> 
> az_AZ, ky_KG are now included

As far as I can see, there are no other differences between those two
patches.

> because they don't have cyrillic translit
> in glibc. iso14651_t1_common is still implicitly excluded, because it
> doesn't have 'translit_end' string.
> 
> Somehow az_AZ and tr_TR from glibc fail to transliterate Cyrillic even
> after the patch applied (az_AZ is explicitly including tr_TR). I do not
> see a reason, maybe you could check?

I noticed that az_AZ does not build at all, localedef program reports
a "circular dependency" (if I recall correctly).  I think that since az_AZ
contains “copy "tr_TR"” and tr_TR already contains (in your patch)
“include "translit_cyrillic";""” you should just remove
“include "translit_cyrillic";""” from az_AZ which effectively means that
there are no changes in az_AZ.  Optionally, you can add a comment to az_AZ
to explain why it does not contain “include "translit_cyrillic";""” and to
make sure that if anyone removes “copy "tr_TR"” ever in the future, the
“include "translit_cyrillic";""” will be added at the same time.  I have
verified that removing that line makes the locale data build without an
error but I have not yet verified that they work as expected.

> > Regarding the tests, I think there is no complete transliteration 
> > test suite at the moment.  Probably the only test is 
> > localedata/bug-iconv-trans.c. You can also see the collation tests 
> > placed in the same directory, they use those multiple *.UTF-8.in 
> > files.
> > 
> > You can skip the tests for now.
> 
> In the copy of localedata/bug-iconv-trans.c lines 10-11 we could just
> change the list of the symbols we are now transliterating
> 
>   const char str[] = "ÄäÖöÜüß";
>   const char expected[] = "AEaeOEoeUEuess";
> 
> like this
> 
>   const char str[] =
> "ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩъЫьЭЮЯабвгдежзийклмнопрстуу́фхцчшщЪыЬэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍ
> ҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’"
>   const char expected[] =
> "YODJG`YEZ`IYIJL`N`TSHK`U`DHABVGDEZHZIJKLMNOPRSTUU`FXCZCHSHSHHA`Y``E`YUYAabvgdezhzijklmnoprstuu`fxczchsh
> shh``y``e`yuyayodjg`yez`iyijl`n`tshk`u`dhO`o`FHfhYHyhE`e`G`g`GHghGHghZH`zh`K`k`K`k`N`n`NGngP`p`O`o`C`C`
> T`t`UuH`h`TCZtczSH`SH`CH`ch`CH`ch`iZH`zh`CH`ch`A`a`A`a`E`e`A`a`ZH`zh`Z`z`Z`z`I`i`O`o`O`o`U`u`U`u`CH`ch`
> Y`y`'";
> 
> First I though they could just be added but not all locales
> transliterate Umlauts so just extending the current test won't do as it
> will fail for those locales.

I noticed that you pasted a patch in a Bugzilla comment. [4] If I understand
correctly you suggest to rework the existing test case to test Cyrillic
transliteration instead of German.  Please don't do it: the existing test
cases may be extended but must not be removed.  I think we should rework
this
test case to handle multiple locales and multiple transliteration pairs;
optionally we can add a new case instead.  Currently I lean into reworking
the existing test case.

> >> [...] diff -uNr a/localedata/locales/am_ET 
> >> b/localedata/locales/am_ET --- a/localedata/locales/am_ET 
> >> 2018-10-11 15:10:11.000000000 +0000 +++ b/localedata/locales/am_ET 
> >> 2018-10-11 15:10:43.000000000 +0000 @@ -1394,6 +1394,7 @@ <U137A> 
> >> <U0060><U0039><U0030> <U137B> <U0060><U0031><U0030><U0030> <U137C> 
> >> <U0060><U0031><U0030><U0030><U0030><U0030> +include 
> >> "translit_cyrillic";"" translit_end % END LC_CTYPE
> > 
> > Shouldn't “include "translit_cyrillic";""” be placed before the 
> > custom rules, together with other includes?  The same in more files, 
> > I will not mention them all.
> 
> If I recall correctly it is because of the
> "translit_end
> END LC_CTYPE"
> part at the end of the translit_cyrillic. This way it works for any
> locale, regardless whether it has translit itself or not. And being at
> the end it does not supersede any previous transliteration that may be
> there for a reason.
> 
> As with some other comments, I am not super familiar with the formats of
> glibc files. So if you have a definitive suggestion - pls. formulate it
> as an imperative, not a question.

I feel like a newcomer here so it was meant to be a question to other
more experienced maintainers but probably it's time to change this attitude.
So, also taking into account what Marko wrote, [5] please put the include
directive after all other include directives, or after the "translit_start"
directive if there are no other includes, rather than putting it just before
"translit_end".  Even if putting it at the dnd works sometimes or even
always.
Same as you put #include's near top of the file when writing a C program
even
if sometimes you may put it anywhere and it will work.  If you use a script
to insert your include directives then please rework it, if you insert them
manually then just move them manually.

> >> [...] +translit_start + +% CYRILLIC CAPITAL LETTER IO +<U0401> 
> >> <U00CB>;"<U0059><U004F>"
> > 
> > This says that for ASCII (GOST 7.79 System B) you would like to 
> > transliterate "Ё" as "YO" but the table in Wikipedia says "Yo".  I 
> > understand that one or another may be correct depending on the 
> > context but we should be consistent and also better let's stick with 
> > the standard.
> 
> The choice for YO, SH, YA, ZH etc. is to avoid naming collisions for
> example for "Сх" and "Ш" that would both transliterate to Sh:
> With SH:"Схема"->"Shema" but "Шема"->"SHema"
> With Sh:"Схема"->"Shema" and "Шема"->"Shema". Collision!
> This is important e.g. for renaming files, grouping as in using uniq etc.

I understand this idea.  Is this part of any existing standard?  I can't
see it regulated by GOST 7.79.

I'd rather not include the transliteration rules which seems reasonable to
us (the developers) but are not known and therefore not acceptable by the
outer world.

> 
> > 
> >> +% CYRILLIC CAPITAL LETTER DJE +<U0402> <U0110>;"<U0044><U004A>"
> > 
> > This says "DJ" but System B does not mention it.  Where does it come 
> > from? Also, I think it should be "Dj" rather than "DJ".
> I took the first two letters from its name.

As I said previously, I would like to add more Cyrillic letters even if
they are not regulated by any standard.  But let's separate them and make
it clear that these rules are based on GOST 7.79 and those are our own
invention (or come from other standard etc.)  I think that all these
rules may even be in the same file but in different parts of it.

> >> [...] +% CYRILLIC UNDEFINED +<U0423><U0301> 
> >> <U00DA>;"<U0055><U0060>"
> > 
> > 1. I think it should be named "CYRILLIC CAPITAL LETTER U WITH ACUTE".
> > 2. OK, the System A table mentions this letter but System B does not.
> > Somehow we should handle it.  I think that "U`" is the best we can do
> > for now. 3. It must be tested whether this actually works.
> 1. Let's do it just before you are ready to commit the patch, because it
> breaks formulas in my worksheet and I will have to do it manually?
> 3. I have tested and it doesn't work/gets ignored. But if you were to
> handle COMBINING it would work, wouldn't it?

My guess is that since translit_combining just removes all those combining
diacritic characters and translit_combining is usually included before
translit_cyrillic then <U0301> is removed even before <U0423> is taken
into account.  Also my another guess is that it might work good if you
just removed this rule: <U0423> would be translated to "U" and <U0301>
would remain unchanged and eventually those two characters would produce
"Ú".  But, again, that's just a guess, I have not tested.

> >> [...] +% CYRILLIC CAPITAL LETTER HA +<U0425> <U0048>;<U0058>
> > 
> > I don't think that "H" is unavailable in any encoding therefore it 
> > will always be transliterated as "H" and never as "X".  We can't
> > help it and I don't think it is bad.
> > 
> But we can keep this for when/if there is a way to explicitly request
> transcription instead of transliteration.

Note that either it will make the test cases fail or we will have to
prepare the test cases deliberately skip the translation of <U0425>
into "X" because "H" will be always working.  We can't force iconv
to choose the second transliteration rule if the first one works.

That means we will have a problem to construct the test cases.

> >> +% CYRILLIC CAPITAL LETTER TSE +<U0426> <U0043>;"<U0043><U005A>"
> > 
> > 1. "CZ" - maybe should be "Cz"?> 2. Are we able to implement the
> > rule: "c before i, e, y, j"?
> > 
> 1. see for CYRILLIC CAPITAL LETTER IO
> 2. not sure what you are talking about in 2. but I believe it's not
> possible as per Marko's email.

Hm... I can't find a good example now.  Maybe I was mislead by the rules
of Cyrillic transliteration which I learned at school and which are not
necessarily universal and not necessarily useful for English readers.

> >> +% CYRILLIC CAPITAL LETTER HARD SIGN +<U042A> 
> >> <U02BA>;"<U0041><U0060>"
> > 
> > "A`" is only for Bulgarian and should go to bg_BG.  How should we 
> > transliterate an upper case hard sign to plain ASCII?  I think that 
> > just "``", same as lower case.
> This is to avoid collision.

What collision?

> Besides AFAIK e.g. in Russian there is no
> capital hard sign because there are no words starting with it.

True but it can be used in ALL UPPERCASE text.  Therefore we need a clear
and correct transliteration rule for it.

> 
> > 
> >> +% CYRILLIC CAPITAL LETTER YERU +<U042B> <U0059>;"<U0059><U0060>"
> > 
> > Again, as "Y" is always available it will never be transliterated as 
> > "Y`".
> > 
> But we can keep this for when/if there is a way to explicitly request
> transcription instead of transliteration.

Again, it will be difficult or impossible to construct a correct test case
and we must be aware of this.

Regards,

Rafal


[1] https://sourceware.org/ml/libc-alpha/2018-10/msg00300.html
[2] https://sourceware.org/ml/libc-alpha/2018-10/msg00213.html
[3] https://sourceware.org/glibc/wiki/Contribution%20checklist
[4] https://sourceware.org/bugzilla/show_bug.cgi?id=2872#c47
[5] https://sourceware.org/ml/libc-alpha/2018-10/msg00232.html


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]