This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]

From: Rafal Luzynski <digitalfreak at lingonborough dot com>
To: Egor Kobylkin <egor at kobylkin dot com>, libc-alpha at sourceware dot org, libc-locales at sourceware dot org
Cc: Marko Myllynen <myllynen at redhat dot com>
Date: Sat, 8 Dec 2018 02:15:40 +0100 (CET)
Subject: Re: [PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com> <20180412224352.GB2911@altlinux.org> <b82fe65b-b880-a2b5-c97d-2a6aae9c1165@kobylkin.com> <837001401.21346.1542406647888@poczta.nazwa.pl> <bef63562-09d1-3306-aae9-20002ccf4130@kobylkin.com> <5a247161-c498-ed50-ff4a-58f2ecf974f0@redhat.com> <1441622134.517912.1543702039942@poczta.nazwa.pl> <2f6fc82c-77ba-d331-ae5d-e2373e122a88@kobylkin.com>

17.11.2018 19:34 Egor Kobylkin <egor@kobylkin.com> wrote:
> [...]
> Looks like we have three issues:
> 1. lack of explicit control which transformation to use (System A or
> System B) via //TRANSLIT
> 2. possibility of collision for System B if used CAP/low transcription
> for capital letters
> 3. Cyrillic 'Х'/'х' (ha) never transcribes to 'H'/'h' as it should per
> System B because its equivalent 'X'/'x' from System A is always present
> and takes precedence.

True.

> As a solution shouldn't we only keep System B in a new file
> transcribe_cyrillic and put it in place as the explicit ASCII
> transcription for targeted locales (as opposed to transliteration)?
> 
> We would keep System A as translit_cyrillic but won't include it into
> this patch. Once you have resolved an issue of having two conflicting
> rule-sets but only one key //TRANSLIT you could add the System A back.

Sounds like a good idea to provide those two files:

* translit_cyrillic_system_a,
* translit_cyrillic_system_b,

(or any other pair of names) and let the individual locales choose whether
they want to include System A or System B.  For optimization, system_b
file could include system_a and modify it.
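
The "include and modify" idea can be pictured as one table layered on top of
another.  This is only a sketch in Python, not glibc locale syntax; the
three-letter tables and the names SYSTEM_A, SYSTEM_B_OVERRIDES and
transliterate are assumptions made up for illustration, not the contents of
any proposed file.

```python
# Toy subset of a diacritic-based table (assumed, illustration only).
SYSTEM_A = {"а": "a", "ж": "ž", "ш": "š"}

# Only the entries that differ are listed for the ASCII variant.
SYSTEM_B_OVERRIDES = {"ж": "zh", "ш": "sh"}

# "Include system_a and modify it": later entries win over earlier ones.
SYSTEM_B = {**SYSTEM_A, **SYSTEM_B_OVERRIDES}


def transliterate(text, table):
    """Map each character through the table; leave unknown ones untouched."""
    return "".join(table.get(ch, ch) for ch in text)


print(transliterate("жаша", SYSTEM_A))
print(transliterate("жаша", SYSTEM_B))
```

A locale would then pick exactly one of the two tables, which matches the
"different set of rules = different locale" constraint.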

> The SH/Sh can be decided on either way - seems like an easy change any
> way.

I'm in favor of "Sh" because it will work fine for titlecased words
(where only the first letter is uppercase) but I'm aware it would be
a problem for uppercased words.  Unfortunately, I think we are unable
to satisfy both cases.

> On 16.11.18 23:17, Rafal Luzynski wrote:
> 
> > Egor, while at this I was thinking about your idea to transliterate
> > letters like "Ш" (uppercase) to "SH" (always uppercase) in order to
> > distinguish between "Шема" (-> "SHema") and "Схема" (-> "Shema" or
> > "Sxema").
> 
> to clarify, this SH/Sh collision issue relates only to iconv -f UTF-8 -t
> ASCII//TRANSLIT (i.e. System B transcription).

True.
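
The collision discussed above can be reproduced with a few lines of Python.
This is a sketch under assumptions: the table below is a toy subset with
'х' mapped to 'h', and the function name transcribe and its caps_style
parameter are hypothetical; none of it is taken from the patch.

```python
# Toy lowercase table (assumed, illustration only).
LOWER = {"ш": "sh", "с": "s", "х": "h", "е": "e", "м": "m", "а": "a"}


def transcribe(text, caps_style):
    """caps_style 'title' maps Ш to 'Sh'; 'upper' maps Ш to 'SH'."""
    out = []
    for ch in text:
        latin = LOWER.get(ch.lower(), ch)
        if ch.isupper():
            if caps_style == "upper":
                latin = latin.upper()
            else:
                latin = latin.capitalize()
        out.append(latin)
    return "".join(out)


for style in ("title", "upper"):
    print(style, transcribe("Шема", style), transcribe("Схема", style))
```

With the 'title' style both words come out identical, while the 'upper'
style keeps them apart at the price of a two-capital prefix.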

> But it's not only SH/Sh, there are following combinations used to
> transcribe capital letters:
> 
> YO, DJ, YE, TSH, DH, ZH, CZ, CH, SH, SHH, YU, YA, FH, YH, GH, NG, TCZ

Absolutely true.  I skipped the rest of the list only for brevity: if we
find a solution for one letter, the same solution will work for all the
others.

> [...]
> With transcription we are basically stripping information from the data,
> mapping it into a smaller character set. The idea to keep them in
> CAP/CAP is to try to preserve as much information as possible.

I'm only afraid that things like "TWo CApitals" or "CamelCase" are
common among us computer geeks but do not look good in natural-language
text, especially when it is displayed to regular, non-technical users.

> [...]
> So in fact we have two rules for each letter in the same file (System A
> and System B), where System A takes precedence.
> 
> I have a question then: isn't this more like a hack than a right thing
> to do?
> 
> Shouldn't we have two explicit rules for transcription and
> transliteration not dependent on a destination character set?

It's impossible with the current API of iconv.  Maybe it will become
possible in the future, but that is much more work than what we are
doing here now.  Again, for now a different set of rules = a different
locale.

I have another question: is it really a job of transliteration to preserve
all original information, to ensure no collisions and have the ability to
restore the original text?  I'm afraid that as long as plain ASCII is the
destination charset whatever system we provide it will always be possible
to provide a malicious combination of the Cyrillic characters proving that
the system generates collisions.
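
To make the point concrete, collisions can be searched for mechanically.
The sketch below uses an assumed four-letter toy table (again with 'х'
mapped to 'h'); the names TABLE, transcribe and find_collisions are
hypothetical, and the real tables would of course be much larger.

```python
from itertools import product

# Toy table (assumed, illustration only).
TABLE = {"с": "s", "х": "h", "ш": "sh", "щ": "shh"}


def transcribe(word):
    return "".join(TABLE[ch] for ch in word)


def find_collisions(max_len):
    """Group every Cyrillic string up to max_len letters by its ASCII output
    and keep only the outputs produced by more than one input."""
    seen = {}
    for n in range(1, max_len + 1):
        for letters in product(TABLE, repeat=n):
            word = "".join(letters)
            seen.setdefault(transcribe(word), []).append(word)
    return {latin: words for latin, words in seen.items() if len(words) > 1}


collisions = find_collisions(2)
for latin, words in collisions.items():
    print(latin, words)
```

Even this tiny table yields ambiguous outputs in pure lowercase, where no
choice between "Sh" and "SH" can help.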

> > I still don't like the idea to
> > put two uppercase letters in a beginning of a word in titlecase only
> > to indicate that there was originally a single letter.  What if we:
> > 
> > * drop the rule of transliterating "Х" to "H" and transliterate
> > always to "X",
> This would contradict ISO 9.1995. (System A).

Yes, it would.  I'm trying to find a solution here since I think we have
proved that we can't implement a system which handles System A and
System B and ensures no collisions at the same time.  At least one
requirement must be dropped (at least partially).

> System A was added on Marko's request (so setting him on TO:) I am
> neutral on keeping it or dropping it, just to be clear.

I don't think I saw that request from Marko, but I'm in favor of keeping
System A, too.

Marko, it would be good to hear your opinion about System A vs. System B
again.

> [...]
> On the other hand, for my personal needs I care less about standards but
> about current functionality and data loss because of missing
> transcription altogether due to the BZ #2872.

I read this as saying that you are open to a solution which is inspired by
some standards but does not implement them fully due to our technical
limitations.

19.11.2018 10:21 Egor Kobylkin <egor@kobylkin.com> wrote:
> [...]
> Marko,
> 
> Your example only covers _transliteration_ to Latin Diacritics
> [...]
> while BZ #2872 is about _transcription_ to ASCII
> [...]
> 
> So again, you are asking to have ISO 9.1995. System A but the bug is
> about ISO 9.1995. System B (GOST 7.79-2000)

It's hard to say what the original bug reporter meant but I think that the
problem is that there is no transliteration from Cyrillic to any variant of
Latin, except in a few locales.  If System A were implemented but System B
were not, then at least some characters would be handled correctly.
Currently no Cyrillic characters are handled.

19.11.2018 20:35 Marko Myllynen <myllynen@redhat.com> wrote:
> [...]
> In any case once your patch lands I'm going to submit a follow-up patch
> for fi_FI to make it compliant with the applicable national standard
> (SFS 4900) which defines how to do Cyrillic transliteration /
> transcription in the context of Finnish.

I totally agree.  As far as I can see, SFS 4900 is more similar to
System A (ISO 9) rather than System B, that is, it transliterates to Latin
characters with diacritics rather than plain ASCII.  Marko, what is your
opinion about possible implementation of SFS 4900 in these cases:

* When the destination charset does not contain required Latin diacritic
  characters (e.g., it is plain ASCII)?
* When the output is ambiguous, that means, when two different Cyrillic
  strings produce the same Latin (or ASCII) output?

At the moment I am not so much interested in SFS 4900 itself, but we are
facing the same problems now with ISO 9 and GOST 7.79.

1.12.2018 23:07 Rafal Luzynski <digitalfreak@lingonborough.com> wrote:
> [...]
> $ echo ХА ха | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
> KHA kha
> 
> This means that the choice whether a digraph in the output should be
> all uppercase or maybe upper+lower is context based, something which we
> probably cannot implement.  But definitely a good thing.

I forgot to include this test which is really interesting:

$ echo ХА Ха ха | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN    
KHA Kha kha

which again confirms that the choice of all uppercase or just the first
letter uppercased is context based, a thing which we can't implement now.

1.12.2018 23:53 Egor Kobylkin <egor@kobylkin.com> wrote:
> 
> On 01.12.18 23:07, Rafal Luzynski wrote:
> > 
> > [...]
> > This makes me think: should we add a locale like ru_RU@SystemA or
> > ru_RU@SystemB?
> 
> Wouldn't it require to create 3 versions of every locale that would
> include the translit_cyrillic file then? I.e. en_US + en_US@SystemA,
> en_US@SystemB etc.?

OK, please read this as another brainstorming idea and let's just
forget it.

> [...]
> An example from my experience as a user - a networked device or host
> would often have the en_US as the default (only?) locale with no viable
> way to change it or install cyrillic fonts. Anyway, this is the most
> dire situation where the ASCII transliteration certainly helps most.
> Having en_US@SystemA or en_US@SystemB theoretically available but not
> compiled by the distributor wouldn't help here, would it?
> 
> So the only useful scenario here would be to ship your locales with the
> transliteration already included by default in en_US. This way the
> distributor won't have to get active to include transliteration as
> en_US@SystemA or en_US@SystemB.

With the idea of "@SystemA" and "@SystemB" dropped, I don't think
implementing any solution in glibc would be helpful for your use case.
Three reasons:

1. I believe that sooner or later someone will develop a transliteration
   system for en_US which will follow English transliteration of Russian
   instead of any standard we are discussing here.  That means, it would
   transliterate 'Х' as 'Kh' rather than 'H' or 'X'.
2. Currently there is a trend not to install even en_US locales and leave
   only C which is hardcoded into glibc binaries.  OTOH, I wouldn't mind
   if ISO 9 was hardcoded into C as well.
3. This goes beyond the Russian language, but transliteration according to
   Serbian, Bulgarian, Ukrainian, or Kazakh rules still requires installing
   their proper locales.  I think that requiring ru_RU to be installed could
   be reasonable, especially if we end up with ru_RU somehow differing from
   the default "translit_cyrillic".

BTW you don't need Cyrillic fonts to be installed on your server in order
to process the Cyrillic text correctly unless your server renders the text.

3.12.2018 23:19 Egor Kobylkin <egor@kobylkin.com> wrote:
> 
> Rafal,
> 
> Just to touch base on this, what is the best way forward? Did you get
> any input/feedback on your questions below? Are you expecting input from
> anyone but myself?

Yes, I expected some input from more experienced maintainers about whether
and how to write the tests but I'd rather start another thread about it
because this one is too long already.

> On the blocking issue #2: I really don’t see the connection to the uk_UA
> locale that has its transliteration table inline and is explicitly
> excluded from my patch. It may be revealing  another issue you have with
> glibc but wouldn’t that be better addressed in a new bug?

OK, I was not precise enough (I'm sorry about it) so I'd like to explain
here:

1. As a long-term goal I would like to convert those excluded locales
   to use your translit_cyrillic as well.
2. In order to ensure that the change is not destructive for them I will
   need automatic tests to prove that their transliteration rules work
   equally well before and after the change.
3. It does not matter that converting those other locales is in the distant
   future because we need the same tests for the Russian language now.
4. Even though I have not started writing any tests I can see they
   will be failing for uk_UA.  The reason is that glibc transliteration
   rules can handle transliterating single characters into single
   characters and single characters into multiple characters, but not
   multiple characters into multiple (or even single) characters.
5. We can ignore uk_UA but we will face the same case in ru_RU where
   you had a case of 'У́ ' ('У' + 'COMBINING ACUTE ACCENT').
6. So the question was: how (and whether) to write the tests if we
   already know they would be failing?  Skip them?  Resolve the other
   issue first?  Mark them as XFAIL?
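
The limitation in point 4 can be modelled in a few lines of Python.  This is
a sketch of the general idea only, not glibc's code: the rule table, the
output chosen for the accented letter, and the names per_character and
longest_match are all hypothetical.

```python
ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

# Hypothetical rules: one single-character source, one two-character source.
RULES = {"У": "U", "У" + ACUTE: "U"}


def per_character(text):
    """Look up one code point at a time; multi-character sources never match."""
    return "".join(RULES.get(ch, "?") for ch in text)


def longest_match(text):
    """Try the longest source sequence first, then shorter ones."""
    out, i = [], 0
    max_len = max(len(src) for src in RULES)
    while i < len(text):
        for n in range(max_len, 0, -1):
            chunk = text[i:i + n]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += n
                break
        else:
            out.append("?")  # nothing matched at this position
            i += 1
    return "".join(out)


accented = "У" + ACUTE
print(per_character(accented), longest_match(accented))
```

A test written against the second behaviour would fail against the first,
which is the situation the XFAIL question is about.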

In the meantime, you have removed the controversial conversion rule
of 'У' with the acute accent:

> Again, in the v10 of my patch I have removed multicharacter source
> graphemes, so that issue is moot there.

so we can move to the next step.

> If you’d like to overhaul the glibc translit system wouldn’t it be
> better to commit the simple text file with the Cyrillic
> translit(transcription) table first, fix the bug from the year 2006 and
> then proceed from there all due diligence?

I agree and we are now one step forward.

> The same with having both System A and System B.  Initially I went along
> with the suggestion to include the system A but it is clear now that it
> doesn’t make fixing [BZ #2872] more straightforward. So I’d also propose
> to set it aside for the moment and use the v10 without the system A.
> That is the whole reason I have submitted it, to be superclear on that.

OK, I think that now I understand your reason to drop System A better.
But still I'd like to rethink implementing System A somehow and drop
(or rather: implement only partially) System B.

> Now you saw that uconv is transcribing «ХА» as KHA (cap/cap/cap) that
> should mitigate your concern about that issue too (somewhat, anyway).
> Making it context based would also be about adding new code, see above.

It would also require changes to the syntax of the locale data source
files and would possibly break POSIX compatibility, which I think would
be unacceptable.

> Let me know if there’s anything I can help with getting more progress
> with the decision

I'm afraid you can't help more.  I'd like to hear some feedback from other
people.  Due to some minor obstacles we can't resolve this issue with only
the two of us here.

Regards,

Rafal

Follow-Ups:
- Re: [PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
  - From: Marko Myllynen

References:
- Re: [PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
  - From: Rafal Luzynski
- Re: [PATCH v9] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
  - From: Egor Kobylkin