This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [PATCH v10] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]

From: Egor Kobylkin <egor at kobylkin dot com>
To: libc-alpha at sourceware dot org, libc-locales at sourceware dot org, "Dmitry V. Levin" <ldv at altlinux dot org>, Marko Myllynen <myllynen at redhat dot com>, mfabian at redhat dot com
Date: Thu, 20 Dec 2018 00:02:21 +0100
Subject: Re: [PATCH v10] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com> <20180412224352.GB2911@altlinux.org> <676c37bd-ba92-a7ed-019e-94974143233f@kobylkin.com> <1718190635.706992.1544225756803@poczta.nazwa.pl> <a9ca708e-f2af-14f3-9871-df213580882d@kobylkin.com> <749726562.674232.1545259279320@poczta.nazwa.pl>

On 19.12.18 23:41, Rafal Luzynski wrote:
> 8.12.2018 22:51 Egor Kobylkin <egor@kobylkin.com> wrote:
>>
>> Rafal, Dmitry, Marko, Mike
>>
>> On 08.12.18 00:35, Rafal Luzynski wrote:
>>> 19.11.2018 12:10 Egor Kobylkin <egor@kobylkin.com> wrote:
>>>>
>>>> Changelog v10: * Removed ISO 9.1995 GOST 7.79-2000 System A
>>>> (transliteration to Latin with diacritics) as conflicting with
>>>> System B within glibc mechanics and not solving BZ #2872
>>>
>>> I'm in favor of implementing System A and dropping System B instead.
>>
>> The BZ #2872 bug name is explicitly "Transliteration Cyrillic -> ASCII
>> fails". The ISO 9 System A does not map to ASCII so it is not a solution
>> to BZ #2872 at all.
> 
> I did not mean implementing System A and nothing more.  I meant implementing
> System A and a fallback for ASCII which can be similar to System B but
> we wouldn't be able to call it "System B" because it would differ in
> few cases.

Just for the record, I have no objection to that (using System A as
the basis for the ASCII fallback as well).

But I'm no longer sure that inserting a translit table into every
locale is the right solution to the ASCII problem, especially because
distributions may ship no locale other than C.

> 
>> I was scratching my head as to how can we avoid the explosion of the
>> scope for this patch. And then it appeared to me that it was wrong to
>> target all the present locales for the ASCII translit. This seems to be
>> the root cause for this prolonged A vs. B discussions. The proper target
>> for my table is actually the C locale translit file
>> (locale/C-translit.h.in). I will submit a proper patch shortly.
> 
> I saw your patch v11 and now I must say I'm sorry for making noise because
> it was me who said that I didn't mind adding Cyrillic -> ASCII
> transliteration to C locale.  I said so before taking a look at the current
> contents of transliteration in C locale.  When I looked at this I realized
> that it does not support any national characters, even from modified Latin
> alphabets (like used in most of western European languages).  It only
> contains mathematical, physical, commercial, diacritical etc. characters.
> So I'm no longer sure it should support Cyrillic -> ASCII.  But maybe again
> I'm wrong, maybe it should support but just nobody implemented it yet.

Actually there are quite a few letters already transliterated in
locale/C-translit.h.in. (Note the CAPCAP transliteration style for the
capitals, i.e. LATIN CAPITAL LETTER AE is mapped to AE, not to Ae.)

"\x00c6"	"AE"	/* <U00C6> LATIN CAPITAL LETTER AE */
"\x00d7"	"x"	/* <U00D7> MULTIPLICATION SIGN */
"\x00df"	"ss"	/* <U00DF> LATIN SMALL LETTER SHARP S */
"\x00e6"	"ae"	/* <U00E6> LATIN SMALL LETTER AE */
"\x0132"	"IJ"	/* <U0132> LATIN CAPITAL LIGATURE IJ */
"\x0133"	"ij"	/* <U0133> LATIN SMALL LIGATURE IJ */
"\x0149"	"'n"	/* <U0149> LATIN SMALL LETTER N PRECEDED BY APOSTROPHE */
"\x0152"	"OE"	/* <U0152> LATIN CAPITAL LIGATURE OE */
"\x0153"	"oe"	/* <U0153> LATIN SMALL LIGATURE OE */
"\x017f"	"s"	/* <U017F> LATIN SMALL LETTER LONG S */
"\x01c7"	"LJ"	/* <U01C7> LATIN CAPITAL LETTER LJ */
"\x01c8"	"Lj"	/* <U01C8> LATIN CAPITAL LETTER L WITH SMALL LETTER J */
"\x01c9"	"lj"	/* <U01C9> LATIN SMALL LETTER LJ */
"\x01ca"	"NJ"	/* <U01CA> LATIN CAPITAL LETTER NJ */
"\x01cb"	"Nj"	/* <U01CB> LATIN CAPITAL LETTER N WITH SMALL LETTER J */
"\x01cc"	"nj"	/* <U01CC> LATIN SMALL LETTER NJ */
"\x01f1"	"DZ"	/* <U01F1> LATIN CAPITAL LETTER DZ */
"\x01f2"	"Dz"	/* <U01F2> LATIN CAPITAL LETTER D WITH SMALL LETTER Z */
"\x01f3"	"dz"	/* <U01F3> LATIN SMALL LETTER DZ */

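To make the shape of such a table concrete, here is a rough sketch of
how a sorted code point -> ASCII table like the excerpt above could be
searched. This is only my illustration, not how glibc actually consumes
C-translit.h.in: the names translit_entry, lookup_translit and
translit_string are made up, and the table holds just a few of the
entries quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical entry type: one code point and its ASCII replacement. */
struct translit_entry {
    unsigned int code;   /* Unicode code point */
    const char *ascii;   /* ASCII replacement string */
};

/* A few entries copied from the excerpt; must stay sorted by code. */
static const struct translit_entry table[] = {
    { 0x00C6, "AE" },  /* LATIN CAPITAL LETTER AE */
    { 0x00D7, "x"  },  /* MULTIPLICATION SIGN */
    { 0x00DF, "ss" },  /* LATIN SMALL LETTER SHARP S */
    { 0x00E6, "ae" },  /* LATIN SMALL LETTER AE */
    { 0x0132, "IJ" },  /* LATIN CAPITAL LIGATURE IJ */
    { 0x0133, "ij" },  /* LATIN SMALL LIGATURE IJ */
    { 0x01C7, "LJ" },  /* LATIN CAPITAL LETTER LJ */
    { 0x01C8, "Lj" },  /* LATIN CAPITAL LETTER L WITH SMALL LETTER J */
    { 0x01C9, "lj" },  /* LATIN SMALL LETTER LJ */
};

/* bsearch comparator: the key is a code point, the element a table entry. */
static int compare_entry(const void *key, const void *elem)
{
    unsigned int code = *(const unsigned int *) key;
    const struct translit_entry *e = elem;
    return (code > e->code) - (code < e->code);
}

/* Return the ASCII replacement for code, or NULL if it is not in the table. */
static const char *lookup_translit(unsigned int code)
{
    const struct translit_entry *e =
        bsearch(&code, table, sizeof table / sizeof table[0],
                sizeof table[0], compare_entry);
    return e ? e->ascii : NULL;
}

/* Transliterate n code points into out (NUL-terminated).  ASCII passes
   through, table hits are replaced, anything else becomes "?".  Stops
   early if out would overflow.  Returns the number of bytes written. */
static size_t translit_string(const unsigned int *in, size_t n,
                              char *out, size_t outsz)
{
    size_t used = 0;
    assert(out != NULL && outsz > 0);
    for (size_t i = 0; i < n; i++) {
        char single[2] = { 0, 0 };
        const char *rep;
        if (in[i] < 0x80) {
            single[0] = (char) in[i];
            rep = single;
        } else {
            rep = lookup_translit(in[i]);
            if (rep == NULL)
                rep = "?";
        }
        size_t len = strlen(rep);
        if (used + len >= outsz)
            break;
        memcpy(out + used, rep, len);
        used += len;
    }
    out[used] = '\0';
    return used;
}
```

With this sketch U+00C6 comes out as "AE" (the CAPCAP style), while a
Cyrillic letter such as U+0416 has no entry and falls back to "?".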

>> My focus is super sharp on helping with Cyrillic -> ASCII translit
>> availability for a default installation with glibc.
> 
> I understand your aim and I agree to support ASCII.  Our disagreements are:
> 
> * whether to support conversion Cyrillic -> extended Latin as well,
no contest on my side
> * which standard to implement,
no contest on my side
> * what to do if the standard is ambiguous or if some details cannot be
>   implemented for technical reasons.
no contest on my side either

I just think we can sidestep all of those decisions by starting with a
smaller, pure-ASCII patch (which is also more useful if it covers the C
locale).