This is the mail archive of the libc-locales@sourceware.org mailing list for the GNU libc locales project.


Re: Help needed reviewing Cyrillic -> ASCII transliteration [BZ #2872]

From: Siddhesh Poyarekar <siddhesh at gotplt dot org>
To: Egor Kobylkin <egor at kobylkin dot com>, Rafal Luzynski <digitalfreak at lingonborough dot com>, Carlos O'Donell <carlos at redhat dot com>, libc-alpha at sourceware dot org, libc-locales at sourceware dot org
Date: Thu, 3 Jan 2019 19:11:09 +0530
Subject: Re: Help needed reviewing Cyrillic -> ASCII transliteration [BZ #2872]
References: <abf0875c-a9e7-0867-4f2a-67265c36f091@kobylkin.com>

On 03/01/19 4:52 PM, Egor Kobylkin wrote:

Is there a specific way you measure the bloat of the C locale?
Is it the size of the resulting libc.so.6 file we are concerned with?

I believe it's built into libc.so, so I suppose you'd have to look at its file size.

In terms of the source code we are just adding as many lines as there
are letters (169 insertions for Cyrillic in this patch v12)

Yeah, it will likely not be much for a single locale, but it may add up across locales. I have no idea how much; it may well be insignificant.

Just for clarification, the whole point (at least for me) for this patch
is to have the transliteration when other methods are not available. Or
when existing programs/systems can not make use of them. The most basic
example: filenames in Cyrillic on a NAS that get converted to
????????.??? and get overwritten in the worst case. So the most value is
when it works out of the box with the C builtin. Other locales can actually
implement their own variant and explicitly use it if they need one; some
already have, others may be just fine with the builtin C.
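
The filename scenario above can be sketched roughly as follows. This is only an illustration: it uses Python's own "replace" error handler as a stand-in for a lossy ASCII conversion, not glibc's iconv, and the two filenames are made up for the example.

```python
def ascii_with_placeholders(name: str) -> str:
    """Replace every non-ASCII character with '?' (a stand-in for a lossy converter)."""
    return name.encode("ascii", errors="replace").decode("ascii")


# Two distinct, hypothetical Cyrillic filenames with the same number of letters.
names = ["привет.txt", "письмо.txt"]
converted = [ascii_with_placeholders(n) for n in names]

# Both names collapse to the same placeholder string, so writing the second
# file under its converted name would overwrite the first.
collides = len(set(converted)) < len(names)

print(converted)
print("collision:", collides)
```

With a transliteration table in place of the placeholder, the two names would map to different ASCII strings, which is the case the patch is meant to cover.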

That's a fair point, but given the approximation, that specific use case may still be flaky.

Additionally we have a disagreement about how we should handle the
case when a single original uppercase character transliterates
into a digraph in ASCII.  Should both ASCII characters be
uppercase (which is good for all uppercase strings and also good to
emphasize that the original character was single rather than two
separate characters which accidentally transliterate into two
characters making a digraph) or should only the first ASCII
character be uppercase (which is good for the titlecase words which
is common in natural texts)? An example is "Ш" - should it be "SH"
or "Sh"? Note that "Сх" may also produce "Sh" ("S" + "h" -> "Sh").
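
A minimal sketch of the collision described above, assuming two tiny hypothetical mapping tables (three entries each, not the tables from the patch) that differ only in how "Ш" is capitalized:

```python
# Hypothetical three-entry tables for illustration only.
TITLECASE = {"Ш": "Sh", "С": "S", "х": "h"}  # only the first ASCII letter is uppercase
UPPERCASE = {"Ш": "SH", "С": "S", "х": "h"}  # both ASCII letters are uppercase


def transliterate(text: str, table: dict) -> str:
    """Map each character through the table; pass unknown characters through."""
    return "".join(table.get(ch, ch) for ch in text)


# With "Sh", the single letter "Ш" and the pair "Сх" become indistinguishable.
title_collides = transliterate("Ш", TITLECASE) == transliterate("Сх", TITLECASE)

# With "SH", the same two inputs stay distinct.
upper_collides = transliterate("Ш", UPPERCASE) == transliterate("Сх", UPPERCASE)

print("Sh variant collides:", title_collides)
print("SH variant collides:", upper_collides)
```

Note that the "SH" variant only avoids this particular pair. If an uppercase "Х" were mapped to "H" (an assumption, by analogy with "х" -> "h"), the all-uppercase input "СХ" would again produce "SH".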



Is that important?


As in the above example about the files, you would probably agree that
it's better not to knowingly introduce a failure vector for such basic
OS operations like working with files. The transliteration
capitalization collisions have this negative potential. The users that
need a different specific capitalization can still implement that in
their locale.

OK, I can see the reason for reducing collisions, but again, it remains flaky. We could, in the interest of moving forward, strive towards making it less flaky while staying aware that there may eventually be collisions.


Siddhesh

Follow-Ups:
- Re: Help needed reviewing Cyrillic -> ASCII transliteration [BZ #2872]
  - From: Egor Kobylkin

References:
- Re: Help needed reviewing Cyrillic -> ASCII transliteration [BZ #2872]
  - From: Egor Kobylkin
