This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH v5] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
On 15.10.2018 13:04, Marko Myllynen wrote:
> Hi,
>
> On 2018-10-13 19:58, Egor Kobylkin wrote:
>> On 13.10.2018 02:59, Rafal Luzynski wrote:
>>
>>> Regarding the tests, I think there is no complete transliteration
>>> test suite at the moment. Probably the only test is
>>> localedata/bug-iconv-trans.c. You can also see the collation tests
>>> placed in the same directory, they use those multiple *.UTF-8.in
>>> files.
>>>
>>> You can skip the tests for now.
>>
>> First I thought they could just be added, but not all locales
>> transliterate umlauts, so just extending the current test won't do, as it
>> would fail for those locales.
>
> I still think a one-time check against uconv(1) (part of Unicode's ICU
> project) for discrepancies would be worthwhile.
Just an addition: I have changed a few constants to see whether
localedata/bug-iconv-trans.c could be made to test Cyrillic. Attached is
bug-iconv-trans-cyr.c, which passes in this form. I had to save
it as UTF-8 rather than the ISO-8859-15 used for localedata/bug-iconv-trans.c.
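
A side note on the encoding dependence mentioned above: one way to make such a test independent of how the source file is saved is to spell the UTF-8 input with hex escapes. The sketch below is only an illustration and not part of the attached test; the names cyr_abv, count_code_points and ESCAPE_DEMO_MAIN are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical example input: U+0410 U+0411 U+0412 spelled as UTF-8
   bytes, so the compiler sees the same bytes regardless of the
   encoding the source file is saved in.  */
static const char cyr_abv[] = "\xD0\x90\xD0\x91\xD0\x92";

/* Count code points in a valid UTF-8 string by skipping continuation
   bytes (those of the form 10xxxxxx).  */
static size_t
count_code_points (const char *s)
{
  size_t n = 0;
  for (; *s != '\0'; ++s)
    if (((unsigned char) *s & 0xC0) != 0x80)
      ++n;
  return n;
}

#ifdef ESCAPE_DEMO_MAIN
/* Optional driver; build with -DESCAPE_DEMO_MAIN to run the checks.  */
int
main (void)
{
  assert (strlen (cyr_abv) == 6);
  assert (count_code_points (cyr_abv) == 3);
  return 0;
}
#endif
```

Each escape is followed by a backslash or the closing quote, so no hex escape accidentally absorbs a following character.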
>>>> [...]
>>>> diff -uNr a/localedata/locales/am_ET b/localedata/locales/am_ET
>>>> --- a/localedata/locales/am_ET 2018-10-11 15:10:11.000000000 +0000
>>>> +++ b/localedata/locales/am_ET 2018-10-11 15:10:43.000000000 +0000
>>>> @@ -1394,6 +1394,7 @@
>>>>  <U137A> <U0060><U0039><U0030>
>>>>  <U137B> <U0060><U0031><U0030><U0030>
>>>>  <U137C> <U0060><U0031><U0030><U0030><U0030><U0030>
>>>> +include "translit_cyrillic";""
>>>>  translit_end
>>>>  % END LC_CTYPE
>>>
>>> Shouldn't “include "translit_cyrillic";""” be placed before the
>>> custom rules, together with other includes? The same in more files,
>>> I will not mention them all.
>>
>> If I recall correctly it is because of the
>> "translit_end
>> END LC_CTYPE"
>> part at the end of translit_cyrillic. This way it works for any
>> locale, regardless of whether it has translit rules itself or not. And being
>> at the end, it does not supersede any previous transliteration that may be
>> there for a reason.
>
> I suspect one problem would be that the latter rule wins, so if there
> are some locale-specific rules, then any translit_* inclusions would
> override them unless they are included before the locale-specific rules.
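
To make the ordering concern concrete, here is a toy model in C. It is not glibc's locale loader and makes no claim about which policy glibc actually implements; struct rule, lookup_first, lookup_last, the rule contents and ORDER_DEMO_MAIN are all hypothetical. It only shows that when the same character has both a locale-specific rule and an included rule, the result depends on whether the first or the last matching rule is used, and hence on where the include sits.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy transliteration rule: one code point mapped to an ASCII string.  */
struct rule
{
  unsigned int from;
  const char *to;
};

/* Return the replacement from the first matching rule, or NULL.  */
static const char *
lookup_first (const struct rule *rules, size_t n, unsigned int c)
{
  for (size_t i = 0; i < n; ++i)
    if (rules[i].from == c)
      return rules[i].to;
  return NULL;
}

/* Return the replacement from the last matching rule, or NULL.  */
static const char *
lookup_last (const struct rule *rules, size_t n, unsigned int c)
{
  const char *found = NULL;
  for (size_t i = 0; i < n; ++i)
    if (rules[i].from == c)
      found = rules[i].to;
  return found;
}

/* Hypothetical rule list: a locale-specific rule for U+0416 followed by
   the rule that an include placed at the end would contribute.  */
static const struct rule rules_include_last[] =
{
  { 0x0416, "Z" },   /* made-up locale-specific rule */
  { 0x0416, "ZH" },  /* made-up included rule */
};

#ifdef ORDER_DEMO_MAIN
/* Optional driver; build with -DORDER_DEMO_MAIN to run the checks.  */
int
main (void)
{
  assert (strcmp (lookup_first (rules_include_last, 2, 0x0416), "Z") == 0);
  assert (strcmp (lookup_last (rules_include_last, 2, 0x0416), "ZH") == 0);
  return 0;
}
#endif
```

Under a first-match policy the include at the end leaves the locale-specific rule in force; under a last-match policy it overrides it. Which of the two applies to the real locale files is exactly the open question in this thread.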
What is the best way forward here? Can somebody make an explicit
suggestion on how to change the current approach if needed?
Bests,
Egor
#include <iconv.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int
main (void)
{
  iconv_t cd;
const char str[] = "CyrillicLetters_Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?РСТУУÌ?ФХЦЧШЩЪЫЬÐЮЯабвгдежзийклмнопÑ?Ñ?Ñ?Ñ?Ñ?Ì?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?ѪѫѲѳѴѵÒ?Ò?Ò?Ò?Ò?Ò?Ò?Ò?Ò?Ò?Ò?Ò?Ò?Ò?Ò¢Ò£Ò¤Ò¥Ò¦Ò§Ò¨Ò©ÒªÒ«Ò¬ÒÒ®Ò¯Ò²Ò³Ò´ÒµÒºÒ»Ò¼Ò½Ò¾Ò¿Ó?Ó?Ó?Ó?Ó?Ó?Ó?Ó?Ó?Ó?Ó?Ó?Ó?Ó?Ó?Ó?Ó?Ó Ó¡Ó¤Ó¥Ó¦Ó§Ó¨Ó©Ó°Ó±Ó²Ó³Ó´ÓµÓ¸Ó¹â??";
const char expected[] = "CyrillicLetters_YODJG`YEZ`IYIJL`N`TSHK`U`DHABVGDEZHZIJKLMNOPRSTUUFHCCHSHSHHA`Y`E`YUYAabvgdezhzijklmnoprstuufhcchshshh``y`e`yuyayodjg`yez`iyijl`n`tshk`u`dhO`o`FHfhYHyhE`e`G`g`GHghGHghZH`zh`K`k`K`k`N`n`NGngP`p`O`o`C`C`T`t`UuH`h`TCZtczSH`SH`CH`ch`CH`ch`iZH`zh`CH`ch`A`a`A`a`E`e`A`a`ZH`zh`Z`z`Z`z`I`i`O`o`O`o`U`u`U`u`CH`ch`Y`y`'";
  char *inptr = (char *) str;
  size_t inlen = strlen (str) + 1;
  char outbuf[500];
  char *outptr = outbuf;
  size_t outlen = sizeof (outbuf);
  int result = 0;
  size_t n;

  if (setlocale (LC_ALL, "de_DE.UTF-8") == NULL)
    {
      puts ("setlocale failed");
      return 1;
    }

  cd = iconv_open ("ANSI_X3.4-1968//TRANSLIT", "UTF-8");
  if (cd == (iconv_t) -1)
    {
      puts ("iconv_open failed");
      return 1;
    }

  /* iconv returns the number of non-reversible conversions; this test
     expects 174 of them.  */
  n = iconv (cd, &inptr, &inlen, &outptr, &outlen);
  if (n != 174)
    {
      if (n == (size_t) -1)
        printf ("iconv() returned error: %m\n");
      else
        printf ("iconv() returned %zu, expected 174\n", n);
      result = 1;
    }

  /* The whole input, including the terminating NUL, must be consumed.  */
  if (inlen != 0)
    {
      puts ("not all input consumed");
      result = 1;
    }
  else if ((size_t) (inptr - str) != strlen (str) + 1)
    {
      printf ("inptr wrong, advanced by %td\n", inptr - str);
      result = 1;
    }

  /* Compare the output, including the terminating NUL.  */
  if (memcmp (outbuf, expected, sizeof (expected)) != 0)
    {
      printf ("result wrong: \"%.*s\", expected: \"%s\"\n",
              (int) (sizeof (outbuf) - outlen), outbuf, expected);
      result = 1;
    }
  else if (outlen != sizeof (outbuf) - sizeof (expected))
    {
      printf ("outlen wrong: %zu, expected %zu\n", outlen,
              sizeof (outbuf) - sizeof (expected));
      result = 1;
    }
  else
    printf ("output is \"%s\" which is OK\n", outbuf);

  return result;
}