This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH v5] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]

From: Egor Kobylkin <egor at kobylkin dot com>
To: Marko Myllynen <myllynen at redhat dot com>, Rafal Luzynski <digitalfreak at lingonborough dot com>, libc-alpha at sourceware dot org, libc-locales at sourceware dot org
Cc: mfabian at redhat dot com, "Dmitry V. Levin" <ldv at altlinux dot org>, Volodymyr Lisivka <vlisivka at gmail dot com>, Max Kutny <mkutny at gmail dot com>, danilo at gnome dot org
Date: Mon, 15 Oct 2018 13:54:53 +0200
Subject: Re: [PATCH v5] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
References: <41532e13-a63d-5df1-ab37-05eb4d6c8d0a@kobylkin.com> <20180412224352.GB2911@altlinux.org> <d5582688-819b-90c2-3f4a-0d19c932d487@kobylkin.com> <165238610.582597.1539392357757@poczta.nazwa.pl> <e072a70c-9962-4087-93c2-06ec3c9a0b1f@kobylkin.com> <1374aef3-4c16-b9cd-49a6-b6da9b1a9eeb@redhat.com>

On 15.10.2018 13:04, Marko Myllynen wrote:
> Hi,
> 
> On 2018-10-13 19:58, Egor Kobylkin wrote:
>> On 13.10.2018 02:59, Rafal Luzynski wrote:
>>
>>> Regarding the tests, I think there is no complete transliteration 
>>> test suite at the moment.  Probably the only test is 
>>> localedata/bug-iconv-trans.c. You can also see the collation tests 
>>> placed in the same directory, they use those multiple *.UTF-8.in 
>>> files.
>>>
>>> You can skip the tests for now.
>>
>> First I thought they could just be added, but not all locales
>> transliterate umlauts, so just extending the current test won't do as it
>> will fail for those locales.
> 
> I still think a one-time check against uconv(1) (part of Unicode's ICU
> project) for discrepancies would be worthwhile.

Just an addition. I have changed a few constants to see whether
localedata/bug-iconv-trans.c could be made to test Cyrillic. Attached is
bug-iconv-trans-cyr.c, which passes in this form. I had to save it as
UTF-8 rather than the ISO-8859-15 used for localedata/bug-iconv-trans.c.
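
For checking individual strings outside the test harness, the same
iconv calls can be wrapped in a small helper. The following is only a
sketch under stated assumptions: the name translit_to_ascii is
hypothetical and not part of the patch, and the result depends entirely
on the LC_CTYPE transliteration rules of whatever locale the caller has
selected with setlocale beforehand.

```c
#include <assert.h>
#include <iconv.h>
#include <string.h>

/* Hypothetical helper, not part of the patch: transliterate the
   NUL-terminated UTF-8 string IN into OUT (at most OUTSIZE bytes,
   including the terminating NUL) using the transliteration rules of
   the current locale.  Returns the number of non-reversible
   conversions reported by iconv, or -1 on any failure.  */
long
translit_to_ascii (const char *in, char *out, size_t outsize)
{
  assert (outsize > 0);

  iconv_t cd = iconv_open ("ASCII//TRANSLIT", "UTF-8");
  if (cd == (iconv_t) -1)
    return -1;

  char *inptr = (char *) in;
  size_t inlen = strlen (in);
  char *outptr = out;
  size_t outlen = outsize - 1;	/* Reserve room for the NUL.  */

  size_t n = iconv (cd, &inptr, &inlen, &outptr, &outlen);
  iconv_close (cd);

  /* Treat conversion errors and unconsumed input as failure.  */
  if (n == (size_t) -1 || inlen != 0)
    return -1;

  *outptr = '\0';
  return (long) n;
}
```

A caller would first select a locale, for example with
setlocale (LC_ALL, "de_DE.UTF-8") as the attached test does, and then
compare the contents of OUT against the expected ASCII string.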

>>>> [...]
>>>> diff -uNr a/localedata/locales/am_ET b/localedata/locales/am_ET
>>>> --- a/localedata/locales/am_ET 2018-10-11 15:10:11.000000000 +0000
>>>> +++ b/localedata/locales/am_ET 2018-10-11 15:10:43.000000000 +0000
>>>> @@ -1394,6 +1394,7 @@
>>>>  <U137A> <U0060><U0039><U0030>
>>>>  <U137B> <U0060><U0031><U0030><U0030>
>>>>  <U137C> <U0060><U0031><U0030><U0030><U0030><U0030>
>>>> +include "translit_cyrillic";""
>>>>  translit_end
>>>>  % END LC_CTYPE
>>>
>>> Shouldn't “include "translit_cyrillic";""” be placed before the 
>>> custom rules, together with other includes?  The same in more files, 
>>> I will not mention them all.
>>
>> If I recall correctly it is because of the
>> "translit_end
>> END LC_CTYPE"
>> part at the end of the translit_cyrillic. This way it works for any
>> locale, regardless whether it has translit itself or not. And being at
>> the end it does not supersede any previous transliteration that may be
>> there for a reason.
> 
> I suspect one problem would be that the latter rule wins, so if there
> are some locale-specific rules, then possible translit_* inclusions would
> override them if not included before the locale-specific rules.

What is the best way forward here? Can somebody make an explicit
suggestion on how to change the current approach if needed?

Bests,
Egor

#include <iconv.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int
main (void)
{
  iconv_t cd;
  const char str[] = "CyrillicLetters_Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð?Ð Ð¡Ð¢Ð£Ð£Ì?Ð¤Ð¥Ð¦Ð§Ð¨Ð©ÐªÐ«Ð¬ÐÐ®Ð¯Ð°Ð±Ð²Ð³Ð´ÐµÐ¶Ð·Ð¸Ð¹ÐºÐ»Ð¼Ð½Ð¾Ð¿Ñ?Ñ?Ñ?Ñ?Ñ?Ì?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?Ñ?ÑªÑ«Ñ²Ñ³Ñ´ÑµÒ?Ò?Ò?Ò?Ò?Ò?Ò?Ò?Ò?Ò?Ò?Ò?Ò?Ò?Ò¢Ò£Ò¤Ò¥Ò¦Ò§Ò¨Ò©ÒªÒ«Ò¬ÒÒ®Ò¯Ò²Ò³Ò´ÒµÒºÒ»Ò¼Ò½Ò¾Ò¿Ó?Ó?Ó?Ó?Ó?Ó?Ó?Ó?Ó?Ó?Ó?Ó?Ó?Ó?Ó?Ó?Ó?Ó Ó¡Ó¤Ó¥Ó¦Ó§Ó¨Ó©Ó°Ó±Ó²Ó³Ó´ÓµÓ¸Ó¹â??";
  const char expected[] = "CyrillicLetters_YODJG`YEZ`IYIJL`N`TSHK`U`DHABVGDEZHZIJKLMNOPRSTUUFHCCHSHSHHA`Y`E`YUYAabvgdezhzijklmnoprstuufhcchshshh``y`e`yuyayodjg`yez`iyijl`n`tshk`u`dhO`o`FHfhYHyhE`e`G`g`GHghGHghZH`zh`K`k`K`k`N`n`NGngP`p`O`o`C`C`T`t`UuH`h`TCZtczSH`SH`CH`ch`CH`ch`iZH`zh`CH`ch`A`a`A`a`E`e`A`a`ZH`zh`Z`z`Z`z`I`i`O`o`O`o`U`u`U`u`CH`ch`Y`y`'";
  char *inptr = (char *) str;
  size_t inlen = strlen (str) + 1;
  char outbuf[500];
  char *outptr = outbuf;
  size_t outlen = sizeof (outbuf);
  int result = 0;
  size_t n;

  if (setlocale (LC_ALL, "de_DE.UTF-8") == NULL)
    {
      puts ("setlocale failed");
      return 1;
    }

  cd = iconv_open ("ANSI_X3.4-1968//TRANSLIT", "UTF-8");
  if (cd == (iconv_t) -1)
    {
      puts ("iconv_open failed");
      return 1;
    }

  n = iconv (cd, &inptr, &inlen, &outptr, &outlen);
  if (n != 174)
    {
      if (n == (size_t) -1)
	printf ("iconv() returned error: %m\n");
      else
	printf ("iconv() returned %zu, expected 174\n", n);
      result = 1;
    }
  if (inlen != 0)
    {
      puts ("not all input consumed");
      result = 1;
    }
  else if (inptr - str != strlen (str) + 1)
    {
      printf ("inptr wrong, advanced by %td\n", inptr - str);
      result = 1;
    }
  if (memcmp (outbuf, expected, sizeof (expected)) != 0)
    {
      printf ("result wrong: \"%.*s\", expected: \"%s\"\n",
	      (int) (sizeof (outbuf) - outlen), outbuf, expected);
      result = 1;
    }
  else if (outlen != sizeof (outbuf) - sizeof (expected))
    {
      printf ("outlen wrong: %zu, expected %zu\n", outlen,
	      sizeof (outbuf) - sizeof (expected));
      result = 1;
    }
  else
    printf ("output is \"%s\" which is OK\n", outbuf);

  return result;
}
