This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: Improved check-localedef script

From: Mike FABIAN <mfabian at redhat dot com>
To: Zack Weinberg <zackw at panix dot com>
Cc: GNU C Library <libc-alpha at sourceware dot org>, Rafal Luzynski <digitalfreak at lingonborough dot com>
Date: Fri, 04 Aug 2017 08:42:43 +0200
Subject: Re: Improved check-localedef script
References: <CAKCAbMjLN7SMWwveXVokSCttqso+r+1AttpFEpDBdJcSyiuQ4Q@mail.gmail.com>

Zack Weinberg <zackw@panix.com> wrote:

> Here is an improved version of the check-localedef script I posted the
> other week.  It now takes only about 1.5 seconds to process all the
> files in localedata/locales/ (instead of seven seconds with the old
> parser), which is fast enough that I think it would be reasonable to
> run it during 'make check'.  Also, many bugs have been fixed.
> Especially, the "can we encode this string in the charset that the
> file is annotated with" test now actually _runs_...

Great!

> ... and finds dozens and dozens of errors. The full list is attached,
> but here's a small sample:
>
> localedata/locales/ur_PK... (charset: cp1256)
>   localedata/locales/ur_PK:114: string not representable in cp1256:
>       062C 0646 0648 0631 06CC
>   localedata/locales/ur_PK:115: string not representable in cp1256:
>       0641 0631 0648 0631 06CC
>   localedata/locales/ur_PK:117: string not representable in cp1256:
>       0627 067E 0631 06CC 0644
>
> These are the abmon strings, so I think it really would be a problem...

This is the first abmon string:

    abmon	"جنوری";/

The last letter in this string, ی U+06CC ARABIC LETTER FARSI YEH
is not convertible to CP1256.
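
The check can be reproduced outside of glibc with a minimal sketch like the
one below. It assumes that Python's built-in "cp1256" codec agrees with
glibc's CP1256 charmap for these characters, which I have not verified
against the charmap file itself; the helper name `unencodable` is made up
for this illustration and is not part of Zack's script. The three code
point sequences are copied from the check output quoted above.

```python
def unencodable(text, charset):
    """Return the code points in text that charset cannot encode,
    as "U+XXXX" labels in order of first appearance."""
    missing = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            label = f"U+{ord(ch):04X}"
            if label not in missing:
                missing.append(label)
    return missing


# Code point sequences reported for ur_PK lines 114, 115 and 117.
reported = [
    "062C 0646 0648 0631 06CC",
    "0641 0631 0648 0631 06CC",
    "0627 067E 0631 06CC 0644",
]

for hexes in reported:
    # Turn the space-separated hex values back into a string.
    text = "".join(chr(int(h, 16)) for h in hexes.split())
    print(hexes,
          "| cp1256:", unencodable(text, "cp1256"),
          "| utf-8:", unencodable(text, "utf-8"))
```

Under that assumption, each of the three strings should fail only on U+06CC
in cp1256, and none of them should fail in UTF-8.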

But this letter really does seem to be used in written Urdu, see:

    https://en.wikipedia.org/wiki/Urdu_alphabet
    https://en.wikipedia.org/wiki/Urdu_alphabet#Ye

So I think CP1256 is not a suitable charset to use for Urdu.

    https://en.wikipedia.org/wiki/Windows-1256

says:

Wikipedia> Windows-1256 is a code page used to write Arabic (and possibly some
Wikipedia> other languages that use Arabic script, like Persian and Urdu) under
Wikipedia> Microsoft Windows.

Note the “possibly”.

Wikipedia> [...]
Wikipedia> Unicode and UTF-8 are preferred to Windows 1256 in modern
Wikipedia> applications. 0.1% of all web pages use Windows-1256 in June 2016.

So CP1256 doesn’t seem to be used much anymore.

And we don’t have an Urdu locale in that encoding either; our Urdu
locale is provided only in UTF-8.

So I think we should replace

    % Charset: CP1256

with

    % Charset: UTF-8

in ur_PK.

-- 
Mike FABIAN <mfabian@redhat.com>

Follow-Ups:
- Re: Improved check-localedef script
  - From: Rafal Luzynski
- Re: Improved check-localedef script
  - From: Luis Javier Merino

References:
- Improved check-localedef script
  - From: Zack Weinberg