This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Improved check-localedef script
- From: Mike FABIAN <mfabian at redhat dot com>
- To: Zack Weinberg <zackw at panix dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>, Rafal Luzynski <digitalfreak at lingonborough dot com>
- Date: Fri, 04 Aug 2017 08:42:43 +0200
- Subject: Re: Improved check-localedef script
- References: <CAKCAbMjLN7SMWwveXVokSCttqso+r+1AttpFEpDBdJcSyiuQ4Q@mail.gmail.com>
Zack Weinberg <zackw@panix.com> wrote:
> Here is an improved version of the check-localedef script I posted the
> other week. It now takes only about 1.5 seconds to process all the
> files in localedata/locales/ (instead of seven seconds with the old
> parser), which is fast enough that I think it would be reasonable to
> run it during 'make check'. Also, many bugs have been fixed.
> Especially, the "can we encode this string in the charset that the
> file is annotated with" test now actually _runs_...
Great!
> ... and finds dozens and dozens of errors. The full list is attached,
> but here's a small sample:
>
> localedata/locales/ur_PK... (charset: cp1256)
> localedata/locales/ur_PK:114: string not representable in cp1256:
> 062C 0646 0648 0631 06CC
> localedata/locales/ur_PK:115: string not representable in cp1256:
> 0641 0631 0648 0631 06CC
> localedata/locales/ur_PK:117: string not representable in cp1256:
> 0627 067E 0631 06CC 0644
>
> These are the abmon strings, so I think it really would be a problem...
This is the first abmon string:
abmon "جنوری";/
The last letter in this string, ی (U+06CC ARABIC LETTER FARSI YEH),
is not convertible to CP1256.
But this letter really does seem to be used in writing Urdu, see:
https://en.wikipedia.org/wiki/Urdu_alphabet
https://en.wikipedia.org/wiki/Urdu_alphabet#Ye
So I think CP1256 is not a suitable charset to use for Urdu.
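The U+06CC problem can be reproduced outside the check-localedef script.
Below is a minimal sketch in Python, not taken from Zack's script: the
helper name unencodable is made up for illustration, and it relies on
Python's own cp1256 codec, which I assume agrees with glibc's CP1256
charmap for these characters. The code points are the ones from the
ur_PK:114 line quoted above.

```python
def unencodable(text, charset):
    """Return the code points in `text` that `charset` cannot encode.

    Hypothetical helper for illustration; it uses Python's codec for
    `charset`, which is assumed to match the glibc charmap of that name.
    """
    missing = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            missing.append("U+%04X" % ord(ch))
    return missing


# First abmon string of ur_PK (062C 0646 0648 0631 06CC).
abmon_jan = "\u062C\u0646\u0648\u0631\u06CC"

print("cp1256:", unencodable(abmon_jan, "cp1256"))
print("utf-8: ", unencodable(abmon_jan, "utf-8"))
```

If the assumption about the codec holds, only U+06CC should be listed for
cp1256 and nothing for UTF-8.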
https://en.wikipedia.org/wiki/Windows-1256
says:
Wikipedia> Windows-1256 is a code page used to write Arabic (and possibly some
Wikipedia> other languages that use Arabic script, like Persian and Urdu) under
Wikipedia> Microsoft Windows.
Note the “possibly”.
Wikipedia> [...]
Wikipedia> Unicode and UTF-8 are preferred to Windows 1256 in modern
Wikipedia> applications. 0.1% of all web pages use Windows-1256 in June 2016.
So CP1256 doesn’t seem to be used much anymore.
And we don’t have an Urdu locale in that encoding either; our Urdu
locale uses only UTF-8.
So I think we should replace
% Charset: CP1256
with
% Charset: UTF-8
in ur_PK.
--
Mike FABIAN <mfabian@redhat.com>