This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: Improved check-localedef script

From: Rafal Luzynski <digitalfreak at lingonborough dot com>
To: Mike FABIAN <mfabian at redhat dot com>, Zack Weinberg <zackw at panix dot com>
Cc: GNU C Library <libc-alpha at sourceware dot org>
Date: Fri, 4 Aug 2017 10:06:47 +0200 (CEST)
Subject: Re: Improved check-localedef script
Authentication-results: sourceware.org; auth=none
References: <CAKCAbMjLN7SMWwveXVokSCttqso+r+1AttpFEpDBdJcSyiuQ4Q@mail.gmail.com> <s9dfud7j0kc.fsf@redhat.com>
Reply-to: Rafal Luzynski <digitalfreak at lingonborough dot com>

4.08.2017 08:42 Mike FABIAN <mfabian@redhat.com> wrote:
>
>
> Zack Weinberg <zackw@panix.com> wrote:
>
> [...]
> > ... and finds dozens and dozens of errors. The full list is attached,
> > but here's a small sample:
> >
> > localedata/locales/ur_PK... (charset: cp1256)
> > localedata/locales/ur_PK:114: string not representable in cp1256:
> > 062C 0646 0648 0631 06CC
> > localedata/locales/ur_PK:115: string not representable in cp1256:
> > 0641 0631 0648 0631 06CC
> > localedata/locales/ur_PK:117: string not representable in cp1256:
> > 0627 067E 0631 06CC 0644
> >
> > These are the abmon strings, so I think it really would be a problem...
>
> This is the first abmon string:
>
> abmon "جنوری";/
>
> The last letter in this string, ی U+06CC ARABIC LETTER FARSI YEH
> is not convertible to CP1256.
> [...]

This "% Charset: CP1256" line is just a comment.  Is it used anywhere?  I
don't think so.  I think the localedata/SUPPORTED file is what matters, and
it requires ur_PK (and ur_IN as well) to be compiled for UTF-8 only.
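
To illustrate the false positive, here is a minimal Python sketch (not part
of the check-localedef script) that tests the code points reported for
ur_PK:114 against a charset.  The helper name `unrepresentable` is made up
for this example, and Python's built-in "cp1256" codec is assumed to be a
fair stand-in for the glibc CP1256 charmap.

```python
# Code points reported for ur_PK:114 in the checker output quoted above.
CODE_POINTS = [0x062C, 0x0646, 0x0648, 0x0631, 0x06CC]


def unrepresentable(text, charset):
    """Return the characters of `text` that `charset` cannot encode.

    Hypothetical helper for illustration only; it relies on Python's
    codec tables, not on the glibc charmap files.
    """
    bad = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


abmon = "".join(chr(cp) for cp in CODE_POINTS)

for charset in ("cp1256", "utf-8"):
    bad = unrepresentable(abmon, charset)
    print(charset, [f"U+{ord(ch):04X}" for ch in bad])
```

With Python's codecs this should report only U+06CC for cp1256 and nothing
for utf-8, which matches what Mike describes.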

> [...]
> So I think we should replace
>
> % Charset: CP1256
>
> with
>
> % Charset: UTF-8
>
> in ur_PK.

The file is currently pure 7-bit ASCII.  Do we need this line
at all?  What about removing it?  If it should not be removed, then
let's consider ASCII; UTF-8 is only needed where ASCII cannot be used.
Strictly speaking, CP1256 is also true, because the file uses only ASCII,
which is a common subset of many other charsets.  The only problem is
that CP1256 is misleading and causes those false positives.
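
A rough Python sketch of why the file can be pure ASCII and still describe
Urdu text: it assumes the abmon entry is spelled in <UXXXX> symbolic
notation.  The SOURCE_LINE below is reconstructed from the code points in
the checker output, not copied from ur_PK, and `decode_ucs_symbols` is a
made-up helper name.

```python
import re

# Assumed spelling of the first abmon entry in <UXXXX> notation.
# Reconstructed from the reported code points, not copied from the file.
SOURCE_LINE = 'abmon "<U062C><U0646><U0648><U0631><U06CC>";/'


def decode_ucs_symbols(line):
    """Replace each <UXXXX> symbol with the character it names."""
    return re.sub(
        r"<U([0-9A-Fa-f]{4,8})>",
        lambda m: chr(int(m.group(1), 16)),
        line,
    )


# The source line itself is plain 7-bit ASCII; this raises if it is not.
SOURCE_LINE.encode("ascii")

decoded = decode_ucs_symbols(SOURCE_LINE)
print(decoded)

# The decoded string is what has to be representable in the target
# charset, and that is where CP1256 fails while UTF-8 does not.
for charset in ("cp1256", "utf-8"):
    try:
        decoded.encode(charset)
        print(charset, "ok")
    except UnicodeEncodeError as err:
        print(charset, "fails on", f"U+{ord(decoded[err.start]):04X}")
```

So the charset of the source file and the charset the strings must be
convertible to are two different questions, and the comment line only ever
described the first one.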

TL;DR: my suggestions are (in the order of my preference):

- remove this line,
- replace with "% Charset: ASCII",
- replace with "% Charset: UTF-8",
- leave unchanged,
- feel free to post your own suggestion.

Regards,

Rafal

Follow-Ups:
- Re: Improved check-localedef script
  - From: Mike FABIAN

References:
- Improved check-localedef script
  - From: Zack Weinberg
- Re: Improved check-localedef script
  - From: Mike FABIAN
