This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Improved check-localedef script
- From: Mike FABIAN <mfabian at redhat dot com>
- To: Rafal Luzynski <digitalfreak at lingonborough dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>, Zack Weinberg <zackw at panix dot com>
- Date: Fri, 04 Aug 2017 11:32:08 +0200
- Subject: Re: Improved check-localedef script
- References: <CAKCAbMjLN7SMWwveXVokSCttqso+r+1AttpFEpDBdJcSyiuQ4Q@mail.gmail.com> <616436394.552204.1501838084650@poczta.nazwa.pl>
Rafal Luzynski <digitalfreak@lingonborough.com> wrote:
>> localedata/locales/cs_CZ... (charset: iso8859-2)
>> localedata/locales/cs_CZ:477: string not representable in iso8859-2:
>> 00C6 00C6
>> localedata/locales/cs_CZ:478: string not representable in iso8859-2:
>> 00C6 00C6
>> [cut the rest]
>
> These are the collating tables. Necessary for UTF-8 but I'm not sure
> what to do with them in 8-bit charset.
The cs_CZ collation tables contain many characters that are not used in
the Czech language.
Line 477 is the entry for Æ (U+00C6):
<U00C6> "<U0041><U0045>";"<U00C6><U00C6>";"<CAPITAL><CAPITAL>";"<U0041><U0045>"
I am surprised that the script doesn’t print even more warnings.
For example, it doesn’t print any for these lines:
% katakana/hiragana sorting
% base is katakana, as this is present in most charsets
% normal before voiced before semi-voiced
% small vocals before normal vocals
% katakana before hiragana
<U30A1> <U30A1>;<U30A1>;IGNORE;<U30A1>
...
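To illustrate why I would expect warnings for both kinds of lines, here
is a minimal Python sketch. It is not the code of the check-localedef
script; the helper name "unrepresentable" is made up for this example.
It extracts the <Uxxxx> symbols from a line and tries to encode each
one in the target charset:

```python
import re

# Matches the <Uxxxx> symbolic names used in locale source files.
UCS_SYMBOL = re.compile(r"<U([0-9A-Fa-f]{4,8})>")


def unrepresentable(line, charset):
    """Return the code points named on `line` that `charset` cannot encode.

    Hypothetical helper for illustration only; each code point is
    reported once, in order of first appearance.
    """
    missing = []
    for hexval in UCS_SYMBOL.findall(line):
        name = "U+" + hexval.upper()
        try:
            chr(int(hexval, 16)).encode(charset)
        except UnicodeEncodeError:
            if name not in missing:
                missing.append(name)
    return missing


line_477 = ('<U00C6> "<U0041><U0045>";"<U00C6><U00C6>";'
            '"<CAPITAL><CAPITAL>";"<U0041><U0045>"')
katakana = "<U30A1> <U30A1>;<U30A1>;IGNORE;<U30A1>"

print(unrepresentable(line_477, "iso8859-2"))  # ['U+00C6']
print(unrepresentable(katakana, "iso8859-2"))  # ['U+30A1']
print(unrepresentable(line_477, "utf-8"))      # []
```

With this naive check, neither Æ nor the katakana letter can be encoded
in ISO 8859-2, so both lines look equally affected.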
The cs_CZ source file is already UTF-8 encoded.
Does that mean that we should replace
“% Charset: ISO_8859-2:1987” with “% Charset: UTF-8”?
I am now a bit confused about what this “% Charset” comment is supposed
to mean.
Does it really indicate the encoding used to write the source file?
Or does it mean something else?
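One way to compare what the comment claims with what the file actually
contains is sketched below in Python. This is only an assumption about
how one might check it: the function name "describe_source" is
hypothetical, and the only thing taken from the locale sources is the
“% Charset: ...” comment format quoted above.

```python
import re

# Matches a "% Charset: <name>" comment line, as quoted above.
CHARSET_COMMENT = re.compile(rb"^%[ \t]*Charset:[ \t]*(\S+)", re.MULTILINE)


def describe_source(data):
    """Return (declared, actual) for the raw bytes of a locale source.

    `declared` is the value of the "% Charset:" comment, or None.
    `actual` is "ascii", "utf-8" or "other", judged from the bytes.
    Hypothetical helper for illustration only.
    """
    match = CHARSET_COMMENT.search(data)
    declared = match.group(1).decode("ascii", "replace") if match else None
    if data.isascii():
        actual = "ascii"
    else:
        try:
            data.decode("utf-8")
            actual = "utf-8"
        except UnicodeDecodeError:
            actual = "other"
    return declared, actual


# Made-up sample: the comment says ISO 8859-2, but the bytes are UTF-8.
sample = "% Charset: ISO_8859-2:1987\n% Æ\n".encode("utf-8")
print(describe_source(sample))  # ('ISO_8859-2:1987', 'utf-8')
```

A file that writes every character as a <Uxxxx> symbol would come out
as "ascii" here, in which case the comment cannot be describing the
bytes of the file in any meaningful way.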
> I think the conversion scripts
> should skip the unrepresentable characters.
--
Mike FABIAN <mfabian@redhat.com>