This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: Improved check-localedef script

From: Mike FABIAN <mfabian at redhat dot com>
To: Rafal Luzynski <digitalfreak at lingonborough dot com>
Cc: GNU C Library <libc-alpha at sourceware dot org>, Zack Weinberg <zackw at panix dot com>
Date: Fri, 04 Aug 2017 11:32:08 +0200
Subject: Re: Improved check-localedef script
References: <CAKCAbMjLN7SMWwveXVokSCttqso+r+1AttpFEpDBdJcSyiuQ4Q@mail.gmail.com> <616436394.552204.1501838084650@poczta.nazwa.pl>

Rafal Luzynski <digitalfreak@lingonborough.com> wrote:

>> localedata/locales/cs_CZ... (charset: iso8859-2)
>>   localedata/locales/cs_CZ:477: string not representable in iso8859-2:
>>       00C6 00C6
>>   localedata/locales/cs_CZ:478: string not representable in iso8859-2:
>>       00C6 00C6
>> [cut the rest]
>
> These are the collating tables.  Necessary for UTF-8 but I'm not sure
> what to do with them in 8-bit charset.

The cs_CZ collation tables contain many characters that do not occur
in the Czech language.

Line 477 has Æ U+00C6:

<U00C6>	"<U0041><U0045>";"<U00C6><U00C6>";"<CAPITAL><CAPITAL>";"<U0041><U0045>"

I am surprised that the script doesn’t print even more warnings.
For example, it doesn’t print warnings for these:

% katakana/hiragana sorting
% base is katakana, as this is present in most charsets
% normal before voiced before semi-voiced
% small vocals before normal vocals
% katakana before hiragana

<U30A1>	<U30A1>;<U30A1>;IGNORE;<U30A1>
...
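
To illustrate why I would expect more warnings, here is a minimal sketch of
a naive per-symbol check. It assumes that "representable" simply means that
each <Uxxxx> symbol on a line can be encoded in the target charset. The
function name `unrepresentable` and the regular expression are made up for
this example and are not taken from the check-localedef script, which may
work differently:

```python
import re

# Matches <Uxxxx> symbols (4 to 8 hex digits) as written in locale sources.
# Symbolic names such as <CAPITAL> are deliberately not matched.
UCS_SYMBOL = re.compile(r"<U([0-9A-Fa-f]{4,8})>")


def unrepresentable(line, charset):
    """Return the code points (as hex strings) on `line` that `charset`
    cannot encode, in order of appearance."""
    missing = []
    for match in UCS_SYMBOL.finditer(line):
        char = chr(int(match.group(1), 16))
        try:
            char.encode(charset)
        except UnicodeEncodeError:
            missing.append("%04X" % ord(char))
    return missing


# The two lines quoted above from localedata/locales/cs_CZ.
line_477 = ('<U00C6>\t"<U0041><U0045>";"<U00C6><U00C6>";'
            '"<CAPITAL><CAPITAL>";"<U0041><U0045>"')
katakana = "<U30A1>\t<U30A1>;<U30A1>;IGNORE;<U30A1>"

print(unrepresentable(line_477, "iso8859-2"))
print(unrepresentable(katakana, "iso8859-2"))
print(unrepresentable(line_477, "utf-8"))
```

Under these assumptions both lines are flagged for iso8859-2 and neither is
flagged for utf-8.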

The cs_CZ source file is already UTF-8 encoded.

Does that mean that we should replace 

“% Charset: ISO_8859-2:1987” with “% Charset: UTF-8”?

I am now a bit confused about what this “% Charset” comment is
supposed to mean.

Does it really indicate the encoding used to write the source file?
Or does it mean something else?
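
One way to probe this empirically is to compare the declared charset with
what the bytes of the file actually look like. The following is only a
sketch: the helper name `describe_encoding` is hypothetical, and it assumes
the comment always has the form “% Charset: NAME” at the start of a line.
Note that every byte sequence decodes without error as ISO-8859-2, so the
sketch tests for ASCII and UTF-8 instead:

```python
import re

# Assumed form of the comment: "% Charset: NAME" at the start of a line.
CHARSET_COMMENT = re.compile(rb"^% Charset:[ \t]*([!-~]+)", re.MULTILINE)


def describe_encoding(data):
    """Return (declared charset or None, byte-level classification)."""
    match = CHARSET_COMMENT.search(data)
    declared = match.group(1).decode("ascii") if match else None
    if data.isascii():
        actual = "ascii"
    else:
        try:
            data.decode("utf-8")
            actual = "utf-8"
        except UnicodeDecodeError:
            actual = "not utf-8"
    return declared, actual


# A made-up two-line sample, not the real cs_CZ file.
sample = "% Charset: ISO_8859-2:1987\n% Æ\n".encode("utf-8")
print(describe_encoding(sample))
```

For a real file one would pass the result of
open(path, "rb").read() instead of the made-up sample.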

> I think the conversion scripts
> should skip the unrepresentable characters.

-- 
Mike FABIAN <mfabian@redhat.com>

Follow-Ups:
- Re: Improved check-localedef script
  - From: Zack Weinberg

References:
- Improved check-localedef script
  - From: Zack Weinberg
- Re: Improved check-localedef script
  - From: Rafal Luzynski
