This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: Improved check-localedef script


Rafal Luzynski <digitalfreak@lingonborough.com> wrote:

>> localedata/locales/cs_CZ... (charset: iso8859-2)
>>   localedata/locales/cs_CZ:477: string not representable in iso8859-2:
>>       00C6 00C6
>>   localedata/locales/cs_CZ:478: string not representable in iso8859-2:
>>       00C6 00C6
>> [cut the rest]
>
> These are the collating tables.  Necessary for UTF-8 but I'm not sure
> what to do with them in 8-bit charset.

The cs_CZ collation tables contain many characters not from the Czech
language.

Line 477 has Æ U+00C6:

<U00C6>	"<U0041><U0045>";"<U00C6><U00C6>";"<CAPITAL><CAPITAL>";"<U0041><U0045>"
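
A minimal sketch of the kind of check behind that warning, assuming the script simply tries to encode each <UXXXX> symbol in the target charset. The function name `unrepresentable` and the regular expression are hypothetical, for illustration only; the real check-localedef script may work differently.

```python
import re

# Matches UCS symbolic names such as <U00C6>; symbols like <CAPITAL> are skipped.
UCS_SYMBOL = re.compile(r"<U([0-9A-Fa-f]{4,8})>")

def unrepresentable(line, charset):
    """Return the distinct characters in `line` that `charset` cannot encode."""
    missing = []
    for hex_digits in UCS_SYMBOL.findall(line):
        char = chr(int(hex_digits, 16))
        try:
            char.encode(charset)
        except UnicodeEncodeError:
            if char not in missing:
                missing.append(char)
    return missing

# Line 477 of localedata/locales/cs_CZ, as quoted above.
line_477 = '<U00C6>\t"<U0041><U0045>";"<U00C6><U00C6>";"<CAPITAL><CAPITAL>";"<U0041><U0045>"'
print([f"U+{ord(c):04X}" for c in unrepresentable(line_477, "iso8859-2")])
# prints ['U+00C6']
```

Only U+00C6 is flagged, because ISO 8859-2 has no Æ, while U+0041 and U+0045 are plain ASCII.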

I am surprised that the script doesn’t print even more warnings.
For example, it doesn’t print warnings for these:

% katakana/hiragana sorting
% base is katakana, as this is present in most charsets
% normal before voiced before semi-voiced
% small vocals before normal vocals
% katakana before hiragana

<U30A1>	<U30A1>;<U30A1>;IGNORE;<U30A1>
...

The cs_CZ source file is already UTF-8 encoded.

Does that mean that we should replace 

“% Charset: ISO_8859-2:1987” with “% Charset: UTF-8”?

I am a bit confused now what this “% Charset” comment is supposed to
mean.

Does it really indicate the encoding used to write the source file?
Or does it mean something else?

> I think the conversion scripts
> should skip the unrepresentable characters.

-- 
Mike FABIAN <mfabian@redhat.com>

