This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Improved check-localedef script
- From: Mike FABIAN <mfabian at redhat dot com>
- To: Rafal Luzynski <digitalfreak at lingonborough dot com>
- Cc: Zack Weinberg <zackw at panix dot com>, GNU C Library <libc-alpha at sourceware dot org>
- Date: Fri, 04 Aug 2017 11:50:00 +0200
- Subject: Re: Improved check-localedef script
- References: <CAKCAbMjLN7SMWwveXVokSCttqso+r+1AttpFEpDBdJcSyiuQ4Q@mail.gmail.com> <s9d60e3bspn.fsf@redhat.com> <26692227.553011.1501838716734@poczta.nazwa.pl>
Rafal Luzynski <digitalfreak@lingonborough.com> wrote:
> 4.08.2017 11:14 Mike FABIAN <mfabian@redhat.com> wrote:
>> But even though U+20AC cannot be converted to ISO-8859-1, the
>> ca_ES.ISO-8859-1 locale still works because it is transliterated:
>>
>> $ LC_ALL=ca_ES locale -k currency_symbol charmap
>> currency_symbol="EUR"
>> charmap="ISO-8859-1"
>>
>> So this does not cause an actual problem.
>
> So the "€" character is actually representable in ISO-8859-1 because
> we can convert it to "EUR". Looks like a false positive then.
Yes.
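
To illustrate the point, here is a minimal Python sketch. It is not how
glibc implements transliteration; the TRANSLIT table and the
encode_with_translit helper are hypothetical names made up for this
example. It only shows that U+20AC has no ISO-8859-1 encoding of its
own, while an ASCII fallback such as "EUR" encodes without trouble:

```python
# Hypothetical fallback table for illustration only; glibc keeps its
# real transliteration data in its locale sources, not in a dict.
TRANSLIT = {"\u20ac": "EUR"}


def encode_with_translit(text, charset):
    """Encode text, substituting a fallback string for unencodable characters."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            fallback = TRANSLIT.get(ch)
            if fallback is None:
                raise  # no fallback known for this character
            out += fallback.encode(charset)
    return bytes(out)


def is_directly_encodable(ch, charset):
    """Return True if ch can be encoded in charset without any fallback."""
    try:
        ch.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


euro_direct = is_directly_encodable("\u20ac", "iso-8859-1")
euro_bytes = encode_with_translit("\u20ac", "iso-8859-1")
```

Here euro_direct is False, and euro_bytes holds the three ASCII bytes of
"EUR", which matches the currency_symbol output quoted above.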
>> The ca_ES source file is not ASCII, it has
>>
>> % català
>> lang_name "<U0063><U0061><U0074><U0061><U006C><U00E0>"
>>
>> So maybe I could just convert the file to UTF-8
>> and change “% Charset: ISO-8859-1” into “% Charset: UTF-8”
>> to get rid of the check-localedef warning.
>>
>> Would that be OK?
>
> No, I don't think that's OK. If I understand correctly, the
> "source file is ASCII" statement means that the individual characters
> '<', 'U', '2', '0', 'A', 'C', '>' are ASCII.
Yes.
> They may describe something more complex like <U00E0>. But even this
> is not UTF-8 because UTF-8 would be <C3> <A0> (UTF-8 is 8-bit). The
> closest charset would be UCS-2 or simply a generic Unicode.
My understanding at the moment is that the “% Charset: ...” comment
indicates the encoding used to write the source file. So something like
“<U20AC>” is definitely ASCII. Non-ASCII stuff in locale source files
seems to exist only in comments at the moment.
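
A small Python sketch of that distinction, under the assumption that
symbolic names always have the form <U> followed by hex digits (the
expand_symbols helper is a hypothetical name, not part of
check-localedef or localedef). The source line itself consists only of
ASCII characters, and the non-ASCII character appears only once the
symbolic names are expanded:

```python
import re

# Matches symbolic names such as <U20AC> or <U00E0>.
SYMBOL = re.compile(r"<U([0-9A-Fa-f]{4,8})>")


def expand_symbols(s):
    """Replace each <UXXXX> symbolic name with the character it names."""
    return SYMBOL.sub(lambda m: chr(int(m.group(1), 16)), s)


line = 'lang_name "<U0063><U0061><U0074><U0061><U006C><U00E0>"'

# Every character of the source line is in the ASCII range.
source_is_ascii = all(ord(c) < 128 for c in line)

# The expanded value contains U+00E0, which is not ASCII.
expanded = expand_symbols(line)

# For comparison, the UTF-8 encoding of U+00E0 is the two bytes C3 A0.
utf8_bytes = "\u00e0".encode("utf-8")
```

With this input, source_is_ascii is True, expanded is
'lang_name "català"', and utf8_bytes is b"\xc3\xa0".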
--
Mike FABIAN <mfabian@redhat.com>