This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: RFC: locale-source validation script
- From: Zack Weinberg <zackw at panix dot com>
- To: Mike FABIAN <mfabian at redhat dot com>
- Cc: Andreas Schwab <schwab at suse dot de>, GNU C Library <libc-alpha at sourceware dot org>, Mike FABIAN <maiku dot fabian at gmail dot com>, Rafal Luzynski <digitalfreak at lingonborough dot com>, "Carlos O'Donell" <carlos at redhat dot com>, Florian Weimer <fweimer at redhat dot com>
- Date: Wed, 26 Jul 2017 11:00:04 -0400
- Subject: Re: RFC: locale-source validation script
- References: <CAKCAbMj54Dq5gAvKrC1JXLU9fknoDVjbtK0iJutMvqhUOsjVhA@mail.gmail.com> <mvmk22vvdmn.fsf@suse.de> <s9dinifcod0.fsf@redhat.com>
On Wed, Jul 26, 2017 at 9:35 AM, Mike FABIAN <mfabian@redhat.com> wrote:
> Andreas Schwab <schwab@suse.de> wrote:
>
>> On Jul 25 2017, Zack Weinberg <zackw@panix.com> wrote:
>>
>>> - There are quite a few strings that aren't NFC and I suspect it's
>>> going to take expert knowledge of the languages involved to tell if
>>> that's desirable.
>>
>> I don't think NFC or not has anything to do with the language.
>
> I think not all occurrences of non NFC are necessarily an error,
> for example de_DE contains:
>
> LC_CTYPE
> copy "i18n"
>
> translit_start
>
> include "translit_combining";""
>
> % German umlauts.
> % LATIN CAPITAL LETTER A WITH DIAERESIS.
> <U00C4> "<U0041><U0308>";"<U0041><U0045>"
>           ^^^^^^^^^^^^^^^^ NFD, but this is apparently on purpose

Right, this is the sort of thing I was thinking of, where we want to
make sure to treat NFC and NFD forms of the same construct the same
for classification or collation or whatever. Another case is

  localedata/locales/as_IN:140: string not normalized:
    source: 09B9 09DF
       nfc: 09B9 09AF 09BC

That's the 'yesstr'. U+09DF is BENGALI LETTER YYA, which decomposes
canonically to U+09AF BENGALI LETTER YA plus U+09BC BENGALI SIGN NUKTA
but is excluded from composition, so NFC keeps the decomposed form.
The composed and decomposed forms render the same
on _my_ terminal, but maybe they don't on the terminals that the
community of Assamese speakers tends to use, or the decomposed form
doesn't convert properly to whatever the legacy encoding for this
language is (there is no %Charset: annotation in this file).
Regardless, we can't change it without doing some research first.
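
As a rough illustration of the check being discussed (a sketch only, not
the validation script itself), both cases can be reproduced with Python's
standard unicodedata module. The helper names check_nfc and hex_codepoints
are made up for this example.

```python
import unicodedata


def hex_codepoints(s):
    """Format a string as space-separated uppercase hex code points."""
    return " ".join(f"{ord(c):04X}" for c in s)


def check_nfc(s):
    """Return None if s is already NFC, else a (source, nfc) pair of hex dumps."""
    nfc = unicodedata.normalize("NFC", s)
    if nfc == s:
        return None
    return hex_codepoints(s), hex_codepoints(nfc)


# The as_IN yesstr from the report above: U+09B9 U+09DF.
print(check_nfc("\u09b9\u09df"))   # ('09B9 09DF', '09B9 09AF 09BC')

# The decomposed spelling of A WITH DIAERESIS from the de_DE translit rule.
print(check_nfc("\u0041\u0308"))   # ('0041 0308', '00C4')

# The precomposed U+00C4 is already NFC, so nothing is reported.
print(check_nfc("\u00c4"))         # None
```

A check like this only says that a string differs from its NFC form. It
cannot tell whether the difference is deliberate, as in the de_DE
transliteration rule, which is why each hit still needs a human look.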
zw