This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: Improved check-localedef script
On Fri, Aug 4, 2017 at 5:14 AM, Rafal Luzynski
<digitalfreak@lingonborough.com> wrote:
> 3.08.2017 23:17 Zack Weinberg <zackw@panix.com> wrote:
>> localedata/locales/br_FR... (charset: iso8859-1)
>> localedata/locales/br_FR:122: string not representable in iso8859-1:
>> 006D 0065 0072 0063 02BC 0068 0065 0072
>> [...]
>
> Most probably this is because of <U02BC> which is a Unicode apostrophe.
> In order to be representable in iso8859-1 it needs to be converted
> to an ASCII apostrophe <U0027>. Can we please have this in the conversion
> script? This is really necessary as br_FR must be converted to both
> UTF-8 and ISO 8859-1.
Just to clarify, what I have written is not intended to be any sort of
_conversion_ script. It is intended to be a _test_ script, which
will, once we get all the errors ironed out, run as part of "make
check" to ensure that new encoding-related mistakes do not appear in
the locales.
Now, what I think you're trying to say here is that it is okay to use
<U02BC> in br_FR because, when localedef generates the legacy
iso-8859-1 version of the locale, it will transliterate that to the
ASCII apostrophe. Unfortunately, Python's codecs (as of 3.6) have no
equivalent of the //translit mechanism in glibc's iconv, so I don't
(right now, anyway) see any way the script could know that. I'm open
to suggestions.
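One thing that might work: Python does let you register a custom
encode-error handler, so the script could approximate //translit with a
hand-maintained substitution table. The catch is that the table would
have to duplicate glibc's locale transliteration data, which is exactly
what the script doesn't have. A sketch (the handler name and the
two-entry table are my invention, not anything glibc or Python
provides):

```python
import codecs

# Hypothetical hand-maintained table; a real check would need to
# mirror glibc's full transliteration data, which is the hard part.
TRANSLIT = {
    "\u02BC": "'",  # MODIFIER LETTER APOSTROPHE -> ASCII apostrophe
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK -> ASCII apostrophe
}

def translit_errors(exc):
    # Called with the slice of unencodable characters; substitute
    # from the table, or re-raise if any character is unknown.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    chunk = exc.object[exc.start:exc.end]
    if all(c in TRANSLIT for c in chunk):
        return ("".join(TRANSLIT[c] for c in chunk), exc.end)
    raise exc

codecs.register_error("glibc-translit", translit_errors)

# The br_FR string from the error message above (merc'her with U+02BC):
print("merc\u02BCher".encode("iso-8859-1", errors="glibc-translit"))
# -> b"merc'her"
```

But maintaining that table by hand, in parallel with the locale
sources, seems fragile.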
>> localedata/locales/da_DK... (charset: iso8859-1)
>> localedata/locales/da_DK:145: string not representable in iso8859-1:
>> 0041 0308
>
> This is a false positive: 0308 is a combining diaeresis character, so
> 0041 0308 produces A with diaeresis (Ä), which is representable in
> iso8859-1 as C4. Even the standalone diaeresis is representable as A8.
This is a similar issue. Python's codecs will not attempt to
renormalize a character sequence before encoding it.
>>> "\u00C4".encode("iso-8859-1")
b'\xc4'
>>> unicodedata.normalize("NFC", "\u0041\u0308").encode("iso-8859-1")
b'\xc4'
>>> "\u0041\u0308".encode("iso-8859-1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0308' in position 1: ordinal not in range(256)
Perhaps I should go back to throwing errors on all non-NFC strings? I
changed the script to allow NFD as well because it seemed like at
least some instances of NFD were intentional, but...
zw