This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH] [BZ 14094] Update locale data to Unicode 7.0.0
- From: "Joseph S. Myers" <joseph at codesourcery dot com>
- To: Pravin Satpute <psatpute at redhat dot com>
- Cc: <libc-alpha at sourceware dot org>, Carlos O'Donell <carlos at redhat dot com>
- Date: Sat, 21 Jun 2014 21:04:30 +0000
- Authentication-results: sourceware.org; auth=none
- References: <53A5DCA3 dot 4010108 at redhat dot com>
On Sun, 22 Jun 2014, Pravin Satpute wrote:
> Hi All,
> Attached patch to fix this long pending issue.
Thanks for working on this issue.
> A. Process for updating locales/i18n ctype with new Unicode release is
> documented @ , I think it should get added either in WIKI, or docs
> folder of glibc.
The process should ideally be running a single command - no manual editing
at all. (That command might be a script that wraps some other commands.)
If tempted to write instructions for running a sequence of commands and
editing the result, writing a script to automate that is better.
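A minimal sketch of such a wrapper, assuming the update can be expressed as an ordered list of commands. The script and file names in STEPS are placeholders for illustration, not the actual commands from the patch:

```python
import subprocess
import sys

# Hypothetical step list: these script names and arguments are placeholders,
# not the actual commands used to regenerate locales/i18n.
STEPS = [
    [sys.executable, "gen-ctype.py", "UnicodeData.txt", "DerivedCoreProperties.txt"],
    [sys.executable, "check-ctype-compat.py", "i18n", "i18n.new"],
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order and stop at the first failure.

    Returns 0 if every step succeeded, otherwise the exit status of the
    first step that failed.  `runner` is injectable so the logic can be
    exercised without spawning processes.
    """
    for argv in steps:
        result = runner(argv)
        if result.returncode != 0:
            return result.returncode
    return 0
```

Calling sys.exit(run_steps(STEPS)) from the script's entry point would then make the whole update a single command with a meaningful exit status.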
> B. Patch adds two scripts to scripts folder and updates locales/i18n file
> gen-unicode-ctype-dcp.py - To generate upper, lower and alpha class
> from DerivedCoreProperties.txt 
> check-backcompatibility.py - to test whether updated locales/i18n
> is backward compatible with older one.
> C. Best way to check new updated i18n file is compatible with existing
> i18n file is
> a. copy new i18n file as a i18nnew
> b. check-backcompatibility.py i18n i18nnew > Report
> D. By using better file DerivedCoreProperties.txt from UCD for
> generating CTYPE, we found number of characters were improperly mapped
> to 'alpha' categories.
> Report/Analysis for backward compatibility is available AT
> backward-compatibility5_1-to-7_0 
That report is a very useful starting point, but doesn't seem to explain
things at the human level. What changes have there been to previously
supported characters, and why, in terms of Unicode character properties,
are those changes correct changes? Maybe something more verbose that
names the characters individually and states what the old ctype
information was, and what the new information is, and what the relevant
Unicode properties are that explain the new information, would help.
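One possible shape for such a per-character report is sketched below. It assumes the old and new class data have already been parsed into sets of code points (the parsing is not shown), and it uses Python's unicodedata module for names and General_Category, which reflects whatever Unicode version the Python build ships rather than the UCD files used for the patch:

```python
import unicodedata


def describe_changes(old_classes, new_classes):
    """Yield one line per code point whose ctype class membership changed.

    old_classes and new_classes map a class name (for example "upper",
    "lower", "alpha") to a set of integer code points.
    """
    names = sorted(set(old_classes) | set(new_classes))
    changed = set()
    for name in names:
        # Symmetric difference: code points added to or removed from a class.
        changed |= old_classes.get(name, set()) ^ new_classes.get(name, set())
    for cp in sorted(changed):
        ch = chr(cp)
        old = [n for n in names if cp in old_classes.get(n, set())]
        new = [n for n in names if cp in new_classes.get(n, set())]
        yield "U+%04X %s (General_Category=%s): %s -> %s" % (
            cp,
            unicodedata.name(ch, "<unnamed>"),
            unicodedata.category(ch),
            ",".join(old) or "none",
            ",".join(new) or "none",
        )
```

Each output line names the character, shows one relevant Unicode property, and gives the old and new class lists side by side. A real report would presumably also print the DerivedCoreProperties.txt entries that justify each change.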
You're changing how upper/lower/alpha properties are generated. Does that
fix bug 14010? If so, you can include [BZ #14010] in your ChangeLog
entry. Does it obsolete the special cases in
gen-unicode-ctype.c:is_alpha? If so, you should remove the parts of
gen-unicode-ctype.c that are no longer used. You should also confirm that
each of the special cases there is properly handled by the new logic - or
state explicitly that the handling of certain identified characters with
special cases is being deliberately changed, because the Unicode
properties for those characters are better than the special-case handling.
> diff --git a/include/stdc-predef.h b/include/stdc-predef.h
> index 87e3666..f96d308 100644
> --- a/include/stdc-predef.h
> +++ b/include/stdc-predef.h
> @@ -50,8 +50,9 @@
> /* wchar_t uses ISO/IEC 10646 (2nd ed., published 2011-03-15) /
> - Unicode 6.0. */
> -#define __STDC_ISO_10646__ 201103L
> + Unicode 6.0.
> + Unicode 7.0.0 Published on 2014 June 16 */
> +#define __STDC_ISO_10646__ 201406L
The date is meant to correspond to ISO/IEC 10646 publication dates, not
Unicode publication dates.
Now, the most recent published amendment is amendment 1 from 2013-04-15
(Linear A, Palmyrene, Manichaean, Khojki, Khudawadi, Bassa Vah, Duployan,
and other characters). WG2 N4566 states an intent for Unicode 7.0 to
synchronize with amendment 2 to the 2012 edition of ISO/IEC 10646.
However, I can't locate a proposed publication date for that amendment (or
for the 2014 edition of ISO/IEC 10646 - and work appears to be underway on
amendments 1 and 2 to the 2014 edition, even before it's published). So
maybe put 201304L there until such an amendment is published.
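For reference, the macro value has the form yyyymmL, taken from the year and month of the ISO/IEC 10646 edition or amendment. A small sketch of that mapping (the function name is made up for illustration):

```python
import datetime


def stdc_iso_10646_value(publication_date):
    """Format an ISO/IEC 10646 publication date as a yyyymmL macro value."""
    return "%04d%02dL" % (publication_date.year, publication_date.month)


# Amendment 1 to the 2012 edition, published 2013-04-15.
print(stdc_iso_10646_value(datetime.date(2013, 4, 15)))  # prints 201304L
```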
> diff --git a/scripts/check-backcompatibility.py b/scripts/check-backcompatibility.py
> new file mode 100755
> index 0000000..a56ac0a
> --- /dev/null
> +++ b/scripts/check-backcompatibility.py
I think in scripts/ the name should be more specific about *what* is
having compatibility checked - scripts/ is for all of glibc, not just
> +# Copyright (C) 2013-14, Pravin Satpute <firstname.lastname@example.org>
glibc contributions should be assigned to the FSF (and miscellaneous
programs would normally be GPLv2+ / LGPLv2.1+ unless there is some reason
to deviate from the norm for such programs in glibc).
Joseph S. Myers