Bug 18568

Summary: Update locale data to Unicode 8.0
Product: glibc
Reporter: Joseph Myers <jsm28>
Component: localedata
Assignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED FIXED
Severity: normal
CC: aoliva, carlos, libc-locales, maiku.fabian, myllynen, pabs3, pravin.d.s
Priority: P2
Flags: fweimer: security-
Version: 2.21
Target Milestone: 2.23
Host:
Target:
Build:
Last reconfirmed:

Description Joseph Myers 2015-06-20 22:53:53 UTC
Now that Unicode 8.0 has been released, the locale data that was updated for Unicode 7.0 (bug 14094, bug 17588) should be updated to 8.0.  Hopefully, given the automation that was developed for the last update, this one should be a lot simpler to do.

ISO/IEC 10646:2014/Amd 1:2015 was published 2015-05-15.  So the __STDC_ISO_10646__ value should be 201505L.
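
For reference, the C standard defines __STDC_ISO_10646__ as an integer constant of the form yyyymmL, giving the year and month of the ISO/IEC 10646 edition (with amendments) that wchar_t values conform to. A minimal sketch of that arithmetic follows, written in Python purely for illustration; the helper name `stdc_iso_10646` is hypothetical and is not part of glibc.

```python
import datetime


def stdc_iso_10646(published):
    """Format a publication date as a C constant of the form yyyymmL."""
    return f"{published.year:04d}{published.month:02d}L"


# ISO/IEC 10646:2014/Amd 1:2015 was published on 2015-05-15.
print(stdc_iso_10646(datetime.date(2015, 5, 15)))  # prints 201505L
```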
Comment 1 Marko Myllynen 2015-06-22 04:55:49 UTC
The following Unicode Blog post comments on the implementation changes required, but I haven't investigated how relevant they are from the glibc perspective.

http://blog.unicode.org/2015/03/unicode-80-beta-review.html

Unicode 8.0.0 comprises several changes which require careful migration in implementations, including the conversion of Cherokee to a bicameral script, a different encoding model for New Tai Lue, and additional character repertoire. Implementers need to change code and check assumptions regarding case mappings, New Tai Lue syllables, Han character ranges, and confusables.
Comment 2 Mike FABIAN 2015-06-22 17:28:38 UTC
I made a patch for this and posted it to the libc-alpha mailing list:

https://sourceware.org/ml/libc-alpha/2015-06/msg00748.html
Comment 3 Mike FABIAN 2015-06-22 17:31:26 UTC
(In reply to Marko Myllynen from comment #1)
> The following Unicode Blog post comments on the implementation changes
> required, but I haven't investigated how relevant they are from the glibc
> perspective.

I think for glibc, this has little impact.
 
> http://blog.unicode.org/2015/03/unicode-80-beta-review.html
> 
> Unicode 8.0.0 comprises several changes which require careful migration in
> implementations, including the conversion of Cherokee to a bicameral script,

That means Cherokee now has upper and lower case, so lots of characters
get added to "toupper", "tolower", and "totitle" in the "i18n" file.
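
As an illustration of the new Cherokee case pairs, here is a small sketch that uses Python's unicodedata module rather than glibc itself. It assumes a Python whose Unicode database is at version 8.0 or later (older databases have no lowercase Cherokee letters); the helper name `codepoints` is hypothetical.

```python
import unicodedata

CHEROKEE_LETTER_A = "\u13a0"  # treated as an uppercase letter since Unicode 8.0


def codepoints(text):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


print("Unicode database version:", unicodedata.unidata_version)
print("name: ", unicodedata.name(CHEROKEE_LETTER_A))
print("lower:", codepoints(CHEROKEE_LETTER_A.lower()))
print("upper:", codepoints(CHEROKEE_LETTER_A.upper()))
```

With a database at 8.0 or later, U+13A0 should lowercase to U+AB70 (CHEROKEE SMALL LETTER A), which is the kind of pair that ends up in "tolower" and "toupper".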

> a different encoding model for New Tai Lue, and additional character
> repertoire.

Many New Tai Lue characters which were combining characters before are
not combining characters anymore.
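
A rough way to see this change is to list the general category of the New Tai Lue vowel signs and tone marks. The sketch below uses Python's unicodedata module, not glibc, and rests on an assumption: a character is counted as "combining" when its general category is a mark (Mn, Mc, or Me). The result depends on the Unicode version of the Python in use.

```python
import unicodedata

# Assumption for this sketch: "combining" means general category Mn, Mc or Me.
MARK_CATEGORIES = {"Mn", "Mc", "Me"}


def is_combining(ch):
    """Return True if the character's general category is a mark."""
    return unicodedata.category(ch) in MARK_CATEGORIES


# U+19B0..U+19C9: New Tai Lue vowel signs, final consonants and tone marks.
for cp in range(0x19B0, 0x19CA):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.category(ch)} combining={is_combining(ch)}")
```

With a Unicode 8.0 or later database, the vowel signs and tone marks in this range should no longer be reported as marks.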

> Implementers need to change code and check assumptions regarding
> case mappings, New Tai Lue syllables, Han character ranges, and confusables.
Comment 4 Paul Wise 2015-06-24 07:42:39 UTC
What about removing the Unicode locale data from the glibc repository and tarball and generating the glibc versions from the Unicode data at build time? A new Unicode release would then need only a rebuild of glibc to include the new data.
Comment 5 jsm-csl@polyomino.org.uk 2015-06-24 14:26:04 UTC
The results of a glibc build and install should depend only on the glibc
source tree (preferably, be byte-for-byte reproducible in different
builds), not on any network resources.  In addition, it's common for
manual review and script changes to be needed as part of a Unicode
update.
Comment 6 Paul Wise 2015-06-26 03:34:04 UTC
That should be possible: just require the Unicode data at build time, the same way glibc requires gcc at build time. The Unicode data is already packaged in some distributions as a unicode-data package, so no network resources are needed.

Depending on the Unicode data instead of embedding copies of it doesn't prevent review and script changes.
Comment 7 Andreas Schwab 2015-06-26 06:26:21 UTC
You are missing the point.  glibc requires a fixed version of the Unicode data, which cannot be updated without updating the glibc sources.
Comment 8 Paul Wise 2015-06-27 03:22:00 UTC
Why is that?
Comment 9 Carlos O'Donell 2015-06-29 20:04:58 UTC
(In reply to Paul Wise from comment #8)
> Why is that?

Stability. We want to test and validate one version of Unicode data and update the implementation to match.
Comment 10 keld@keldix.com 2015-06-29 20:30:14 UTC
On Mon, Jun 29, 2015 at 08:04:58PM +0000, carlos at redhat dot com wrote:
> (In reply to Paul Wise from comment #8)
> > Why is that?
> 
> Stability. We want to test and validate one version of Unicode data and update
> the implementation to match.

I think we should keep LC_CTYPE and LC_COLLATE data in sync.
And then build LC_COLLATE on the latest approved ISO 14651 data.

Best regards
Keld
Comment 11 jsm-csl@polyomino.org.uk 2015-06-29 20:42:48 UTC
Collation is the subject of bug 14095, not this one (and as noted there, 
is probably a lot more work to update because of all the local changes 
over the past 15 years - though maybe once the initial catching up is 
done, future Unicode updates can then update everything together).
Comment 12 Paul Wise 2015-06-30 03:57:28 UTC
On Mon, Jun 29, 2015 at 08:04:58PM +0000, carlos at redhat dot com wrote:

> Stability. We want to test and validate one version of Unicode
> data and update the implementation to match.

Is that testing and validation automated or manual?
Comment 13 Carlos O'Donell 2015-06-30 13:53:43 UTC
(In reply to Paul Wise from comment #12)
> On Mon, Jun 29, 2015 at 08:04:58PM +0000, carlos at redhat dot com wrote:
> 
> > Stability. We want to test and validate one version of Unicode
> > data and update the implementation to match.
> 
> Is that testing and validation automated or manual?

In the past, testing and validation were entirely manual, which is why automatically migrating to a new version of Unicode was difficult and potentially error-prone. With the recent update by the Fedora i18n team (Mike Fabian, Pravin Satpute), and with the help of Alexandre Oliva (Red Hat), we were able to automate the update of the character encoding and ctype tables. That update included regression testing scripts that verify backwards compatibility automatically. So in some ways we have automatic testing for that component of the locales, but given the broader changes in Unicode 8.0, someone should do the work and verify the results.
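
The kind of check such regression scripts perform can be sketched roughly as follows. This is only an illustration, not the actual glibc scripts: the function name `compare_tables` and the dictionary representation of a mapping table are assumptions. The idea is that an update may add entries, but an existing entry that changes or disappears should be flagged for review.

```python
def compare_tables(old, new):
    """Compare two {code point: mapped code point} tables.

    Returns (added, changed, removed), each a sorted list of code points.
    """
    added = sorted(cp for cp in new if cp not in old)
    changed = sorted(cp for cp in old if cp in new and old[cp] != new[cp])
    removed = sorted(cp for cp in old if cp not in new)
    return added, changed, removed


# Toy data: the new table keeps 'a' -> 'A' and adds one Cherokee mapping.
old_toupper = {0x0061: 0x0041}
new_toupper = {0x0061: 0x0041, 0xAB70: 0x13A0}

added, changed, removed = compare_tables(old_toupper, new_toupper)
print("added:", [hex(cp) for cp in added])
print("backward compatible:", not changed and not removed)
```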

If you want to try updating to Unicode 8.0, you can do so by looking at localedata/unicode-gen/Makefile, updating the external properties files, and running `make all` in that unicode-gen directory (distinct from the glibc build).
Comment 14 Mike FABIAN 2015-07-01 05:11:05 UTC
(In reply to Carlos O'Donell from comment #13)

> compatibility automatically. So in some ways we have automatic testing for
> that component of the locales, but given the broader changes in Unicode 8.0,
> someone should do the work and verify the results.
> 
> If you want to try updating to Unicode 8.0, you can do so by looking at
> localedata/unicode-gen/Makefile, updating the external properties files, and
> running `make all` in that unicode-gen directory (distinct from glibc build).

My latest patch set doing the update to Unicode 8.0 (including updating
the translit_* files) is here:

https://sourceware.org/ml/libc-alpha/2015-06/msg00934.html
Comment 15 Mike FABIAN 2015-11-24 06:29:01 UTC
The latest patch is here: 

https://sourceware.org/ml/libc-alpha/2015-07/msg00836.html
Comment 16 Carlos O'Donell 2015-12-10 16:51:53 UTC
This is now fixed.

commit 23256f5ed889266223380c02b2750d19e3fe666b
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Thu Dec 10 00:30:51 2015 -0500

    Update to Unicode 8.0.0.
    
    Update __STDC_ISO_10646__ to 201505L for Unicode 8.0.0.
    Update character encoding, ctype, and transliteration tables.
    New scripts autogenerate transliteration tables.