Summary: | Update locale data to Unicode 8.0 | ||
---|---|---|---|
Product: | glibc | Reporter: | Joseph Myers <jsm28> |
Component: | localedata | Assignee: | Not yet assigned to anyone <unassigned> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | aoliva, carlos, libc-locales, maiku.fabian, myllynen, pabs3, pravin.d.s |
Priority: | P2 | Flags: | fweimer: security- |
Version: | 2.21 | ||
Target Milestone: | 2.23 | ||
Host: | Target: | ||
Build: | Last reconfirmed: |
Description
Joseph Myers
2015-06-20 22:53:53 UTC
The following Unicode Blog posting provides some comments about required implementation changes; however, I haven't investigated how relevant these are from the glibc perspective.

http://blog.unicode.org/2015/03/unicode-80-beta-review.html

Unicode 8.0.0 comprises several changes which require careful migration in implementations, including the conversion of Cherokee to a bicameral script, a different encoding model for New Tai Lue, and additional character repertoire. Implementers need to change code and check assumptions regarding case mappings, New Tai Lue syllables, Han character ranges, and confusables.

I made a patch for this and posted it to the libc-alpha mailing list:

https://sourceware.org/ml/libc-alpha/2015-06/msg00748.html

(In reply to Marko Myllynen from comment #1)
> The following Unicode Blog posting provides some comments about required
> implementation changes; however, I haven't investigated how relevant these
> are from the glibc perspective.

I think for glibc, this has little impact.

> http://blog.unicode.org/2015/03/unicode-80-beta-review.html
>
> Unicode 8.0.0 comprises several changes which require careful migration in
> implementations, including the conversion of Cherokee to a bicameral script,

That means Cherokee has upper and lower case now; lots of characters get added to "toupper", "tolower", and "totitle" in the "i18n" file.

> a different encoding model for New Tai Lue, and additional character
> repertoire.

Many New Tai Lue characters which were combining characters before are not combining characters anymore.

> Implementers need to change code and check assumptions regarding
> case mappings, New Tai Lue syllables, Han character ranges, and confusables.

What about removing the Unicode locale data from the glibc repository and tarball and creating the glibc versions from the Unicode data at build time? This would mean a new Unicode release would just need a rebuild of glibc to include the new data?
The results of a glibc build and install should depend only on the glibc source tree (preferably, be byte-for-byte reproducible in different builds), not on any network resources. In addition, it's common for manual review and script changes to be needed as part of a Unicode update.

That should be possible: just require the Unicode data at build time, just like glibc requires gcc at build time. The Unicode data is already packaged in some distributions as a unicode-data package. No network resources needed. Depending on the Unicode data instead of embedding copies of it doesn't prevent review and script changes.

You are missing the point. glibc requires a fixed version of the Unicode data which cannot be updated without updating the glibc sources.

Why is that?

(In reply to Paul Wise from comment #8)
> Why is that?

Stability. We want to test and validate one version of Unicode data and update the implementation to match.

On Mon, Jun 29, 2015 at 08:04:58PM +0000, carlos at redhat dot com wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=18568
>
> Carlos O'Donell <carlos at redhat dot com> changed:
>
> What |Removed |Added
> ----------------------------------------------------------------------------
> CC| |carlos at redhat dot com
>
> --- Comment #9 from Carlos O'Donell <carlos at redhat dot com> ---
> (In reply to Paul Wise from comment #8)
> > Why is that?
>
> Stability. We want to test and validate one version of Unicode data and update
> the implementation to match.
I think we should keep LC_CTYPE and LC_COLLATE data in sync.
And then build LC_COLLATE on the latest approved ISO 14651 data.
Best regards
Keld
Collation is the subject of bug 14095, not this one (and as noted there, is probably a lot more work to update because of all the local changes over the past 15 years - though maybe once the initial catching up is done, future Unicode updates can then update everything together).

On Mon, Jun 29, 2015 at 08:04:58PM +0000, carlos at redhat dot com wrote:
> Stability. We want to test and validate one version of Unicode
> data and update the implementation to match.
Is that testing and validation automated or manual?
(In reply to Paul Wise from comment #12)
> On Mon, Jun 29, 2015 at 08:04:58PM +0000, carlos at redhat dot com wrote:
>
> > Stability. We want to test and validate one version of Unicode
> > data and update the implementation to match.
>
> Is that testing and validation automated or manual?

In the past the testing and validation was entirely manual, which is why automatically migrating to a new version of Unicode was difficult and potentially error-prone.

With the recent update by the Fedora i18n team (Mike Fabian, Pravin Satpute) and with the help of Alexandre Oliva (Red Hat), we were able to automate the update of the character encoding and ctype tables. This update included regression testing scripts that verify backwards compatibility automatically. So in some ways we have automatic testing for that component of the locales, but given the broader changes in Unicode 8.0, someone should do the work and verify the results.

If you want to try updating to Unicode 8.0, you can do so by looking at localedata/unicode-gen/Makefile, updating the external properties files, and running `make all` in that unicode-gen directory (distinct from the glibc build).

(In reply to Carlos O'Donell from comment #13)
> compatibility automatically. So in some ways we have automatic testing for
> that component of the locales, but given the broader changes in Unicode 8.0,
> someone should do the work and verify the results.
>
> If you want to try updating to Unicode 8.0, you can do so by looking at
> localedata/unicode-gen/Makefile, updating the external properties files, and
> running `make all` in that unicode-gen directory (distinct from the glibc build).

My latest patch set doing the update to Unicode 8.0 (including updating the translit_* files) is here:

https://sourceware.org/ml/libc-alpha/2015-06/msg00934.html

The latest patch is here:

https://sourceware.org/ml/libc-alpha/2015-07/msg00836.html

This is now fixed.
commit 23256f5ed889266223380c02b2750d19e3fe666b
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Thu Dec 10 00:30:51 2015 -0500

    Update to Unicode 8.0.0.

    Update __STDC_ISO_10646__ to 201505L for Unicode 8.0.0.
    Update character encoding, ctype, and transliteration tables.
    New scripts autogenerate transliteration tables.
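
For readers who want a feel for the two technical points raised in the thread (Cherokee gaining "toupper"/"tolower" entries, and the automated backwards-compatibility checking mentioned in comment #13), here is a minimal Python sketch. It is not the tooling under localedata/unicode-gen and does not claim to match it. The function names `case_maps`, `compare`, and `ucs` are hypothetical, the sample records are illustrative lines written in the semicolon-separated UnicodeData.txt field layout and should be checked against the real data files, and the `(<Uxxxx>,<Uxxxx>)` pair notation is an assumption about how locale sources write a mapping.

```python
# Field positions in a semicolon-separated UnicodeData.txt record.
CODE, UPPER, LOWER = 0, 12, 13


def case_maps(records):
    """Return (toupper, tolower) dicts mapping code point -> code point."""
    toupper, tolower = {}, {}
    for record in records:
        fields = record.split(";")
        code = int(fields[CODE], 16)
        if fields[UPPER]:
            toupper[code] = int(fields[UPPER], 16)
        if fields[LOWER]:
            tolower[code] = int(fields[LOWER], 16)
    return toupper, tolower


def compare(old, new):
    """Split the differences between two mappings into added, removed, changed."""
    added = {k: v for k, v in new.items() if k not in old}
    removed = {k: v for k, v in old.items() if k not in new}
    changed = {k: (old[k], new[k])
               for k in old if k in new and old[k] != new[k]}
    return added, removed, changed


def ucs(code):
    """Format a code point in <Uxxxx> notation (assumed locale-source style)."""
    return "<U%04X>" % code


# Illustrative records only: an older data set in which the Cherokee letter
# has no case mapping, and a newer one in which it is bicameral.
OLD_RECORDS = [
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "13A0;CHEROKEE LETTER A;Lo;0;L;;;;;N;;;;;",
]
NEW_RECORDS = [
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "13A0;CHEROKEE LETTER A;Lu;0;L;;;;;N;;;;AB70;",
    "AB70;CHEROKEE SMALL LETTER A;Ll;0;L;;;;;N;;;13A0;;13A0",
]

old_upper, old_lower = case_maps(OLD_RECORDS)
new_upper, new_lower = case_maps(NEW_RECORDS)

added, removed, changed = compare(old_upper, new_upper)
for src, dst in sorted(added.items()):
    print("toupper added: (%s,%s)" % (ucs(src), ucs(dst)))
for src in sorted(removed):
    print("toupper REMOVED: %s" % ucs(src))
for src, (before, after) in sorted(changed.items()):
    print("toupper CHANGED: %s %s -> %s" % (ucs(src), ucs(before), ucs(after)))
```

In this sketch, additions are expected when moving to a newer Unicode version, while any removed or changed entry is the kind of difference a reviewer would want to look at by hand before accepting regenerated tables.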