Bug 14095 - Review / update collation data from Unicode / ISO 14651
Summary: Review / update collation data from Unicode / ISO 14651
Status: NEW
Alias: None
Product: glibc
Classification: Unclassified
Component: localedata
Version: 2.15
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-05-10 20:32 UTC by Joseph Myers
Modified: 2016-02-19 17:14 UTC
CC List: 6 users

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Description Joseph Myers 2012-05-10 20:32:11 UTC
The localedata/locales/iso14651_t1_* files are probably, from their names, originally based on some version of ISO 14651 collation data.  They should be updated if possible to be based on the current Unicode collation data and algorithms.

http://www.unicode.org/reports/tr10/

Since there have been a lot of changes to these files since the original addition in

2000-05-24  Ulrich Drepper  <drepper@redhat.com>

        * locales/iso14651_t1: New file.

it's likely there will be a lot of work to understand how the files relate to ISO 14651 and what local changes are still relevant.
Comment 1 Paul Wise 2015-06-30 03:52:47 UTC
Why did glibc fork the Unicode collation data instead of sending changes upstream?
Comment 2 joseph@codesourcery.com 2015-06-30 11:14:35 UTC
The people involved in getting the collation data to its present state are 
mostly no longer involved in glibc development, so if you want an 
authoritative answer you'll need to do a lot of work tracking them down.  
My hypothesis would be that each person submitting a change generally had 
their own itch to scratch (supporting collation for their own language 
better, with no interest in a more general update to a newer version of 
ISO 14651, if a newer version even existed at that time, or insufficient 
time / expertise / resources to get involved in their national standards 
committees parallel to JTC1/SC2/WG2, if ISO 14651 did not support their 
language then) and that each person accepting such a change decided that 
it was better to have the incremental improvement than to have no 
collation support for that language for the indefinite future until 
someone appeared to contribute a more thorough update.

We don't, however, need to know people's motivations for making 
incremental changes rather than larger bulk updates.  The questions that 
are actually relevant for updating the data now are more along the lines 
of: for the original addition of the ISO 14651 data, what differences are 
there from the relevant version of ISO 14651?  Do those differences relate 
to conceptual differences between the POSIX collation model and the ISO 
14651 collation model, or do they reflect different choices for how to 
collate particular characters?  If they reflect different choices, do we 
still agree that those choices are appropriate for the contexts in which 
glibc locales are used, or, with hindsight, would the ISO 14651 choices 
now be better?  Where a change was made subsequently affecting existing 
characters, is the change still at variance with current ISO 14651, and do 
we think there is still a good reason for such a difference?  Where 
collation support for new characters was added, how does that support 
compare to the support, if any, for those characters in current ISO 14651, 
and are there any differences we think are deliberate and should be 
preserved?  Do any differences reflect cases where e.g. different national 
standards specify different collation for the same characters (or 
collation differs by context), and so individual locales may need to 
override the generic international version?

Yes, there is a lot of detailed, careful work involved in analysis of the 
history of the current collation data in order to produce a justified 
analysis of those questions with recommendations for how to use data from 
current ISO 14651.  Given the responsibility to users to avoid 
regressions, we need to understand what changes would be involved in such 
an update, and satisfy ourselves that they are good changes rather than 
regressions, as part of making such an update.  Contributors willing to 
help with that careful analysis are welcome.
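
A first pass at the questions above could be mechanical: load the per-character weight lines from two versions of the table and list what differs. The following is only a minimal sketch. It assumes each entry has the form `<Uxxxx> w1;w2;w3;w4` with an optional trailing comment, and the function names `parse_weights` and `diff_tables` are hypothetical, not part of glibc or its scripts. The comment character is a parameter because it may differ between files.

```python
import re

# Assumed entry layout: "<Uxxxx> w1;w2;w3;w4" plus an optional trailing comment.
# Lines that do not match (directives, symbol declarations, blanks) are skipped.
ENTRY_RE = re.compile(r"^(<U[0-9A-Fa-f]{4,8}>)\s+(\S+)")


def parse_weights(text, comment_char="%"):
    """Map each <Uxxxx> symbol to its tuple of per-level weights."""
    table = {}
    for line in text.splitlines():
        # Drop everything from the comment character onwards.
        line = line.split(comment_char, 1)[0].strip()
        match = ENTRY_RE.match(line)
        if match:
            table[match.group(1)] = tuple(match.group(2).split(";"))
    return table


def diff_tables(ours, upstream):
    """Return (only in ours, only in upstream, entries whose weights differ)."""
    only_ours = sorted(set(ours) - set(upstream))
    only_upstream = sorted(set(upstream) - set(ours))
    changed = {
        sym: (ours[sym], upstream[sym])
        for sym in sorted(set(ours) & set(upstream))
        if ours[sym] != upstream[sym]
    }
    return only_ours, only_upstream, changed
```

Such a comparison is purely textual. If the two files name their weight symbols differently, every entry would be reported as changed even where the resulting order is identical, so a real analysis would also need to compare the relative order of entries.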
Comment 3 Carlos O'Donell 2015-06-30 13:44:55 UTC
(In reply to joseph@codesourcery.com from comment #2)
> Yes, there is a lot of detailed, careful work involved in analysis of the 
> history of the current collation data in order to produce a justified 
> analysis of those questions with recommendations for how to use data from 
> current ISO 14651.  Given the responsibility to users to avoid 
> regressions, we need to understand what changes would be involved in such 
> an update, and satisfy ourselves that they are good changes rather than 
> regressions, as part of making such an update.  Contributors willing to 
> help with that careful analysis are welcome.

I agree completely with Joseph.
Comment 4 keld@keldix.com 2015-06-30 15:29:40 UTC
(In reply to joseph@codesourcery.com from comment #2)

Well, I was the author of many of the collation specs for different
languages, and I am still around, and I have even joined glibc maintenance
just a few years ago.

The 14651 and POSIX models are the same, or rather 14651 is backwards
compatible with POSIX. We cannot say that we follow POSIX strictly; if we
did, we could not have working locales, as POSIX is not well suited for
ISO 10646 UCS. So we adhere to 14651 rather than to POSIX.

The different locale collation data were designed to adhere to
14651, in an orthogonal way, just like 14651 was designed to be used.

I am willing to contribute with a look on the different issues.

Best regards
Keld
Comment 5 joseph@codesourcery.com 2015-06-30 16:03:54 UTC
On Tue, 30 Jun 2015, keld at keldix dot com wrote:

> I am willing to contribute with a look on the different issues.

That would be very helpful, thanks!  The first question would probably be 
where the original iso14651_t1 file (added in commit 
b0a3e2e6238f4846bc7a99145d2721b8d5b5ec31 in the history repository) came 
from; if we can reproduce it from old ISO 14651 data, we can hopefully 
build a corresponding file from current ISO 14651 data - and then start to 
understand, for all the changes made to the data over the past 15 years, 
which of them are still relevant and desirable given current ISO 14651 / 
Unicode data as a base, and what the right way is to handle those changes.
Comment 6 keld@keldix.com 2015-07-01 07:58:28 UTC
(In reply to joseph@codesourcery.com from comment #5)

It is my plan to work with the editor of 14651 on making the 14651
data directly useable with glibc. This is not currently the case
and we know it.

Keld
Comment 7 Mike Frysinger 2016-02-19 07:05:37 UTC
Any update? We've got these shiny new unicode-gen/ Python scripts for importing Unicode data ...
Comment 8 joseph@codesourcery.com 2016-02-19 17:14:32 UTC
I expect reviewing the sources of and past changes to collation data, and 
writing suitable scripts to reproduce it from old upstream data / 
regenerate it from new upstream data, taking due account of any deliberate 
differences, to be substantially more work than the update of other data 
from Unicode was.
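
To illustrate what "taking due account of any deliberate differences" could involve, here is a hypothetical three-way classification in the style of a merge. It is a sketch only, not an existing glibc script. It assumes the three tables (the old upstream data the file was derived from, the current glibc data, and the new upstream data) have already been loaded as dictionaries mapping a character symbol to its weights. The function name `classify` and the category names are made up for this example.

```python
def classify(old_upstream, glibc, new_upstream):
    """Sort every symbol into one of three buckets for an update."""
    report = {"take_upstream": [], "already_matches": [], "needs_review": []}
    symbols = set(old_upstream) | set(glibc) | set(new_upstream)
    for sym in sorted(symbols):
        base = old_upstream.get(sym)   # None if absent from the old data
        ours = glibc.get(sym)          # None if absent from glibc
        new = new_upstream.get(sym)    # None if absent from the new data
        if ours == base:
            # glibc never diverged here, so the new upstream value can be used.
            report["take_upstream"].append(sym)
        elif ours == new:
            # glibc diverged from the old data but agrees with current upstream.
            report["already_matches"].append(sym)
        else:
            # A local change or addition that differs from current upstream;
            # someone has to decide whether it is deliberate.
            report["needs_review"].append(sym)
    return report
```

Only the `needs_review` bucket requires human judgement, which is the part expected to take most of the effort.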