This is sources Bugzilla
Bugzilla Version 2.17.5
Bugzilla Bug 672
  Include iso14651_t1 in collation rules
  Last modified: 2007-02-18 04:34:54
Bug#: 672   Reporter: Denis Barbier <barbier@linuxfr.org>
Status: RESOLVED
Resolution: FIXED
Assigned To: Petter Reinholdtsen <pere@hungry.com>

Attachment               Type        Created           Description
collate-iso.patch        patch       2005-01-16 08:24  Patch to include iso14651_t1 in LC_COLLATE
tst-show-table-sorted.c  text/plain  2005-10-26 22:11  Program to display and sort all combinations of 2 characters
test-collate.sh          text/plain  2005-10-26 22:12  Script to compare output of tst-show-table-sorted with original and patched locales
iso14651_t1              text/plain  2006-05-12 10:05  improved iso14651_t1 file
iso14651_t1              text/plain  2006-05-12 14:32  improved iso14651_t1 file


Description:   Opened: 2005-01-16 08:22
The rationale was given by Pablo in BZ#664.

I tested that sequences of 2 'alnum' characters produce the same
sorted output.  Ligatures and expanded characters have different
weights, so there are some minor changes when checking with more
than 2 characters.
Extra rules are added to mimic current behavior; I did not fix any
supposed errors.

------- Additional Comment #1 From Denis Barbier 2005-01-16 08:24 -------
Created an attachment (id=368)
Patch to include iso14651_t1 in LC_COLLATE

------- Additional Comment #2 From Ulrich Drepper 2005-09-24 19:06 -------
How did you verify nothing changed?

------- Additional Comment #3 From Denis Barbier 2005-10-26 22:08 -------
> How did you verify nothing changed?

I used the attached files to check differences:
  * tst-show-table-sorted.c contains 2 loops to print 2 characters per
    line, and sort them according to the current locale.  Only
    non-ignorable and alphanumeric characters are taken into account.
  * test-collate.sh
    + applies collate-iso.patch
    + modifies iso14651_t1 so that include "iso14651_t1" gives the same
      ruleset as in the original locale files (this is a workaround for BZ#645)
    + compiles original and patched locales
    + runs tst-show-table-sorted with these locales
    + compares output

The only differences are with 
   <U00AA>: FEMININE ORDINAL INDICATOR
   <U00BA>: MASCULINE ORDINAL INDICATOR
   <U00DF>: LATIN SMALL LETTER SHARP S
Some locales also show differences with respect to
   <U00D0>: LATIN CAPITAL LETTER ETH
   <U00F0>: LATIN SMALL LETTER ETH
   <U00DE>: LATIN CAPITAL LETTER THORN
   <U00FE>: LATIN SMALL LETTER THORN
but in such cases, these characters are not commonly used for this locale.
See the end of test-collate.sh for exhaustive results.

------- Additional Comment #4 From Denis Barbier 2005-10-26 22:11 -------
Created an attachment (id=728)
Program to display and sort all combinations of 2 characters

------- Additional Comment #5 From Denis Barbier 2005-10-26 22:12 -------
Created an attachment (id=729)
Script to compare output of tst-show-table-sorted with original and patched
locales

------- Additional Comment #6 From Pablo Saratxaga 2006-05-12 10:05 -------
Created an attachment (id=1018)
improved iso14651_t1 file

improved iso14651_t1 file; changes are:
- converted to UTF-8 (for text in comments)
- added Armenian script block, with proper sorting
- added Tifinagh script block
- added a whole lot of latin and cyrillic script letters,
  so they are "properly" sorted (not at random positions
  before "0" or after "z"; for example, "e with dot below"
  is sorted as "e", etc.)
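
To picture what sorting "e with dot below" as "e" means, here is a toy
two-level comparison. It is only a sketch: the weights, the toy_weight type
and the toy_compare function are invented for illustration and do not reflect
the actual contents of iso14651_t1, and real collation compares whole strings
level by level rather than single characters.

```c
#include <stddef.h>

/* Toy collation entry: a code point with primary and secondary weights. */
struct toy_weight {
    unsigned cp;
    int primary;
    int secondary;
};

/* Invented weights; characters missing from the table sort after 'z'. */
static const struct toy_weight toy_table[] = {
    { 0x0065, 5, 0 },   /* e */
    { 0x1EB9, 5, 1 },   /* e with dot below: same primary weight as e */
    { 0x0066, 6, 0 },   /* f */
    { 0x007A, 26, 0 },  /* z */
};

#define TOY_UNKNOWN_PRIMARY 1000

static struct toy_weight toy_lookup(unsigned cp)
{
    for (size_t i = 0; i < sizeof toy_table / sizeof toy_table[0]; i++)
        if (toy_table[i].cp == cp)
            return toy_table[i];

    struct toy_weight unknown = { cp, TOY_UNKNOWN_PRIMARY, 0 };
    return unknown;
}

/* Compare two code points: primary weights first, secondary as tie-breaker. */
static int toy_compare(unsigned a, unsigned b)
{
    struct toy_weight wa = toy_lookup(a);
    struct toy_weight wb = toy_lookup(b);

    if (wa.primary != wb.primary)
        return wa.primary < wb.primary ? -1 : 1;
    if (wa.secondary != wb.secondary)
        return wa.secondary < wb.secondary ? -1 : 1;
    return 0;
}
```

With an entry in the table, "e with dot below" falls between "e" and "f";
a character without an entry ends up after "z" in this toy model.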

------- Additional Comment #7 From Pablo Saratxaga 2006-05-12 10:17 -------
Using iso14651_t1 by default and then redefining or adding only the local rules
that are needed is indeed much better than redefining everything in a locale: as
the redefined parts are much smaller, it is easier to understand the important
rules and to detect and correct errors.
It also allows characters outside the scope of the locale to be sorted in a
predictable way, which is a very nice thing to have.
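
The layering argued for here (a shared default table, with each locale
redefining only what it needs) can be pictured with a toy lookup. This is a
sketch, not how localedef stores collation data; the weights, the
weight_entry type and the lookup_weight function are invented for
illustration.

```c
#include <stddef.h>

/* Toy collation entry: one character and its sort weight. */
struct weight_entry {
    char ch;
    int weight;
};

/* Hypothetical shared default table (stands in for iso14651_t1). */
static const struct weight_entry base_table[] = {
    { 'a', 10 }, { 'b', 20 }, { 'c', 30 }, { 'z', 260 },
};

/* Hypothetical local rules: this made-up locale sorts 'c' after 'z'. */
static const struct weight_entry local_rules[] = {
    { 'c', 270 },
};

/* Search one table; returns 1 and stores the weight if ch is found. */
static int find_weight(const struct weight_entry *table, size_t n,
                       char ch, int *weight)
{
    for (size_t i = 0; i < n; i++) {
        if (table[i].ch == ch) {
            *weight = table[i].weight;
            return 1;
        }
    }
    return 0;
}

/* Local rules win; everything else falls back to the shared table.
   Returns -1 for characters that neither table defines. */
static int lookup_weight(char ch)
{
    int w;

    if (find_weight(local_rules,
                    sizeof local_rules / sizeof local_rules[0], ch, &w))
        return w;
    if (find_weight(base_table,
                    sizeof base_table / sizeof base_table[0], ch, &w))
        return w;
    return -1;
}
```

The local table stays short, so the locale-specific decision (here, the
position of 'c') is easy to spot, while every other character keeps the
shared default ordering.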

I attached an improved iso14651_t1 that adds a lot of other latin and cyrillic
characters that were missing, so they get sorted too; it also handles double
accented letters (as in Vietnamese) and adds Armenian and Tifinagh script
blocks. It treats t/s with cedilla and t/s with comma below as synonyms for
sorting, and makes digraphs (as opposed to ligatures) synonyms of the base
letters for sorting.

It provides a much better default collating set.
Note that with the exception of t/s with cedilla and t/s with comma below and
the digraphs (which are Unicode compatibility characters and should never be
typed directly, by the way), I mostly added only new, previously ignored,
characters.

The main advantage of the modified file is proper (or at least quite
acceptable) sorting when using a generic locale (i.e. one not specific to that
language), in particular when sorting words from Armenian, Vietnamese, African
or Native American languages written in latin script, or languages of the
former USSR written in cyrillic script.

------- Additional Comment #8 From Pablo Saratxaga 2006-05-12 14:32 -------
Created an attachment (id=1019)
improved iso14651_t1 file

(fixed a small problem: there were two defined symbols that were unused)

------- Additional Comment #9 From Ulrich Drepper 2007-02-18 04:34 -------
I've added the latest iso14651_t1 and then changed the locale definitions.
Please check whether this is all that's needed.
