The rationale was given by Pablo in BZ#664. I tested that sequences of 2 'alnum' characters produce the same sorted output. Ligatures and expanded characters have different weights, so there are some minor changes when checking with more than 2 characters. Extra rules are added to mimic current behavior; I did not fix any supposed errors.
Created an attachment (id=368) Patch to include iso14651_t1 in LC_COLLATE
How did you verify nothing changed?
> How did you verify nothing changed?

I used the attached files to check differences:

* tst-show-table-sorted.c contains 2 loops to print 2 characters per line, and sorts them according to the current locale. Only non-ignorable and alphanumeric characters are taken into account.
* test-collate.sh
  + applies collate-iso.patch
  + modifies iso14651_t1 so that include "iso14651_t1" gives the same ruleset as in original locale files (this is a workaround for BZ645)
  + compiles original and patched locales
  + runs tst-show-table-sorted with these locales
  + compares output

The only differences are with
  <U00AA>: FEMININE ORDINAL INDICATOR
  <U00BA>: MASCULINE ORDINAL INDICATOR
  <U00DF>: LATIN SMALL LETTER SHARP S
Some locales also have differences with respect to
  <U00D0>: LATIN CAPITAL LETTER ETH
  <U00F0>: LATIN SMALL LETTER ETH
  <U00DE>: LATIN CAPITAL LETTER THORN
  <U00FE>: LATIN SMALL LETTER THORN
but in such cases, these characters are not commonly used for this locale. See the end of test-collate.sh for exhaustive results.
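For readers without the attachment at hand, the idea behind tst-show-table-sorted.c can be sketched roughly as follows. This is not the attached program, only a minimal illustration under stated assumptions: the names `collect_alnum`, `sorted_pairs`, `print_pairs`, `pair_t` and the `MAX_CHARS` cap are hypothetical, and the sketch does not filter out ignorable characters the way the real test is described as doing.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical cap on the number of candidate characters (an assumption). */
#define MAX_CHARS 512

/* A 2-character sequence plus its terminating null. */
typedef wchar_t pair_t[3];

/* qsort comparator: order two wide strings by the current LC_COLLATE. */
static int cmp_coll(const void *a, const void *b)
{
    return wcscoll((const wchar_t *)a, (const wchar_t *)b);
}

/* Store the characters in [first, last] that are alphanumeric in the
   current locale into out; returns how many were stored. */
size_t collect_alnum(wint_t first, wint_t last, wchar_t *out)
{
    size_t n = 0;
    assert(out != NULL);
    for (wint_t c = first; c <= last && n < MAX_CHARS; c++)
        if (iswalnum(c))
            out[n++] = (wchar_t)c;
    return n;
}

/* Two nested loops build every 2-character sequence over chars[0..n-1];
   the result is sorted with wcscoll(). Returns a malloc'd array of n*n
   entries (caller frees), or NULL on allocation failure. */
pair_t *sorted_pairs(const wchar_t *chars, size_t n)
{
    pair_t *pairs;
    assert(chars != NULL && n > 0 && n <= MAX_CHARS);
    pairs = malloc(n * n * sizeof *pairs);
    if (pairs == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            pairs[i * n + j][0] = chars[i];
            pairs[i * n + j][1] = chars[j];
            pairs[i * n + j][2] = L'\0';
        }
    qsort(pairs, n * n, sizeof *pairs, cmp_coll);
    return pairs;
}

/* Print one sequence per line, so two runs can be compared with diff. */
void print_pairs(pair_t *pairs, size_t count)
{
    for (size_t i = 0; i < count; i++)
        printf("%ls\n", pairs[i]);
}
```

A caller would select the locale under test with setlocale(LC_ALL, "") before collecting characters, run the program once with the original locale and once with the patched one, and diff the two outputs.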
Created an attachment (id=728) Program to display and sort all combinations of 2 characters
Created an attachment (id=729) Script to compare output of tst-show-table-sorted with original and patched locales
Created an attachment (id=1018) improved iso14651_t1 file

Changes are:
- converted to UTF-8 (for text in comments)
- added Armenian script block, with proper sorting
- added Tifinagh script block
- added a whole lot of Latin and Cyrillic script letters, so they are "properly" sorted (not at random positions before "0" or after "z", but, for example, "e with dot below" sorted as "e", etc.)
Using iso14651_t1 by default and then only redefining or adding some local rules where needed is indeed much better than redefining everything in a locale. As the redefined parts are much smaller, it helps to understand the important rules and to detect and correct errors more easily. It also allows characters outside the scope of the locale to be sorted in a predictable way, which is a very nice thing to have.

I attached an improved iso14651_t1 that adds a lot of other Latin and Cyrillic characters that were missing, so they get sorted too; it also handles double-accented letters (as in Vietnamese); adds Armenian and Tifinagh script blocks; considers t/s with cedilla and t/s with comma below as synonyms for sorting; and makes digraphs (as opposed to ligatures) synonyms of the base letters for sorting. It provides a much better default collating set.

Note that with the exception of t/s with cedilla and t/s with comma below and the digraphs (which are Unicode compatibility characters and should never be typed directly, by the way), I mainly only added new, previously ignored characters.

The main advantage of the modified file is proper (or at least quite acceptable) sorting when using a generic locale (i.e. one not specific to that language), in particular when sorting words from Armenian, from Vietnamese, African or Native American languages written in Latin script, or from languages of the former USSR written in Cyrillic script.
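A quick way to spot-check a claim such as "e with dot below sorts as e" in a given locale might look like the sketch below. The helper names `sorts_between` and `e_dot_below_sorts_as_e` are hypothetical, and the check is only an approximation: it tests whether the character collates at or after "e" and before "f", which is what one would expect if it is treated as a variant of "e".

```c
#include <assert.h>
#include <locale.h>
#include <wchar.h>

/* Return 1 if x collates at or after lo and strictly before hi under the
   current LC_COLLATE, else 0. */
int sorts_between(const wchar_t *lo, const wchar_t *x, const wchar_t *hi)
{
    assert(lo != NULL && x != NULL && hi != NULL);
    return wcscoll(lo, x) <= 0 && wcscoll(x, hi) < 0;
}

/* Hypothetical spot check for U+1EB9 (LATIN SMALL LETTER E WITH DOT BELOW).
   Returns -1 if the named locale is not available, otherwise 1 when the
   character falls between "e" and "f" and 0 when it does not. */
int e_dot_below_sorts_as_e(const char *locale_name)
{
    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return -1;
    return sorts_between(L"e", L"\u1EB9", L"f");
}
```

Running the check against a locale compiled with the old table and again with the improved one would show whether the character moved from an arbitrary position to the neighborhood of its base letter.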
Created an attachment (id=1019) improved iso14651_t1 file (fixed a small problem: there were two defined symbols that were unused)
I've added the latest iso14651_t1 and then changed the locale definitions. Please check whether this is all that's needed.