This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.


[Bug locale/21302] strcoll does not correctly follow locale-specified order in some cases


https://sourceware.org/bugzilla/show_bug.cgi?id=21302

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                          |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                      |ASSIGNED
   Last reconfirmed|                                 |2017-10-28
           Assignee|unassigned at sourceware dot org |carlos at redhat dot com
     Ever confirmed|0                                |1

--- Comment #11 from Carlos O'Donell <carlos at redhat dot com> ---
OK, I have fixed the code-point collation sorting issue.

There are 2 problems:

(a) The collation table builder ignores characters in the collation
specification, and therefore assigns them no weights, if they do not exactly
match the hash of the symbolic name from the charmap. This is arguably a QoI
issue, but it needs an explicit warning for all UTF-8 locales to catch typos
in the collation tables.

(b) Since the UTF-8 charmap uses 4- or 8-character code point names, the
collation must also use *identically* matching symbols, or those symbols are
silently ignored and have no weights. This is where the Debian and Fedora
collations got it wrong: effectively we have giant ranges of typos (and
ellipses generating typos in the thousands) that do not have correct weights.

Once I added the new warnings for (a), I could find all the problems with the
locale file and fix (b).
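
To illustrate the exact-match requirement in (a) and (b), here is a minimal
sketch in Python. It is not localedef's code: the function names are
hypothetical, and it assumes charmap names look like <U0041> (4 uppercase hex
digits up to U+FFFF) or <U00010000> (8 hex digits above that). It only shows
why a symbol that names the right code point with the wrong spelling finds no
match, and how a check could report it.

```python
import re

# Hypothetical helpers for illustration only; not localedef's implementation.
_SYMBOL = re.compile(r"<U([0-9A-Fa-f]+)>")


def canonical_name(symbol):
    """Spell a <U...> symbol the way the charmap is assumed to spell it.

    Assumption: 4 uppercase hex digits for code points up to U+FFFF and
    8 uppercase hex digits above that. Returns None for other symbols.
    """
    match = _SYMBOL.fullmatch(symbol)
    if match is None:
        return None
    code_point = int(match.group(1), 16)
    width = 4 if code_point <= 0xFFFF else 8
    return "<U{:0{}X}>".format(code_point, width)


def missing_collation_symbols(charmap_names, collation_symbols):
    """Return collation symbols with no identically spelled charmap name.

    The comparison is an exact string match, mirroring the behavior
    described in (a): anything that does not match exactly is unknown.
    """
    known = set(charmap_names)
    return [s for s in collation_symbols if s not in known]


charmap = ["<U0041>", "<U00010000>"]
collation = ["<U0041>", "<U10000>", "<U00010000>"]

for symbol in missing_collation_symbols(charmap, collation):
    # <U10000> names a valid code point but is spelled with 5 digits,
    # so an exact-match lookup would silently skip it without a warning.
    print("no charmap match:", symbol, "expected", canonical_name(symbol))
```

With these sample inputs only <U10000> is reported, because the other two
symbols are spelled exactly as in the sample charmap.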

To solve this I'm adding a new --warning=missingcollchar warning, which I plan
to turn on for all locales being compiled with UTF-8. It will also be turned
on by verbose, so that users can see these warnings when developing a locale.
We cannot turn them on by default because a collation sequence is entirely
allowed to contain characters that do not exist in the charmap you are using,
and those characters can be safely ignored.

After that I'm going to send my C.UTF-8 patch upstream for review so all the
distros can have a harmonized C.UTF-8 to use with correct collation.
