This is sources Bugzilla
Bugzilla Version 2.17.5
Bugzilla Bug 5521
Collation: "#" < "a" < "#a"; special characters and wrong sorting
Last modified: 2008-01-11 00:28
Bug#: 5521   Reporter: Maciej Bliziński <maciej.blizinski+sources-bugzilla@gmail.com>
Status: RESOLVED
Resolution: INVALID
Assigned To: GNU C Library Locale Maintainers <libc-locales@sources.redhat.com>


Description:   Opened: 2007-12-24 14:33
In C locale, sorting looks like this:

maciej@debian:~$ echo -e 'a\n \n a\n#\n#a\n@\n@a' | LC_COLLATE=C sort | sed -e 's/.*/"&"/'
" "
" a"
"#"
"#a"
"@"
"@a"
"a"
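
The C-locale result follows from byte values alone: space is 0x20, "#" is 0x23, "@" is 0x40 and "a" is 0x61. A minimal sketch that reproduces the listing above, assuming a POSIX shell; it uses printf instead of echo -e for portability, and the variable name c_sorted is only illustrative:

```shell
# Sort the same seven strings by raw byte value (C locale).
# LC_ALL overrides LC_COLLATE and LANG, so the result does not depend
# on the caller's environment.
c_sorted=$(printf '%s\n' 'a' ' ' ' a' '#' '#a' '@' '@a' | LC_ALL=C sort)

# Quote each line so that leading spaces are visible.
printf '%s\n' "$c_sorted" | sed -e 's/.*/"&"/'
```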

However, in en_US.UTF-8, sorting looks like this:

maciej@debian:~$ echo -e 'a\n \n a\n#\n#a\n@\n@a' | LC_COLLATE=en_US.UTF-8 sort | sed -e 's/.*/"&"/'
" "
"@"
"#"
"a"
" a"
"@a"
"#a"

I believe that this is wrong. I've observed it on many hosts running
different Linux distributions.

------- Additional Comment #1 From Petter Reinholdtsen 2007-12-24 15:06 -------
You can of course believe anything you want, but to have the presumed
correct sorting order in glibc changed, you will have to provide
some reference or source that can explain why it is wrong.
The C sorting order is according to character code value, while most languages
sort according to linguistic rules.  The sorting order for en_US is
according to linguistic rules, and I believe it is correct.  Only letters
and numbers are considered on the first level, and space and non-letter
characters are considered if the first level is identical.
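
The two-level idea described here can be approximated with a decorate-sort-undecorate pipeline. This is only a toy model and not how glibc implements collation: it assumes ASCII input without tab characters, treats letters and digits as the only first-level characters, and breaks ties by byte value. It therefore reproduces the grouping seen in the en_US.UTF-8 output (bare punctuation first, then every string whose letters are "a"), but not the exact order inside each group. The function name toy_collate is hypothetical.

```shell
# toy_collate: hypothetical helper, NOT glibc's algorithm.
# Level 1: compare only the ASCII letters and digits of each line.
# Level 2: if level 1 ties, compare the whole line by byte value.
# Each line is prefixed with its level-1 key and a tab, sorted on
# key then original text, and the key is cut off again at the end.
toy_collate() {
  tab=$(printf '\t')
  while IFS= read -r line; do
    key=$(printf '%s' "$line" | LC_ALL=C tr -cd 'A-Za-z0-9')
    printf '%s\t%s\n' "$key" "$line"
  done |
    LC_ALL=C sort -t "$tab" -k1,1 -k2,2 |
    cut -f2-
}

toy_sorted=$(printf '%s\n' 'a' ' ' ' a' '#' '#a' '@' '@a' | toy_collate)
printf '%s\n' "$toy_sorted" | sed -e 's/.*/"&"/'
```

In this model " ", "#" and "@" share an empty first-level key and so sort ahead of everything containing "a", which is why "#" < "a" < "#a" is a consistent ordering rather than a contradiction.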

------- Additional Comment #2 From Maciej Bliziński 2007-12-24 16:51 -------
I've found a relevant page here:
http://www.unicode.org/unicode/reports/tr10/#Common_Misperceptions

Collation order is not preserved under concatenation or substring operations, in
general. For example, the fact that x is less than y does not mean that x + z is
less than y + z. This is because characters may form contractions across the
substring or concatenation boundaries. In summary, the following shows which
implications not to expect.

x < y ↛ xz < yz
x < y ↛ zx < zy
xz < yz ↛ x < y
zx < zy ↛ x < y

* * *

Perhaps I've fallen into misperception number one: x < y → xz < yz. I'll
need to read this document carefully and understand why x < y doesn't imply
xz < yz.
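
One way to see how x < y can fail to imply xz < yz is a toy two-level comparison. The sketch below is not the UTR #10 algorithm and does not involve contractions, which is the cause the quoted passage names; it assumes only that punctuation is ignored at the first level, as comment #1 describes. Under that assumption "#" sorts before "a", yet "#b" sorts after "ab", because their first-level keys are "b" and "ab". The helper toy_less is hypothetical.

```shell
# toy_less A B: succeed if A sorts strictly before B in a toy two-level
# order (level 1: ASCII letters and digits only; level 2: byte value).
# Hypothetical helper for illustration, NOT glibc's or UTR #10's algorithm.
toy_less() {
  [ "$1" != "$2" ] || return 1
  tab=$(printf '\t')
  first=$(
    for s in "$1" "$2"; do
      key=$(printf '%s' "$s" | LC_ALL=C tr -cd 'A-Za-z0-9')
      printf '%s\t%s\n' "$key" "$s"
    done | LC_ALL=C sort -t "$tab" -k1,1 -k2,2 | head -n 1 | cut -f2-
  )
  [ "$first" = "$1" ]
}

x='#'; y='a'; z='b'
# x sorts before y ...
toy_less "$x" "$y" && echo "\"$x\" < \"$y\""
# ... but after appending z to both, the order flips.
toy_less "$y$z" "$x$z" && echo "\"$y$z\" < \"$x$z\""
```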

------- Additional Comment #3 From Ulrich Drepper 2008-01-11 00:28 -------
The traditional C locale ordering has nothing to do with how the world really
does sorting.  Ask your local librarian.  There is nothing wrong here.
