In C locale, sorting looks like this:

maciej@debian:~$ echo -e 'a\n \n a\n#\n#a\n@\n@a' | LC_COLLATE=C sort | sed -e 's/.*/"&"/'
" "
" a"
"#"
"#a"
"@"
"@a"
"a"

However, in en_US.UTF-8, sorting looks like this:

maciej@debian:~$ echo -e 'a\n \n a\n#\n#a\n@\n@a' | LC_COLLATE=en_US.UTF-8 sort | sed -e 's/.*/"&"/'
" "
"@"
"#"
"a"
" a"
"@a"
"#a"

I believe that this is wrong. I've observed it on many many hosts, with different Linux distributions.
You can of course believe anything you want, but to have the presumed correct sorting order in glibc changed, you will have to provide some reference or source that can explain why it is wrong. The C sorting order is according to character code value, while most languages sort according to linguistic rules. The sorting order for en_US is according to linguistic rules, and I believe it is correct. Only letters and numbers are considered on the first level, and space and non-letter characters are considered if the first level is identical.
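To make the two-level idea concrete, here is a rough sketch in Python. It is not glibc's algorithm and does not use the en_US locale data: the helper `two_level_key` is a made-up name, and its tie-break (plain code point order on the full string) is an assumption chosen for simplicity. It only reproduces the grouping seen in the report, with the punctuation-only lines first and all lines containing "a" together, and not the exact order inside each group.

```python
# Simplified illustration only: NOT glibc's collation algorithm.
def two_level_key(s):
    # Level 1: keep only letters and digits, in their original order.
    primary = "".join(ch for ch in s if ch.isalnum())
    # Level 2 (assumption): the full string by code point, used only
    # when two strings are identical on level 1.
    return (primary, s)

lines = ["a", " ", " a", "#", "#a", "@", "@a"]
result = sorted(lines, key=two_level_key)

for line in result:
    print('"%s"' % line)
```

With this key, " ", "#" and "@" all have an empty first level, so they sort ahead of every line whose first level is "a".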
I've found a relevant page here: http://www.unicode.org/unicode/reports/tr10/#Common_Misperceptions

Collation order is not preserved under concatenation or substring operations, in general. For example, the fact that x is less than y does not mean that x + z is less than y + z. This is because characters may form contractions across the substring or concatenation boundaries. In summary, the following shows which implications not to expect.

x < y ↛ xz < yz
x < y ↛ zx < zy
xz < yz ↛ x < y
zx < zy ↛ x < y

* * *

Perhaps I've fallen into misperception number one: x < y → xz < yz. I'll need to read this document carefully and understand why x < y doesn't imply xz < yz.
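A small Python sketch of one way the implication can fail. It uses a simplified two-level key (the made-up helper `two_level_key`, not glibc's algorithm or the en_US locale data). Note that the cause here is punctuation being ignored on the first level, not the contractions that the quoted report mentions, so treat it as an illustration of the effect rather than of that specific mechanism.

```python
# Simplified illustration only: NOT glibc's collation algorithm.
def two_level_key(s):
    # Level 1: letters and digits only; level 2: full string as tie-break.
    primary = "".join(ch for ch in s if ch.isalnum())
    return (primary, s)

x, y, z = "@", "a", "b"

# "@" has an empty first level, so it sorts before "a".
x_before_y = two_level_key(x) < two_level_key(y)

# "@b" has first level "b", while "ab" has first level "ab", so "ab" wins.
xz_before_yz = two_level_key(x + z) < two_level_key(y + z)

print(x_before_y, xz_before_yz)  # prints: True False
```

So under this key x < y holds, but xz < yz does not.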
The traditional C locale ordering has nothing to do with how the world really does sorting. Ask your local librarian. There is nothing wrong here.