[root@localhost home]# LC_COLLATE=en_US ls -- 0 a A -a a- aa "a a" a-a "a z"
0 a -a a- A aa a a a-a a z
[root@localhost home]# LC_COLLATE=en_CA ls -- 0 a A -a a- aa "a a" a-a "a z"
0 A a -a a- aa a a a-a a z
[root@localhost home]# LC_COLLATE=da ls -- 0 a A -a a- aa "a a" a-a "a z"
-a 0 A a a a a z a- a-a aa
[root@localhost home]# LC_COLLATE=ar_SA ls -- 0 a A -a a- aa "a a" a-a "a z"
0 A a a a a z aa a- a-a -a

da: (the character "-" has a 1st order sorting value, coming before letters and numbers; on most other locales "-" is ignored in sorting)
ar_SA: (note how ar_SA handles "-" as a collatable element coming after "z")
Please describe what the problem is. At least ISO/IEC defines, for some locales (like en_US), a collation in which capital and small letters are interleaved: a A b B ... and so on. BTW, what is locale "da"? Run "locale -a" and check whether "da" is available or not.
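The availability check asked for above can also be done from a program. The following is a minimal sketch, not part of the original report: the helper name sort_under_locale is hypothetical, and it relies only on Python's standard locale module. On glibc systems setlocale fails for a name that is not installed, and a program in that situation typically stays in the C locale, which sorts by byte value.

```python
import locale
from functools import cmp_to_key

# The nine file names used in the report.
NAMES = ["0", "a", "A", "-a", "a-", "aa", "a a", "a-a", "a z"]


def sort_under_locale(names, locale_name):
    """Sort names with LC_COLLATE set to locale_name.

    Hypothetical helper: returns None when setlocale rejects the name,
    which on glibc means the locale is not installed.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query only, no change
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # locale.strcoll compares two strings under the active LC_COLLATE.
        return sorted(names, key=cmp_to_key(locale.strcoll))
    finally:
        # Restore whatever collation locale was active before the call.
        locale.setlocale(locale.LC_COLLATE, previous)


if __name__ == "__main__":
    for name in ("C", "en_US", "en_CA", "da", "da_DK", "ar_SA"):
        result = sort_under_locale(NAMES, name)
        print(name, "->", result if result is not None else "locale not available")
```

In the C locale this yields -a, 0, A, a, "a a", "a z", a-, a-a, aa, which is the same order as the output reported for LC_COLLATE=da. That would be consistent with "da" not being an installed locale name on the reporter's machine, although the thread does not confirm it.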
Some hints:
1. There should be no difference between en_US and en_CA.
2. de (sorry not da) sorting is very odd. (the character "-" has a 1st order sorting value, coming before letters and numbers; on most other locales "-" is ignored in sorting)
3. ar_SA handles "-" as a collatable element coming after "z". ar_SA defines LC_COLLATE using an old syntax (with only one level of collating weight), so maybe this special weight for "-" wasn't intended to be like that and is just a side effect. Maybe the LC_COLLATE section should be redefined to use the default one and only redefine (if needed) the sorting of Arabic script letters.
Thanks to Mr. Pablo of Mandrake for discussing the issue with me. I borrowed some of his comments.
This is not a valid argument. The rules stem from data worked out by a group of experts on the topic and I trust them more than any random reporter who thinks s/he knows something. Either you specify *exactly* which rules in what locale you consider wrong and you back it up by providing supporting evidence (e.g., from national standards) or you can go away since nothing will ever be changed without following these procedures.
First, I am sorry that you felt as if I was pretending to "know something". Actually, I am not an expert at all in those issues, so please help me report it in a better way if this is still not enough. Second, I am a native Arabic speaker (ar). I also live in Saudi Arabia (SA). We don't have our own English and we don't have "national standards" for English. We follow the known English standards available. The bug I am reporting here concerns locale ar_SA. If I have a file named "aa" and another named "a z", I would expect the command "ls" to display them with "aa" before "a z", as happens when the locale is en_US, en_CA, en_GB, ..., which is not the case now.
I think some LC_COLLATE definitions are indeed wrong, in the sense that they haven't been rewritten/updated to benefit from the new (glibc > 2.2) possibilities. When you look at ar_SA, the LC_COLLATE is defined with lines like:

    order_start forward; forward
    <U0020> <U0020>
    ...
    <U0030> <U0030>
    <U0031> <U0031>
    <U0032> <U0032>
    ....
    <U0041> <U0041>;<U0041>
    <U0061> <U0041>;<U0061>
    ...

If you compare with iso14651_t1 (used, and maybe completed, by most other locales) you see things like this instead:

    <U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>
    ...
    <U0030> <0>;<BAS>;<MIN>;IGNORE # 171 0
    <U0031> <1>;<BAS>;<MIN>;IGNORE # 172 1
    <U0032> <2>;<BAS>;<MIN>;IGNORE # 173 2
    ...
    <U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a
    ...
    <U0041> <a>;<BAS>;<CAP>;IGNORE # 319 A
    ...

While ar_SA gives each element only one, or in some cases two, information tokens, the more modern LC_COLLATE definitions have 4. You can also see that in ar_SA the space (<U0020>) is treated the same way as the digits, while in the more modern LC_COLLATE definition it is not; there the space is defined as sorting neutral. In the modern LC_COLLATE the Latin letters carry information telling whether they are uppercase or lowercase; that information is missing from the ar_SA definition.

da_DK is a bit more strange. It uses a modern LC_COLLATE definition, but redefines everything itself (instead of including iso14651_t1 and only redefining what differs). Spaces and blanks have a 1st order sorting weight, which seems very strange to me. Even if the Danish language sorts spaces in such a peculiar way, it is still strange to sort the space (0020) and the non-breaking space (00A0) differently; semantically they are the same thing, and the difference is only typographical.
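To make the difference between the two styles concrete, here is a toy model of how per-level weights turn into a sort key. It is only a sketch: the tables, the numeric weights and the function name sort_key are made up for illustration and are not the real glibc data or algorithm (in particular, the weights for "z" and the second-level weight for the space in the old-style table are extrapolated assumptions).

```python
IGNORE = None

# Toy four-level table modelled on the iso14651_t1 excerpt above.
# The numbers are invented; only their relative order matters.
MULTI_LEVEL = {
    " ": (IGNORE, IGNORE, IGNORE, 0x20),  # IGNORE;IGNORE;IGNORE;<U0020>
    "a": (10, 1, 1, IGNORE),              # <a>;<BAS>;<MIN>;IGNORE
    "A": (10, 1, 2, IGNORE),              # <a>;<BAS>;<CAP>;IGNORE
    "z": (35, 1, 1, IGNORE),              # assumed by analogy with "a"
}

# Toy two-level table in the style of the ar_SA excerpt above. The excerpt
# lists a single weight for the space; repeating it at level 2 is an
# assumption made to keep the table rectangular.
OLD_STYLE = {
    " ": (0x20, 0x20),
    "A": (0x41, 0x41),                    # <U0041>;<U0041>
    "a": (0x41, 0x61),                    # <U0041>;<U0061>
    "z": (0x5A, 0x7A),                    # assumed by analogy with "a"
}


def sort_key(text, table):
    """Build one tuple of weights per level, skipping IGNORE entries.

    Keys compare level by level, so a later level only matters when all
    earlier levels are equal.
    """
    levels = len(next(iter(table.values())))
    return tuple(
        tuple(table[ch][lvl] for ch in text if table[ch][lvl] is not IGNORE)
        for lvl in range(levels)
    )


if __name__ == "__main__":
    names = ["a z", "aa", "A", "a"]
    print(sorted(names, key=lambda s: sort_key(s, MULTI_LEVEL)))
    print(sorted(names, key=lambda s: sort_key(s, OLD_STYLE)))
```

In this model the space contributes nothing at the first level of the four-level table, so "aa" sorts before "a z", as in the en_US output. In the old-style table the space has a first-level weight lower than any letter, so "a z" sorts before "aa", as in the ar_SA output.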
The sorting of letters is correct, at least for the letters used by a given language (ar_SA, for example, happily ignores any Latin letter outside of ASCII: while ar_EG sorts "agrave" together with "a", ar_SA puts "agrave" after the last Arabic letter...). The handling of punctuation and other special symbols, however, should be reviewed imho. Also, all locales should include iso14651_t1 so that there is an acceptable sorting for alphabetic symbols outside the range of the alphabet of the given locale. In a UTF-8 world you will likely see such things; I get, for example, mail from people with names containing cacute, ccaron, lstroke, eogonek, etc. None of those exist in my language, but I expect them to be sorted with "c", "c", "l", "e" respectively, and not after "z".
Sigh! At last an expert came to the rescue ;)
Created attachment 370: C source file for the tst-strcoll program. This program can only process files composed of lines of 2 UTF-8 characters; some modifications are needed to accept any input.
Created attachment 371: C source file for the tst-wcscoll program. This program can only process files composed of lines of 2 UTF-8 characters; some modifications are needed to accept any input.
Comment on attachment 370 (C source file for the tst-strcoll program): Oops, this patch was for BZ#368.
If any locale definition should change, send a patch with justification. Just saying "I don't like it" achieves *nothing*. I'm closing this bug since there is absolutely no substance here. Locales are only updated if somebody who cares does the work.