374 – The rules in LC_COLLATE are random and sometimes clearly wrong

Status:	RESOLVED WONTFIX

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	localedata
Version:	2.3.3

Importance:	P2 minor
Target Milestone:	---
Assignee:	Petter Reinholdtsen

URL:
Keywords:

Depends on:
Blocks:

Reported:	2004-09-08 23:53 UTC by Munzir Taha
Modified:	2019-04-10 15:06 UTC
CC List:	3 users

See Also:
Host:
Target:
Build:
Last reconfirmed:

Flags:	fweimer: security-

Attachments
C source file for the tst-strcoll program (363 bytes, text/plain) 2005-01-17 21:42 UTC, Denis Barbier
C source file for the tst-wcscoll program (442 bytes, text/plain) 2005-01-17 21:43 UTC, Denis Barbier

Description Munzir Taha 2004-09-08 23:53:31 UTC

[root@localhost home]# LC_COLLATE=en_US ls -- 0 a A -a a- aa "a a" a-a "a z"
0  a  -a  a-  A  aa  a a  a-a  a z
[root@localhost home]# LC_COLLATE=en_CA ls -- 0 a A -a a- aa "a a" a-a "a z"
0  A  a  -a  a-  aa  a a  a-a  a z
[root@localhost home]# LC_COLLATE=da ls -- 0 a A -a a- aa "a a" a-a "a z"
-a  0  A  a  a a  a z  a-  a-a  aa
[root@localhost home]# LC_COLLATE=ar_SA ls -- 0 a A -a a- aa "a a" a-a "a z"
0  A  a  a a  a z  aa  a-  a-a  -a

da: (the character "-" has a 1st order sorting value, coming before letters and
numbers; on most other locales "-" is ignored in sorting)
ar_SA: (note how ar_SA handles "-" as a collatable element coming after "z")
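
The comparison above can be repeated without `ls`. The following is a minimal sketch, not part of the original report, that sorts the same nine names with `strcoll()` under a chosen collation locale. The helper name `sort_under` and the locale names in the usage loop are assumptions; which locales exist depends on the system.

```python
import functools
import locale

# The nine file names used in the report.
NAMES = ["0", "a", "A", "-a", "a-", "aa", "a a", "a-a", "a z"]


def sort_under(locale_name, names):
    """Sort names with strcoll() under locale_name, then restore LC_COLLATE.

    Raises locale.Error if locale_name is not installed.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(names, key=functools.cmp_to_key(locale.strcoll))
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


if __name__ == "__main__":
    # Locale names other than "C" are assumptions about the local system.
    for name in ("C", "en_US.UTF-8", "ar_SA.UTF-8"):
        try:
            print(name, sort_under(name, NAMES))
        except locale.Error:
            print(name, "is not installed")
```

In the "C" locale this gives plain byte order: `-a`, `0`, `A`, `a`, `a a`, `a z`, `a-`, `a-a`, `aa`. That is the same order as the `LC_COLLATE=da` line above, which would be consistent with `da` not resolving to an installed locale, although nothing in this report confirms it.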

Comment 1 GOTO Masanori 2004-09-09 16:27:05 UTC

Please describe what the problem is.  ISO/IEC at least defines a
collation for some locales (like en_US) in which capital and small
letters are grouped together: a A b B ... and so on.

BTW, what is locale "da"?
Execute "locale -a" and check whether "da" is available or not.
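
The same availability check can be done programmatically. This is a small sketch under the assumption that "available" means `setlocale()` accepts the name for `LC_COLLATE`; the function name `collate_locale_available` is made up for illustration.

```python
import locale


def collate_locale_available(name):
    """Return True if setlocale() accepts name for LC_COLLATE."""
    previous = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        return True
    except locale.Error:
        # setlocale() rejected the name, so no such locale is installed.
        return False
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)
```

For example, `collate_locale_available("da")` reports whether the bare name `da` is usable on the machine where the `ls` commands were run.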

Comment 2 Munzir Taha 2004-09-12 18:38:20 UTC

Some hints: 
1. There should be no difference between en_US and en_CA. 
2. de (sorry not da) sorting is very odd. (the character "-" has a 1st order 
sorting value, coming before letters and numbers; on most other locales "-" is 
ignored in sorting) 
3. ar_SA handles "-" as a collatable element coming after "z". ar_SA defines 
LC_COLLATE using an old syntax (with only one level of collating weight); so 
maybe this special weight for "-" wasn't intended to be like that and is just 
a side effect. Maybe the LC_COLLATE section should be redefined to use the 
default one and redefine (if needed) only the sorting of Arabic script 
letters. 
 
Thanks to Mr. Pablo of Mandrake for discussing the issue with me. I borrowed 
some of his comments.

Comment 3 Ulrich Drepper 2004-09-26 09:33:53 UTC

This is no valid argumentation.

The rules stem from data worked out by a group of experts on the topic and I
trust them more than any random reporter who thinks s/he knows something.

Either you specify *exactly* which rules in what locale you consider wrong and
you back it up by providing supporting evidence (e.g., from national standards)
or you can go away since nothing will ever be changed without following these
procedures.

Comment 4 Munzir Taha 2004-09-30 20:25:59 UTC

First, I am sorry that you felt as if I was pretending to "know something". 
Actually, I am not an expert at all in those issues and hence you need to help 
me report it in a better way if this is still not enough. 
 
Second, I am an Arabic native speaker (ar). I am also living in Saudi Arabia 
(SA). Also, we don't have our own English and we don't have "national 
standards" for English. We follow the known English standards available. 
 
The bug I am going to report here is concerned with locale ar_SA. 
If I have a file named "aa" and another named "a z", I would expect the 
command "ls" to display them with "aa" before "a z", as happens when the 
locale is en_US, en_CA, en_GB, ... which is not the case now.

Comment 5 Pablo Saratxaga 2004-09-30 21:07:44 UTC

I think some LC_COLLATE definitions are indeed wrong; it looks like they haven't
been rewritten/updated to take advantage of the new (glibc > 2.2) possibilities.

When you look at ar_SA, the LC_COLLATE is defined with lines like:

order_start             forward; forward
<U0020> <U0020>
...
<U0030> <U0030>
<U0031> <U0031>
<U0032> <U0032>
....
<U0041> <U0041>;<U0041>
<U0061> <U0041>;<U0061>
...

If you compare with iso14651_t1 (used, possibly with additions, by most other
locales) you see things like this instead:
<U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>
...
<U0030> <0>;<BAS>;<MIN>;IGNORE # 171 0
<U0031> <1>;<BAS>;<MIN>;IGNORE # 172 1
<U0032> <2>;<BAS>;<MIN>;IGNORE # 173 2
...
<U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a
...
<U0041> <a>;<BAS>;<CAP>;IGNORE # 319 A
...

While ar_SA gives only one or, in some cases, two information tokens for each
element, the more modern LC_COLLATE definitions have 4.
You can also see that while in ar_SA the space (<U0020>) is treated the same 
way as the digits, in the more modern LC_COLLATE definition it is not; in fact
the space is defined as sorting neutral.
The Latin letters have information telling whether they are uppercase or
lowercase in the modern LC_COLLATE; that information is missing in the
definition in ar_SA.
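
To make the difference concrete, here is a toy model of level-by-level comparison. It is only a sketch: the integer weights and the names `ISO_STYLE`, `OLD_STYLE` and `sort_key` are invented for illustration and are not the real glibc tables. It assumes every level is compared "forward", and that a missing second weight in the old-style table defaults to the character itself.

```python
IGNORE = None  # stands for the IGNORE keyword in a locale definition

# Four weights per character, shaped like the iso14651_t1 lines quoted above.
# The numbers are made up; only their relative order matters.
ISO_STYLE = {
    " ": (IGNORE, IGNORE, IGNORE, 0x20),
    "a": (10, 1, 1, IGNORE),  # <a>;<BAS>;<MIN>;IGNORE
    "A": (10, 1, 2, IGNORE),  # <a>;<BAS>;<CAP>;IGNORE
    "z": (35, 1, 1, IGNORE),
}

# Two weights per character, shaped like the ar_SA lines quoted above.
OLD_STYLE = {
    " ": (0x20, 0x20),
    "A": (0x41, 0x41),
    "a": (0x41, 0x61),
    "z": (0x5A, 0x7A),
}


def sort_key(text, table):
    """Build one weight sequence per level, skipping ignored characters."""
    levels = len(next(iter(table.values())))
    return tuple(
        tuple(table[ch][lvl] for ch in text if table[ch][lvl] is not IGNORE)
        for lvl in range(levels)
    )


for table in (ISO_STYLE, OLD_STYLE):
    print(sorted(["a z", "aa"], key=lambda s: sort_key(s, table)))
```

With the four-level table the space drops out of the first level, so "a z" is compared as "az" and "aa" comes first. With the two-level table the space has a first-level weight below the letters, so "a z" comes first, which is the ar_SA behaviour described in comment 4.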

da_DK is a bit stranger: it uses a modern LC_COLLATE definition, but
redefines everything itself (instead of including iso14651_t1 and only
redefining what differs). Spaces and blanks have a 1st order sorting weight,
which seems very strange to me. Even if the Danish language does sort spaces in
such a peculiar way, it is still strange to sort the space (0020) and the
non-breaking space (00A0) differently; semantically they are the same thing,
and the difference is only typographical.

The sorting of letters is correct, at least for the letters used by a given
language. (ar_SA, for example, happily ignores any Latin letter outside of
ASCII: while ar_EG sorts "agrave" together with "a", ar_SA puts "agrave" after
the last Arabic letter.) The handling of punctuation and other special symbols,
however, should be reviewed imho.
Also, all locales should include iso14651_t1 so that there can be an acceptable
sorting for alphabetic symbols outside the range of the alphabet of the given
locale. In a UTF-8 world you will likely see such things: I get, for example,
mail from people with names containing cacute, ccaron, lstroke, eogonek, etc.
In my language none of those exist, but I expect them to be sorted with 
"c", "c", "l", "e" respectively, and not after "z".

Comment 6 Munzir Taha 2004-10-01 15:57:27 UTC

Sigh! At last an expert came to the rescue ;)

Comment 7 Denis Barbier 2005-01-17 21:42:24 UTC

Created attachment 370
C source file for the tst-strcoll program

This program can only process files composed of lines of 2 UTF-8
characters; some modifications are needed to accept any input.

Comment 8 Denis Barbier 2005-01-17 21:43:18 UTC

Created attachment 371
C source file for the tst-wcscoll program

This program can only process files composed of lines of 2 UTF-8
characters; some modifications are needed to accept any input.

Comment 9 Denis Barbier 2005-01-17 22:29:39 UTC

Comment on attachment 370
C source file for the tst-strcoll program

Oops, this patch was for BZ#368.

Comment 10 Ulrich Drepper 2005-10-14 23:02:39 UTC

If any locale definition should change, send a patch with justification.  Just
saying "I don't like it" achieves *nothing*.  I'm closing this bug since there
is absolutely no substance here.  Locales are only updated if somebody who cares
does the work.