14038 – strcoll sorting order

Summary: strcoll sorting order

Status:	RESOLVED INVALID

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	localedata
Version:	2.13

Importance:	P2 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2012-05-01 03:42 UTC by Andrzej
Modified:	2014-06-25 11:10 UTC
CC List:	2 users

See Also:
Host:
Target:
Build:
Last reconfirmed:

Flags:	fweimer: security-

Description Andrzej 2012-05-01 03:42:22 UTC

(not sure if that's an implementation or documentation issue)

In UTF-8 locales, the result of some string comparisons depends on the length of the strings. I'm not sure whether that's supposed to work that way (if so, it would be good for the docs to reference a standard defining these rules) or whether it is just a bug.

For example, if strcoll is used as a comparison function, these strings will be sorted as follows:

あ
a
あa
aa
あaa
aaa
あaaa

I'd expect the following order to be correct:

あ
あa
あaa
あaaa
a
aa
aaa
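
To make the report reproducible, here is a minimal sketch of sorting with strcoll as the comparison function. The helper name sort_under_locale is hypothetical (it is not part of glibc), and the locale to pass is an assumption, since the report does not say which UTF-8 locale was in use.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: order two C strings by the current LC_COLLATE rules. */
static int cmp_strcoll(const void *pa, const void *pb)
{
    const char *a = *(const char *const *)pa;
    const char *b = *(const char *const *)pb;
    return strcoll(a, b);
}

/* Hypothetical helper: sort `items` in place under `locale_name`.
   Returns 0 on success, -1 if setlocale rejects the locale name. */
int sort_under_locale(const char **items, size_t n, const char *locale_name)
{
    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return -1;
    qsort(items, n, sizeof items[0], cmp_strcoll);
    return 0;
}
```

Calling sort_under_locale(items, 7, "en_US.UTF-8") on the seven strings above should show the interleaved order on an affected glibc, provided that locale is installed. The locale name here is only an example.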

Comment 1 Andrzej 2012-05-01 03:58:14 UTC

Just to clarify, I ran into this issue(?) when we tried to optimize sorting in our application.

Our assumption was that, knowing that the first characters of two strings are different, comparing just those characters is as good as comparing the whole strings; that is, if 'あ' < 'a' then 'あaaa' < 'aa'. This assumption fails with the current design of strcoll.
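
The assumption can be checked directly. The following is a rough sketch, not code from our application: first_char_shortcut_agrees and utf8_first_len are hypothetical names, and the input is assumed to be valid UTF-8. It compares the sign of strcoll on the first characters with the sign of strcoll on the whole strings under the current LC_COLLATE.

```c
#include <locale.h>
#include <string.h>

/* Byte length of the first UTF-8 character in s, clamped to strlen(s).
   Assumes s is valid UTF-8. */
static size_t utf8_first_len(const char *s)
{
    unsigned char c = (unsigned char)s[0];
    size_t n;
    if (c == 0)                  n = 0;
    else if (c < 0x80)           n = 1;
    else if ((c & 0xE0) == 0xC0) n = 2;
    else if ((c & 0xF0) == 0xE0) n = 3;
    else                         n = 4;
    size_t len = strlen(s);
    return n < len ? n : len;
}

/* Reduce a comparison result to -1, 0 or 1. */
static int sign(int v) { return (v > 0) - (v < 0); }

/* Returns 1 if comparing only the first characters orders a and b the
   same way as strcoll on the whole strings, 0 otherwise. */
int first_char_shortcut_agrees(const char *a, const char *b)
{
    char ha[5] = {0}, hb[5] = {0};
    memcpy(ha, a, utf8_first_len(a));
    memcpy(hb, b, utf8_first_len(b));
    return sign(strcoll(ha, hb)) == sign(strcoll(a, b));
}
```

With LC_COLLATE set to "C" the check passes for 'あaaa' and 'aa', because the comparison is then byte by byte. Under the UTF-8 locale described in this report it would be expected to fail for the same pair.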

Comment 2 Andreas Schwab 2012-05-01 06:48:47 UTC

This is a bad assumption in any case because the sorting algorithm may ignore some characters in the first pass.

The common

Comment 3 Andreas Schwab 2012-05-01 06:54:12 UTC

The common sorting weights from iso14651_t1_common have no entries for Japanese characters, so they are ignored in the first pass.  The ja_JP locale sorts them after the Latin characters.

Comment 4 Petr Baudis 2012-05-01 08:58:15 UTC

Marking as INVALID, thanks to Andreas for taking care to explain. Indeed, the sorting is locale-dependent and may ignore various (usually the unknown) characters. Set LC_COLLATE to POSIX if you want "programmer-friendly" sorting order. Andrzej, feel free to reopen if you have more questions.
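
For reference, a minimal sketch of switching only the collation category to POSIX from inside a program, as an alternative to exporting LC_COLLATE in the environment. The function name use_posix_collation is hypothetical.

```c
#include <locale.h>
#include <string.h>

/* Hypothetical helper: take all other categories from the environment,
   then force collation to the POSIX locale.
   Returns 1 on success, 0 on failure. */
int use_posix_collation(void)
{
    /* May fail if the environment names an unavailable locale; the
       program then simply stays in the default "C" locale. */
    setlocale(LC_ALL, "");
    return setlocale(LC_COLLATE, "POSIX") != NULL;
}
```

After this call, strcoll on glibc orders strings the way strcmp does, that is, by byte value. ASCII strings such as 'aa' then sort before anything starting with 'あ', whose UTF-8 bytes are all above 0x7F, and the length-dependent interleaving goes away.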

Comment 5 Andrzej 2012-05-01 10:44:21 UTC

Just wanted to ask if there is any plan to add Japanese definitions to the iso14651_t1_common file. The current behavior doesn't seem particularly useful.

Also, the documentation issue is still valid: for a nontrivial function like this, there should be at least some hints about where to find the comparison rules or which standards it complies with.

(I'm satisfied with your explanation, so I won't reopen the bug. Please feel free to reopen/reassign it if you think the above issues need to be addressed.)