11559 – Wrong sorting of the space in cs_CZ locale

Status:	RESOLVED WONTFIX

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	localedata
Version:	unspecified

Importance:	P2 normal
Target Milestone:	---
Assignee:	GNU C Library Locale Maintainers

URL:
Keywords:

Depends on:
Blocks:

Reported:	2010-04-30 05:40 UTC by Martin Edlman
Modified:	2014-06-30 09:18 UTC
CC List:	2 users

See Also:
Host:
Target:
Build:
Last reconfirmed:

Flags:	fweimer: security-


Description Martin Edlman 2010-04-30 05:40:17 UTC

There is a problem with sorting of the space in the cs_CZ locale (as in many
other locales). 

According to Czech Standard 97 6030 Alphabetical ordering (Czech Institute of
Standards, Prague 1993) [Czech: ČSN 97 6030 Abecední řazení (Český normalizační
institut, Praha 1993)]:
The space between two contextual characters should be considered as a single
character. The space is sorted before the first letter of the alphabet. For
example: 

Novak Zdenek
Novakova Jana

So I propose changing the collation rule for the space from
<U0020> IGNORE;IGNORE;IGNORE;<U0020>
to
<U0020> <U0020>;IGNORE;<U0020>;<U0020> 

The problem is that this change takes spaces at the beginning of a line into
account, which is not correct, as it sorts "Novák" and "(space)Zounar" as

(space)Zounar
Novák

instead of the correct

Novák
(space)Zounar

The same applies to multiple spaces: they should be treated as one space, so
"Novák(space)Jan" and "Novák(space)(space)Zdenek" are incorrectly sorted as

Novák(space)(space)Zdenek
Novák(space)Jan

instead of the correct

Novák(space)Jan
Novák(space)(space)Zdenek

Is it possible to fix this behavior in the locale definition? That would
solve the problem and conform to the standard.
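
The two failure cases above follow from giving the space an ordinary per-character weight. The sketch below is only a toy model of that idea, not glibc's collation algorithm: the function name `proposed_weight_key` and the concrete weights (space = 0, every other character = 1 + its code point) are assumptions made for illustration.

```python
def proposed_weight_key(text):
    """Toy model: each character gets a fixed weight, regardless of context.

    Assumption for illustration only: the space weighs 0 (lower than any
    letter) and every other character weighs 1 + its code point.
    """
    return tuple(0 if ch == " " else 1 + ord(ch) for ch in text.casefold())


# The case the proposal is meant to fix comes out right ...
print(sorted(["Novakova Jana", "Novak Zdenek"], key=proposed_weight_key))

# ... but a leading space and a doubled space are weighted too,
# so these two lists come out in the order the report calls incorrect.
print(sorted(["Novák", " Zounar"], key=proposed_weight_key))
print(sorted(["Novák Jan", "Novák  Zdenek"], key=proposed_weight_key))
```

Because the weight depends only on the character and not on its position, the model cannot treat a leading or repeated space differently from a single space between two words.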

Comment 1 Ulrich Drepper 2011-05-09 23:40:30 UTC

There is no way to express "not at the beginning of a line".  Therefore, whichever way it is done, there is a problem.  There is no way to make everyone happy.  I suggest leaving it as is.  If you can get the previous authors of changes to the locale (see the file) to agree with a change, I'll reconsider.

Comment 2 Martin Edlman 2011-05-10 06:22:33 UTC

I contacted the authors of the original cs_CZ locale; they directed me to report it as a bug, which I did.
So, let it be as it is.
I'll change the locales myself on the servers where I need correct space ordering. It would be great if the locale definition syntax were extended so that the behaviour of whitespace characters at the beginning or end of the text, and of repeated whitespace characters, could be defined.
I have no idea how much work it would be to change the code and if it would be useful for anyone else.

Comment 3 Ulrich Drepper 2011-05-14 03:26:32 UTC

(In reply to comment #2)
> I have no idea how much work it would be to change the code and if it would be
> useful for anyone else.

It simply isn't possible with the current interfaces.
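
Where the ordering described in this report is needed, one option is to normalise the whitespace in the application before comparing, rather than in the locale. The following is a minimal sketch under stated assumptions, not part of glibc or of the cs_CZ locale: `czech_space_key` is a hypothetical helper, and the default per-word key (`str.casefold`) is a placeholder that does not implement Czech letter ordering. A caller could pass `locale.strxfrm` instead, after `locale.setlocale(locale.LC_COLLATE, "cs_CZ.UTF-8")`, assuming that locale is installed.

```python
def czech_space_key(text, word_key=str.casefold):
    """Sort key that ignores leading/trailing whitespace and treats any
    run of whitespace as a single separator sorting before every letter.

    str.split() with no arguments drops leading and trailing whitespace
    and collapses runs, so the text becomes a tuple of words. Tuples
    compare element by element, and a shorter tuple that is a prefix of
    a longer one sorts first, which gives "Novak Zdenek" < "Novakova Jana".
    """
    return tuple(word_key(word) for word in text.split())


names = ["Novakova Jana", " Zounar", "Novák  Zdenek", "Novák Jan", "Novak Zdenek"]
for name in sorted(names, key=czech_space_key):
    print(repr(name))
```

Comparing word by word avoids having to pick a numeric weight for the space at all; the choice of `word_key` decides how the letters themselves are ordered.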