Bug 25206 - strcoll sort result incorrect locale lv_LV.UTF-8
Status: RESOLVED DUPLICATE of bug 23774
Alias: None
Product: glibc
Classification: Unclassified
Component: localedata
Version: 2.31
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2019-11-19 14:13 UTC by Carlos O'Donell
Modified: 2024-02-07 08:45 UTC
Flags: fweimer: security-


Attachments
lv_LV.UTF-8.in_sorted (223 bytes, application/octet-stream)
2019-11-19 14:13 UTC, Carlos O'Donell
lv_LV.UTF-8_with_more_chars_and_removed.in (257 bytes, application/octet-stream)
2019-11-19 14:13 UTC, Carlos O'Donell

Description Carlos O'Donell 2019-11-19 14:13:07 UTC
Created attachment 12082 [details]
lv_LV.UTF-8.in_sorted

In the downstream bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1696729

The claim is made that lv_LV sorting is incorrect.

I have suggested that the following be sorted correctly:
https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/lv_LV.UTF-8.in;h=db7e83c77e83183ee88eb9769f82a66c4cb758ab;hb=HEAD

Then we can use this as a reference for discussion.
Comment 1 Carlos O'Donell 2019-11-19 14:13:56 UTC
Created attachment 12083 [details]
lv_LV.UTF-8_with_more_chars_and_removed.in

Additional sorted file.
Comment 2 Rafal Luzynski 2019-11-25 11:20:17 UTC
This should probably be addressed to Agris; I hope he is able to read this comment.

(In reply to Carlos O'Donell from comment #0)
> Created attachment 12082 [details]
> lv_LV.UTF-8.in_sorted
> 
> In the downstream bug report:
> https://bugzilla.redhat.com/show_bug.cgi?id=1696729
> [...]

These details look suspicious to me for the following reasons.

1. The quoted rule says that "the string with capital letter is preferred".  What does "preferred" mean?  To me it seems to mean that it should be sorted first.  This rule would be difficult for us to implement because we would probably have to reorder all letters.  I'm not sure this is worth the effort.  But then the sample sort file lists uppercase letters second, which is fine with me but contradicts the rule.

2. My understanding of the rule quoted above is that it should be applied only when the words differ solely in letter case: the letters should first be compared ignoring case.  But the sample file sorts all uppercase words after all lowercase words, for example:

a
ab
abc
ad
...
az
azzzxxyz
A
Abc
Az
AB

Is this really what we want?  I think this rule would be very inconvenient for users, because they would have to keep it in mind all the time and always look for capitalized words after all the lowercase words that start with the same letter.
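
The two readings of the case rule in point 2 can be contrasted with a minimal Python sketch. It does not call glibc or use the lv_LV locale; the key functions `case_as_tiebreak` and `lowercase_block_first` are hypothetical names introduced only for illustration, and the word list is the one from the sample above.

```python
# Words from the sample above, deliberately shuffled.
words = ["Abc", "az", "A", "ab", "a", "AB", "abc", "Az", "ad", "azzzxxyz"]


def case_as_tiebreak(word):
    # Reading A: compare letters ignoring case first; case only decides
    # between words that are otherwise equal (lowercase first here).
    return (word.lower(), [c.isupper() for c in word])


def lowercase_block_first(word):
    # Reading B: case is part of the primary comparison of every letter,
    # so any uppercase letter sorts after every lowercase letter.
    return [(c.isupper(), c.lower()) for c in word]


tiebreak_order = sorted(words, key=case_as_tiebreak)
block_order = sorted(words, key=lowercase_block_first)

print(tiebreak_order)
print(block_order)
```

Under these assumptions, reading B reproduces the order of the sample file (all lowercase words, then A, Abc, Az, AB), while reading A keeps words that differ only in case next to each other.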

3. I totally understand and agree with one point.  The letters 'a' and 'ā' should be separated.  For example, now we have:

a
ā
ab
ācc
add
āfe
ah

but if I understand correctly this should be:

a
ab
add
ah
ā
ācc
āfe

This is understandable and easy to implement, but it means that we were all wrong, and by "all" I mean including CLDR.  To confirm that CLDR was wrong, I would like at least to see a ticket filed against CLDR and no objection on their side.
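
The difference between the current and the requested order in point 3 can be sketched in Python as follows. This is only an illustration, not the glibc implementation: `accent_as_tiebreak` and `separate_letters` are hypothetical helper names, and `ALPHABET` is the standard 33-letter Latvian alphabet written out by hand as an assumption, not taken from the locale source.

```python
import unicodedata


def nfc(text):
    # Use precomposed characters so that 'ā' is a single code point.
    return unicodedata.normalize("NFC", text)


words = [nfc(w) for w in ["a", "ā", "ab", "ācc", "add", "āfe", "ah"]]


def strip_marks(word):
    # Remove combining marks, e.g. 'ā' becomes 'a'.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def accent_as_tiebreak(word):
    # Current behaviour as described above: base letters decide first,
    # the macron only breaks ties between otherwise equal words.
    return (strip_marks(word), word)


# Assumption: standard Latvian alphabet order, used as a primary ranking.
ALPHABET = nfc("aābcčdeēfgģhiījkķlļmnņoprsštuūvzž")
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def separate_letters(word):
    # Requested behaviour: 'ā' is a letter of its own, placed after 'a'.
    return [RANK[c] for c in nfc(word)]


current_order = sorted(words, key=accent_as_tiebreak)
requested_order = sorted(words, key=separate_letters)

print(current_order)
print(requested_order)
```

With the accent as a tie-breaker the words interleave (a, ā, ab, ācc, ...); with 'ā' ranked as a separate letter all words starting with 'a' come before all words starting with 'ā'.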

4. Please note that the current sorting rules for lv_LV distinguish letters 'c' vs 'č', 'g' vs 'ģ', 'k' vs 'ķ' and several more.  Do I understand correctly that those letters work fine and the same rule should be applied to 'a' vs 'ā' and several more characters?

(In reply to Carlos O'Donell from comment #1)
> Created attachment 12083 [details]
> lv_LV.UTF-8_with_more_chars_and_removed.in
> 
> Additional sorted file.

5. This does not look good to me either.  Adding more test characters is a good idea, but removing other test characters just because they are foreign to Latvian is not.  The sorting rules must somehow deal with even the most exotic characters.  This is why we aim to start the collation rules with “copy "iso14651_t1"”, which aims to include all Unicode characters, and only then add the rules specific to the current language.
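
The idea in point 5, a base table covering every character plus a small language-specific tailoring on top, can be sketched in Python. This is a rough illustration under loud assumptions: plain code point order is only a stand-in for iso14651_t1 (the real table orders characters very differently), and the `TAILORING` entry is a hypothetical weight, not the actual lv_LV rule.

```python
import unicodedata


def nfc(text):
    # Use precomposed characters so that 'ā' is a single code point.
    return unicodedata.normalize("NFC", text)


# Hypothetical tailoring: place 'ā' directly after 'a' and before 'b'.
TAILORING = {nfc("ā"): ord("a") + 0.5}


def weight(ch):
    # Tailored characters use the language-specific weight; everything
    # else falls back to the base table (here: the code point, as a
    # stand-in for iso14651_t1), so foreign characters still sort.
    return TAILORING.get(ch, ord(ch))


def sort_key(word):
    return [weight(c) for c in nfc(word)]


sample = [nfc(w) for w in ["ācc", "ω", "add", "ж", "ab"]]
ordered = sorted(sample, key=sort_key)
print(ordered)
```

The point of the sketch is that Greek and Cyrillic test characters keep a well-defined position through the fallback, so there is no need to remove them from the test file when tailoring the Latvian letters.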


TL;DR: If adding a rule to distinguish 'a' vs 'ā' plus several more similar characters is sufficient, then we can easily implement this, but the attached test cases need to be fixed.  Otherwise we'll have to verify whether the required rules are correct.
Comment 3 Mike FABIAN 2024-02-07 08:45:54 UTC
I will sync with CLDR again.

*** This bug has been marked as a duplicate of bug 23774 ***