Bug 23774 - lv_LV collates Y/y incorrectly
Summary: lv_LV collates Y/y incorrectly
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: localedata
Version: unspecified
Importance: P2 minor
Target Milestone: 2.40
Assignee: Mike FABIAN
URL:
Keywords:
Duplicates: 25206
Depends on:
Blocks:
 
Reported: 2018-10-14 08:58 UTC by Danko Alexeyev
Modified: 2024-02-08 07:38 UTC (History)
CC List: 6 users

See Also:
Host:
Target:
Build:
Last reconfirmed: 2024-02-07 00:00:00
fweimer: security-


Attachments

Description Danko Alexeyev 2018-10-14 08:58:37 UTC
Commit 159738548130d5ac4fe6178977e940ed5f8cfdc4 introduced this change in the lv_LV locale:

-<U0079> <i>;<PCL>;<MIN>;IGNORE % y
-<U0059> <i>;<PCL>;<CAP>;IGNORE % Y
+<U0079> <S0069>;<LOWLINE>;<MIN>;IGNORE % y
+<U0059> <S0069>;<LOWLINE>;<CAP>;IGNORE % Y

I don't know what "PCL" meant and whether "Y" was supposed to be "BASE" in the first place, but "LOWLINE" certainly looks like a bug.

Letter Y is not present in the Latvian alphabet, however it is present in Latgalian and is located after I, which is what the CLDR rule seems to suggest:

&I<<y<<<Y

I found this by accident while investigating the result of this command on my system (with LANG set to lv_LV.UTF-8):

$ echo abcxyz | grep -Eo '[a-z]+'
abcx
z

I'm sorry if I misunderstood something as I've never worked with either glibc or CLDR locales directly before.
Comment 1 Mike FABIAN 2018-10-15 12:11:40 UTC
(In reply to Danko Alexeyev from comment #0)
> Commit 159738548130d5ac4fe6178977e940ed5f8cfdc4 introduced this change in
> the lv_LV locale:
> 
> -<U0079> <i>;<PCL>;<MIN>;IGNORE % y
> -<U0059> <i>;<PCL>;<CAP>;IGNORE % Y
> +<U0079> <S0069>;<LOWLINE>;<MIN>;IGNORE % y
> +<U0059> <S0069>;<LOWLINE>;<CAP>;IGNORE % Y
> 
> I don't know what "PCL" meant and whether "Y" was supposed to be "BASE" in
> the first place, but "LOWLINE" certainly looks like a bug.

PCL was an old collation symbol which was used in an older version of
the glibc/localedata/locales/iso14651_t1_common file.  It was a second
level collation symbol.

To get the right sort order, replacing it by any existing secondary
collation symbol except "BASE" works fine here.

The current glibc/localedata/locales/iso14651_t1_common contains:

    % Second-level collating symbols

    collating-symbol <BASE>
    collating-symbol <LOWLINE>  % COMBINING LOW LINE
    collating-symbol <PSILI>  % COMBINING COMMA ABOVE
    collating-symbol <DASIA>  % COMBINING REVERSED COMMA ABOVE
    collating-symbol <AIGUT>  % COMBINING ACUTE ACCENT
    ...

<BASE> means base letter; all the following collation symbols can be used
to indicate secondary differences from base letters. As there is nothing
particularly appropriate for the difference between i and y, it doesn’t
really matter which one is used, so I chose the first one, LOWLINE.

> Letter Y is not present in the Latvian alphabet, however it is present in
> Latgalian and is located after I, which is what the CLDR rule seems to
> suggest:
> 
> &I<<y<<<Y

This rule means that y is sorted after I, *but* only as a secondary difference
("<" is a primary difference, "<<" is a secondary difference, "<<<" is a tertiary difference).
Secondary differences are "accent" differences, i.e. y is treated here
not as a really different letter from I (that would be a primary difference),
but as an "accent" variation of I. Tertiary differences are often used
for upper/lower case differences, which is the case here, i.e. the difference
between y and Y is an upper/lower case difference.
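
As a rough illustration of the notation described here, the following Python sketch splits a rule such as &I<<y<<<Y into its individual relations. It is a toy under stated assumptions: the function name parse_rule is hypothetical, and it understands only the reset "&" and the three operators "<", "<<" and "<<<", not the full CLDR collation rule syntax.

```python
import re

# Map each operator to the difference level it expresses.
STRENGTH = {"<": "primary", "<<": "secondary", "<<<": "tertiary"}


def parse_rule(rule):
    """Split a simplified tailoring rule into (previous, strength, item) triples.

    Hypothetical helper: handles only '&' followed by items separated by
    '<', '<<' or '<<<'. Real CLDR rules have many more features.
    """
    if not rule.startswith("&"):
        raise ValueError("rule must start with the reset marker '&'")
    # The capturing group keeps the operators in the split result:
    # "I<<y<<<Y" -> ["I", "<<", "y", "<<<", "Y"]
    parts = re.split(r"(<{1,3})", rule[1:])
    previous = parts[0].strip()
    relations = []
    for operator, item in zip(parts[1::2], parts[2::2]):
        item = item.strip()
        if not item:
            raise ValueError("operator without a following item")
        relations.append((previous, STRENGTH[operator], item))
        previous = item
    return relations


print(parse_rule("&I<<y<<<Y"))
# [('I', 'secondary', 'y'), ('y', 'tertiary', 'Y')]
```

Read this way, the rule places y at a secondary difference from I, and Y at a tertiary difference from y.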

If you look at the sorting test file for lv_LV.UTF-8:

glibc/localedata/lv_LV.UTF-8.in

you will find that it contains:

i
I
ī
Ī
y
Y
ia
Ia
īa
Īa
ya
Ya
ib
Ib
īb
Īb
yb
Yb

If y had a primary difference from i, ya would be sorted *after* ib.
But as it is only a secondary difference, the primary difference between
a and b decides the order for the strings ya and ib.
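
The effect can be sketched with a small multi-level sort key in Python. The weights below are made-up small integers chosen only for illustration (they are not the values glibc uses), and sort_key is a hypothetical helper: it compares all primary weights of a string first, then all secondary weights, then all tertiary weights.

```python
# Hypothetical (primary, secondary, tertiary) weights; NOT the real glibc values.
WEIGHTS = {
    "a": (1, 0, 0),
    "b": (2, 0, 0),
    "i": (9, 0, 0),
    "I": (9, 0, 1),  # tertiary difference from i (case)
    "ī": (9, 1, 0),  # secondary difference from i (accent)
    "Ī": (9, 1, 1),
    "y": (9, 2, 0),  # same primary as i: only a secondary difference
    "Y": (9, 2, 1),
}


def sort_key(text, weights=WEIGHTS):
    """Return (all primaries, all secondaries, all tertiaries) for text."""
    levels = zip(*(weights[ch] for ch in text))
    return tuple(tuple(level) for level in levels)


# The entries of the test file excerpt, deliberately shuffled.
WORDS = ["yb", "Ib", "ya", "ī", "Y", "ia", "Īb", "i", "Ya", "īa", "I", "ib",
         "Ī", "Yb", "Ia", "y", "īb", "Īa"]

# Order as listed in the test file excerpt.
EXPECTED = ["i", "I", "ī", "Ī", "y", "Y",
            "ia", "Ia", "īa", "Īa", "ya", "Ya",
            "ib", "Ib", "īb", "Īb", "yb", "Yb"]

ordered = sorted(WORDS, key=sort_key)
print(ordered == EXPECTED)

# Counterfactual: give y its own primary weight after i.
PRIMARY_Y = dict(WEIGHTS, y=(10, 0, 0), Y=(10, 0, 1))
counterfactual = sorted(WORDS, key=lambda word: sort_key(word, PRIMARY_Y))
print(counterfactual.index("ya") > counterfactual.index("ib"))
```

With y sharing the primary weight of i, the a/b difference decides first and ya lands before ib; once y gets its own primary weight, every string starting with y sorts after every string starting with i.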

> I found this by accident while investigating the result of this command on
> my system (with LANG being lv_LV.UTF-8)
> 
> $ echo abcxyz | grep -Eo '[a-z]+'
> abcx
> z
> 
> I'm sorry if I misunderstood something as I've never worked with either
> glibc or CLDR locales directly before.

This fails for other reasons, not because of the use of LOWLINE.
Comment 2 Mike FABIAN 2018-10-15 12:33:56 UTC
See also

https://bugzilla.redhat.com/show_bug.cgi?id=1631472#c3

for a similar case in Swedish.
Comment 3 Reinis Danne 2018-12-22 19:56:28 UTC
sed-4.6 and grep-3.3 seem to have resolved this particular issue by implementing rational range interpretation, but [a-ž] and [A-Ž] are buggy.

The former de-interleaves the capital letters for unaccented characters, but accented capitals are left among the small letters.

Does glibc (2.28) offer alternative collations (or does grep do it)?
As far as I could tell the collation sequence is as specified in the locale:
Using LC_COLLATE=lv_LV.UTF-8
char	strxfrm
i	c2b7010201020101e29b96
I	c2b7010201070101e2afb7
ī	c2b70102140102020101e29bb7
Ī	c2b70102140107020101e2b096
y	c2b701030102
Y	c2b701030107
j	c382010201020101e29c96
J	c382010201070101e2b0a4
Using LC_COLLATE=C.UTF-8
char	strxfrm
i	6b
I	4b
ī	c4ad
Ī	c4ac
y	7b
Y	5b
j	6c
J	4c
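
A table of this kind could be produced along the following lines in Python. This is a guess at a workable method, not necessarily how the figures in this comment were generated: collation_keys is a hypothetical helper, hex-dumping the UTF-8 encoding of the strxfrm result is an assumption, and the output depends on the glibc version and on which locales are installed.

```python
import locale


def collation_keys(chars, locale_name):
    """Return {char: hex dump of its strxfrm key} under locale_name.

    Hypothetical helper. Returns None if the locale is not installed.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # Assumption: dump the transformed string as UTF-8 bytes in hex.
        return {
            ch: locale.strxfrm(ch).encode("utf-8", "surrogatepass").hex()
            for ch in chars
        }
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore previous setting


if __name__ == "__main__":
    for name in ("lv_LV.UTF-8", "C.UTF-8"):
        keys = collation_keys("iIīĪyYjJ", name)
        if keys is None:
            print(f"locale {name} is not available on this system")
            continue
        print(f"Using LC_COLLATE={name}")
        print("char\tstrxfrm")
        for ch, key in keys.items():
            print(f"{ch}\t{key}")
```

Sorting the characters by these keys should match the order that strcoll-based tools see for the same locale.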
Comment 4 Carlos O'Donell 2021-09-08 15:12:24 UTC
The notes from Mike indicate that this is not a bug in the glibc locale data for lv_LV and that we are harmonized with CLDR. I haven't seen further comments from Danko Alexeyev to refute that. I'm marking this RESOLVED NOTABUG.

To answer Reinis Danne's question: yes, you can make alternative collations, but they must be alternative locales, e.g. lv_LV@alt1.utf8, where "@alt1" is a suffix that defines an alternative locale (lv_LV@alt1) with a collation distinct from lv_LV.
Comment 5 Rudolfs Mazurs 2023-12-08 21:19:55 UTC
This bug is still relevant. Collation for lv_LV locale should be: i, y, ī, not i, ī, y.

The faulty behaviour was introduced in Bug 15537, 0001-lv_LV-locale-fix-collation-BZ-15537.patch

The correct ordering is described in the LVS 24:1993 standard; the relevant section was quoted in the https://unicode-org.atlassian.net/browse/CLDR-6475 bug report.
Comment 6 Rudolfs Mazurs 2024-02-01 22:10:52 UTC
This issue was fixed in another CLDR issue: https://unicode-org.atlassian.net/browse/CLDR-11982

The new collation rules are in the file https://github.com/unicode-org/cldr/blob/main/common/collation/lv.xml and are set to be released in CLDR version 45.

Should I attempt to make a git patch?
Comment 7 Mike FABIAN 2024-02-07 08:44:11 UTC
Reopen.
Comment 8 Mike FABIAN 2024-02-07 08:45:54 UTC
*** Bug 25206 has been marked as a duplicate of this bug. ***