15537 – lv_LV: invalid collation for Latvian diacritical letters

Summary: lv_LV: invalid collation for Latvian diacritical letters

Status:	RESOLVED FIXED

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	localedata
Version:	2.18

Importance:	P2 normal
Target Milestone:	2.27
Assignee:	Not yet assigned to anyone

Reported:	2013-05-26 14:22 UTC by alexander smishlajev
Modified:	2017-11-22 05:14 UTC
CC List:	3 users

Flags:	fweimer: security-

Attachments
0001-lv_LV-locale-fix-collation-BZ-15537.patch (12.94 KB, patch) 2017-11-20 08:49 UTC, Mike FABIAN

Description alexander smishlajev 2013-05-26 14:22:42 UTC

The Latvian locale for Latvia (lv_LV) uses the wrong collation order for the Latvian vowels with macrons: A MACRON (U0100, U0101), E MACRON (U0112, U0113), I MACRON (U012A, U012B), O MACRON (U014C, U014D), and U MACRON (U016A, U016B).  The first weight specifier for each of these letters should equal that of its base letter (A, E, I, O, and U, respectively); only the second weight specifier should be heavier.  In other words, a letter with a macron sorts after the same letter without a macron only when the remaining parts of the strings are equal.

Note that the consonants with diacritics (C CARON, G CEDILLA, K CEDILLA, L CEDILLA, N CEDILLA, S CARON, and Z CARON) are always sorted after their base letters; for these letters the first weight specifier must differ from that of the base letter, and the current version of the Latvian locale already handles this correctly.
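
The two rules above can be illustrated with a small Python sketch. It is not the glibc implementation: `sketch_key` and `SEPARATE_LETTERS` are hypothetical names, case is ignored, and Unicode decomposition stands in for the real weight tables. The sketch only shows how a primary weight shared with the base letter (vowels with macrons) differs from a distinct primary weight (consonants with diacritics).

```python
import unicodedata

# Assumption for this sketch: the consonants named in the report,
# each treated as a separate letter that follows its base letter.
SEPARATE_LETTERS = set("čģķļņšž")

def sketch_key(word):
    """Build a two-level key: all primary weights first, then secondary ones."""
    primary, secondary = [], []
    for ch in word.lower():                      # case is ignored in this sketch
        decomposed = unicodedata.normalize("NFD", ch)
        base = decomposed[0]
        if ch in SEPARATE_LETTERS:
            primary.append((base, 1))            # distinct primary weight, after the base
            secondary.append(0)
        else:
            primary.append((base, 0))            # same primary weight as the base letter
            secondary.append(1 if len(decomposed) > 1 else 0)  # diacritic breaks ties only
    return (primary, secondary)

# Vowels with macrons interleave with their base letters.
print(sorted(["āb", "aa", "ab", "ā", "a", "āa"], key=sketch_key))
# Consonants with diacritics sort after every word starting with the base letter.
print(sorted(["ča", "cz", "č", "ca"], key=sketch_key))
```

Because the secondary weights are compared only after all primary weights tie, "āa" lands between "aa" and "ab", while "č" follows "cz".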

In addition, the current version of the Latvian locale contains the letter R WITH CEDILLA (U0156, U0157), which is currently sorted separately from the letter R with other diacritical marks.  This letter is no longer used in Latvian writing in Latvia (it was used in the first half of the 20th century and is still used by some Latvian communities outside Latvia), so the sorting rules for it are not obvious.  I think it would be better to make the first weight of R WITH CEDILLA equal to that of R, because most current users of Latvian cannot tell when to use R with cedilla instead of R.

Finally, the current version of the Latvian locale sorts capital letters before small letters, which is inconsistent with the ISO14651 rules used by many glibc locales; some users complain about that too.

Comment 1 Mike FABIAN 2017-10-30 07:51:55 UTC

The CLDR collation rules for Latvian look like this:

http://unicode.org/cldr/trac/browser/trunk/common/collation/lv.xml

Comment 2 Mike FABIAN 2017-11-20 08:49:08 UTC

Created attachment 10623
0001-lv_LV-locale-fix-collation-BZ-15537.patch

Order without my patch:

$ LC_ALL=lv_LV.UTF-8 ls
Ʒ  a   Aa  æ   Āb  c  D  i   Y   yb  Īb  ĵa  L  ņ   ra  Ŗa  Sa  š   T   Zb  ža
ʒ  ʒa  aa  Ā   āb  Ç  Ģ  Ia  y   Ī   īb  Ĵb  Ļ  O   Rb  ŗa  sa  Ša  Z   zb  Žb
ȥ  Ʒa  Ab  ā   ʒb  ç  ģ  ia  Ya  ī   Ĵ   ĵb  ļ  Ø   rb  Ŗb  Sb  ša  z   Ž   žb
Ȥ  Å   ab  Āa  Ʒb  Č  H  Ib  ya  Īa  ĵ   Ķ   M  ø   Ŗ   ŗb  sb  Šb  Za  ž
A  å   Æ   āa  C   č  I  ib  Yb  īa  Ĵa  ķ   Ņ  Ra  ŗ   S   Š   šb  za  Ža
$

Order with my patch:

bash-4.4# LC_ALL=lv_LV.UTF-8 ls 
a  Ā   ab  Æ  č  H  y	Īa  īb	Ĵ   ķ  M  Ø   ŗ   Ŗb  Sb  šb  Z   zb  Ža  ʒa
A  aa  Ab  c  Č  i  Y	ya  Īb	ĵa  Ķ  ņ  ra  Ŗ   S   š   Šb  ȥ   Zb  žb  Ʒa
å  Aa  āb  C  D  I  ia	Ya  yb	Ĵa  L  Ņ  Ra  ŗa  sa  Š   t   Ȥ   ž   Žb  ʒb
Å  āa  Āb  ç  ģ  ī  Ia	ib  Yb	ĵb  ļ  O  rb  Ŗa  Sa  ša  T   za  Ž   ʒ   Ʒb
ā  Āa  æ   Ç  Ģ  Ī  īa	Ib  ĵ	Ĵb  Ļ  ø  Rb  ŗb  sb  Ša  z   Za  ža  Ʒ
bash-4.4#
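
The same ordering can also be checked without creating files, by sorting strings through the C library's collation from Python. This is a minimal sketch: `sort_with_locale` is a hypothetical helper, and it assumes the lv_LV.UTF-8 locale is installed on the system (it returns None otherwise). The result depends on the installed glibc version, so no expected output is given.

```python
import locale

def sort_with_locale(words, name="lv_LV.UTF-8"):
    """Sort words using the system collation for `name`; None if it is unavailable."""
    previous = locale.setlocale(locale.LC_COLLATE)    # remember the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return None                                   # locale not installed
    try:
        return sorted(words, key=locale.strxfrm)      # strxfrm applies LC_COLLATE rules
    finally:
        locale.setlocale(locale.LC_COLLATE, previous) # restore the previous setting

print(sort_with_locale(["āb", "aa", "Ab", "ā", "a", "Āa", "ŗ", "ra", "S"]))
```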

Comment 3 Mike FABIAN 2017-11-20 08:57:22 UTC

(In reply to alexander smishlajev from comment #0)

> Besides, current version of Latvian locale contains letter R WITH CEDILLA
> (U0156, U0157), which is now sorted separately from letter R with other
> diacritical marks.  This letter is not currently used for Latvian writing in
> Latvia (it was used in the first half of the 20th century, and is still used
> by some Latvian communities outside Latvia), so the sorting rules for this
> letter are not obvious.  I think that it would be better to make the first
> weight for letter R WITH CEDILLA equal to R because most of current Latvian
> language users cannot say when to use R with cedilla instead of R.

My patch fixes the problems you report, *except* the problem you
report about R WITH CEDILLA.

I fixed it by throwing away all the existing rules in LC_COLLATE in the
lv_LV locale and doing a

copy "iso14651_t1"

instead to include the default sort order.

Then, on top of the default sort order I implemented the same
rules as in

http://unicode.org/cldr/trac/browser/trunk/common/collation/lv.xml

This collation data from CLDR treats R WITH CEDILLA as different from R
at the primary level, i.e. it continues to sort it the same way as the
current lv_LV locale in glibc does.

I don’t want to deviate from the CLDR collation data for no good reason,
so if this is really wrong it would be good to report a bug
against CLDR. But I guess it is correct because it cites
a Latvian dictionary as a reference.
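
The idea of layering language rules on top of a default order can be modelled in a few lines of Python. This is only an illustration, not the locale file syntax and not the contents of the patch: `tailor`, `DEFAULT_ORDER`, and `LV_RULES` are hypothetical names, the plain ASCII alphabet stands in for the default order, and the rule list is an illustrative subset in which each new letter becomes a separate primary letter right after its base.

```python
import string

def tailor(order, rules):
    """Return a copy of `order` with each new letter inserted right after its anchor."""
    result = list(order)
    for anchor, new_letter in rules:
        result.insert(result.index(anchor) + 1, new_letter)
    return result

# Stand-in for the default order (assumption: lowercase ASCII only).
DEFAULT_ORDER = list(string.ascii_lowercase)

# Illustrative subset of rules: each letter is primary-different from its base.
LV_RULES = [("r", "ŗ"), ("c", "č"), ("s", "š")]

LV_ORDER = tailor(DEFAULT_ORDER, LV_RULES)
RANK = {letter: i for i, letter in enumerate(LV_ORDER)}

def primary_key(word):
    """Map each letter to its rank in the tailored order (case is ignored)."""
    return [RANK[ch] for ch in word.lower()]

print(sorted(["sa", "ŗa", "rb", "ra", "ŗ"], key=primary_key))
```

In this model every word starting with "r" sorts before any word starting with "ŗ", which is what a primary difference means.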

Comment 4 Sourceware Commits 2017-11-22 05:05:11 UTC

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  4b7af5fca7db9fe1f4c078c57f20a08e2a1e2404 (commit)
      from  922bb78c0c074aaeaa9f0312195b717674ed7430 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4b7af5fca7db9fe1f4c078c57f20a08e2a1e2404

commit 4b7af5fca7db9fe1f4c078c57f20a08e2a1e2404
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Fri Nov 17 10:54:52 2017 +0100

    lv_LV locale: fix collation [BZ #15537]
    
    	[BZ #15537]
    	* localedata/locales/lv_LV (LC_COLLATE): Fix collation by
    	using “copy "iso14651_t1"” and then implementing the
    	collation rules for lv from CLDR on top of that.
    	* Makefile: Add lv_LV.UTF-8 to test-input and to the list
    	of locales to be built for testing.
    	* lv_LV.UTF-8.in: New file with test data to test the Latvian
    	sorting.
    
    Reviewed-by: Carlos O'Donell <carlos@redhat.com>

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                 |   11 +
 localedata/Makefile       |    4 +-
 localedata/locales/lv_LV  | 2107 +-------------------------------------------
 localedata/lv_LV.UTF-8.in |  105 +++
 4 files changed, 166 insertions(+), 2061 deletions(-)
 create mode 100644 localedata/lv_LV.UTF-8.in

Comment 5 Mike FABIAN 2017-11-22 05:14:14 UTC

Fixed in glibc master.