The Latvian locale for Latvia has the wrong collation order for the Latvian vowels with macrons: A MACRON (U0100, U0101), E MACRON (U0112, U0113), I MACRON (U012A, U012B), O MACRON (U014C, U014D), and U MACRON (U016A, U016B). The first weight specifier for these letters should be equal to that of the base letter (A, E, I, O, and U, respectively), and only the second weight specifier should be heavier. In other words, a letter with a macron is sorted after the same letter without a macron only when the two strings are otherwise equal. Note that the diacritical consonants - C CARON, G CEDILLA, K CEDILLA, L CEDILLA, N CEDILLA, S CARON, and Z CARON - are always sorted after their base letters; for these letters the first weight specifier must be different, and that is correct in the current version of the Latvian locale. Besides, the current version of the Latvian locale contains the letter R WITH CEDILLA (U0156, U0157), which is now sorted separately from the letter R with other diacritical marks. This letter is not currently used for Latvian writing in Latvia (it was used in the first half of the 20th century, and is still used by some Latvian communities outside Latvia), so the sorting rules for this letter are not obvious. I think that it would be better to make the first weight for R WITH CEDILLA equal to that of R, because most current users of Latvian cannot say when to use R with cedilla instead of R. Finally, the current version of the Latvian locale sorts capital letters before small letters, which is not consistent with the ISO14651 rules used by many glibc locales; some users complain about that too.
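To make the requested behaviour concrete, here is a minimal Python sketch of a two-level sort key. It is only a simplified model of the rule described above, not how glibc implements collation: the names `PRIMARY_ORDER`, `MACRON_BASE` and `collation_key` are made up for illustration, case is ignored, and only the letters listed in `PRIMARY_ORDER` are handled.

```python
import unicodedata

# Hypothetical, simplified model (not glibc's algorithm).
# Primary order: consonants with diacritics get their own slot
# right after their base letter.
PRIMARY_ORDER = "abcčdefgģhijkķlļmnņoprsštuvzž"

# Vowels with a macron share the primary weight of their base vowel
# and differ only at the second level.
MACRON_BASE = {"ā": "a", "ē": "e", "ī": "i", "ō": "o", "ū": "u"}


def collation_key(word):
    """Return (primary weights, secondary weights) for a lowercase-folded word.

    Raises ValueError for characters outside PRIMARY_ORDER.
    """
    word = unicodedata.normalize("NFC", word).lower()
    primary = []
    secondary = []
    for ch in word:
        base = MACRON_BASE.get(ch, ch)
        primary.append(PRIMARY_ORDER.index(base))
        secondary.append(1 if ch in MACRON_BASE else 0)
    # Tuples compare level by level: all primary weights first,
    # secondary weights only break ties.
    return (tuple(primary), tuple(secondary))


if __name__ == "__main__":
    print(sorted(["āb", "ab", "aa", "āa"], key=collation_key))
    print(sorted(["ča", "cb", "ca"], key=collation_key))
```

With this key the macron only matters when the base letters of the whole strings are equal, while a consonant such as č always sorts after every string that has c in the same position.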
The CLDR collation rules for Latvian look like this: http://unicode.org/cldr/trac/browser/trunk/common/collation/lv.xml
Created attachment 10623
0001-lv_LV-locale-fix-collation-BZ-15537.patch

Order without my patch:

$ LC_ALL=lv_LV.UTF-8 ls
Ʒ a Aa æ Āb c D i Y yb Īb ĵa L ņ ra Ŗa Sa š T Zb ža ʒ ʒa aa Ā āb Ç Ģ Ia y Ī īb Ĵb Ļ O Rb ŗa sa Ša Z zb Žb ȥ Ʒa Ab ā ʒb ç ģ ia Ya ī Ĵ ĵb ļ Ø rb Ŗb Sb ša z Ž žb Ȥ Å ab Āa Ʒb Č H Ib ya Īa ĵ Ķ M ø Ŗ ŗb sb Šb Za ž A å Æ āa C č I ib Yb īa Ĵa ķ Ņ Ra ŗ S Š šb za Ža
$

Order with my patch:

bash-4.4# LC_ALL=lv_LV.UTF-8 ls
a Ā ab Æ č H y Īa īb Ĵ ķ M Ø ŗ Ŗb Sb šb Z zb Ža ʒa A aa Ab c Č i Y ya Īb ĵa Ķ ņ ra Ŗ S š Šb ȥ Zb žb Ʒa å Aa āb C D I ia Ya yb Ĵa L Ņ Ra ŗa sa Š t Ȥ ž Žb ʒb Å āa Āb ç ģ ī Ia ib Yb ĵb ļ O rb Ŗa Sa ša T za Ž ʒ Ʒb ā Āa æ Ç Ģ Ī īa Ib ĵ Ĵb Ļ ø Rb ŗb sb Ša z Za ža Ʒ
bash-4.4#
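The same kind of check can be done without creating files for `ls`. Here is a small Python sketch that sorts a word list with the collation of an installed locale. The helper name `sort_with_locale` is made up for this example, and it assumes the `lv_LV.UTF-8` locale has been built and installed on the system; if it is not available, the helper returns `None`. What order you get depends on the locale data installed.

```python
import locale


def sort_with_locale(words, locale_name="lv_LV.UTF-8"):
    """Sort words using LC_COLLATE of locale_name; return None if it is unavailable."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None  # locale not installed on this system
    try:
        # strxfrm turns each string into a key that compares
        # according to the active LC_COLLATE rules.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore the old setting


if __name__ == "__main__":
    print(sort_with_locale(["āb", "ab", "aa", "āa", "ča", "cb", "ca"]))
```

Running it once against the old locale data and once against the patched data should show the same difference as the two `ls` listings above.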
(In reply to alexander smishlajev from comment #0)
> Besides, current version of Latvian locale contains letter R WITH CEDILLA
> (U0156, U0157), which is now sorted separately from letter R with other
> diacritical marks. This letter is not currently used for Latvian writing in
> Latvia (it was used in the first half of the 20th century, and is still used
> by some Latvian communities outside Latvia), so the sorting rules for this
> letter are not obvious. I think that it would be better to make the first
> weight for letter R WITH CEDILLA equal to R because most of current Latvian
> language users cannot say when to use R with cedilla instead of R.

My patch fixes the problems you report, *except* the one about R WITH CEDILLA.

I fixed them by throwing away all the existing rules in LC_COLLATE in the lv_LV locale and doing a copy "iso14651_t1" instead to include the default sort order. Then, on top of the default sort order, I implemented the same rules as in http://unicode.org/cldr/trac/browser/trunk/common/collation/lv.xml

This collation data from CLDR treats R WITH CEDILLA as primary different from R, i.e. it continues to sort it the same way as the current lv_LV locale in glibc does.

I don’t want to deviate from the CLDR collation data for no good reason, so if this is really wrong it would be good to report a bug against CLDR. But I guess it is correct because it cites a Latvian dictionary as a reference.
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources".

The branch, master has been updated
  via 4b7af5fca7db9fe1f4c078c57f20a08e2a1e2404 (commit)
  from 922bb78c0c074aaeaa9f0312195b717674ed7430 (commit)

Those revisions listed above that are new to this repository have not appeared on any other notification email; so we list those revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4b7af5fca7db9fe1f4c078c57f20a08e2a1e2404

commit 4b7af5fca7db9fe1f4c078c57f20a08e2a1e2404
Author: Mike FABIAN <mfabian@redhat.com>
Date: Fri Nov 17 10:54:52 2017 +0100

    lv_LV locale: fix collation [BZ #15537]

    [BZ #15537]
    * localedata/locales/lv_LV (LC_COLLATE): Fix collation by using “copy "iso14651_t1"” and then implementing the collation rules for lv from CLDR on top of that.
    * Makefile: Add lv_LV.UTF-8 to test-input and to the list of locales to be built for testing.
    * lv_LV.UTF-8.in: New file with test data to test the Latvian sorting.

    Reviewed-by: Carlos O'Donell <carlos@redhat.com>

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog | 11 +
 localedata/Makefile | 4 +-
 localedata/locales/lv_LV | 2107 +-------------------------------------------
 localedata/lv_LV.UTF-8.in | 105 +++
 4 files changed, 166 insertions(+), 2061 deletions(-)
 create mode 100644 localedata/lv_LV.UTF-8.in
Fixed in glibc master.