Bug 22469 - pl_PL LC_COLLATE does not use i18n
Summary: pl_PL LC_COLLATE does not use i18n
Alias: None
Product: glibc
Classification: Unclassified
Component: localedata
Version: unspecified
Importance: P2 normal
Target Milestone: 2.27
Assignee: Mike FABIAN
Depends on:
Reported: 2017-11-21 03:53 UTC by Mike FABIAN
Modified: 2017-12-01 01:09 UTC
CC List: 3 users

See Also:
Last reconfirmed:
fweimer: security-

0001-pl_PL-locale-Base-collation-on-iso14651_t1.patch (14.40 KB, patch)
2017-11-23 10:42 UTC, Mike FABIAN

Description Mike FABIAN 2017-11-21 03:53:37 UTC
localedata/locales/pl_PL does not build upon localedata/locales/i18n, missing all updates from there.
Comment 1 Mike FABIAN 2017-11-23 10:42:15 UTC
Created attachment 10630

This patch uses “copy "iso14651_t1"”

and then implements the collation rules for Polish from CLDR on top of that, see:


It also adds some rules to handle spaces so as not
to cause a regression for bug 388, see:

Comment 2 cvs-commit@gcc.gnu.org 2017-11-24 05:05:24 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  3ffc4cc1ad37fb36e419c9a3a72e1916d7d893d3 (commit)
      from  3a327316ad615f7e4264d3e13d23052d9dc84694 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------

commit 3ffc4cc1ad37fb36e419c9a3a72e1916d7d893d3
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Mon Nov 20 17:55:33 2017 +0530

    pl_PL locale: Base collation on iso14651_t1
    	[BZ #22469]
    	* localedata/locales/pl_PL (LC_COLLATE): Use “copy "iso14651_t1"”
    	and implement the collation rules for pl from CLDR on top of that.
    	* Makefile: Add pl_PL.UTF-8 to test-input and to the list
    	of locales to be built for testing.
    	* pl_PL.UTF-8.in: New file with test data to test the Polish sorting.


Summary of changes:
 ChangeLog                 |    9 +
 localedata/Makefile       |    6 +-
 localedata/locales/pl_PL  | 2116 ++-------------------------------------------
 localedata/pl_PL.UTF-8.in |  162 ++++
 4 files changed, 231 insertions(+), 2062 deletions(-)
 create mode 100644 localedata/pl_PL.UTF-8.in
Comment 3 Mike FABIAN 2017-11-24 05:06:25 UTC
Fixed in glibc master
Comment 4 Mike FABIAN 2017-11-24 05:07:36 UTC
Comment 5 Rafal Luzynski 2017-12-01 01:09:14 UTC
For the record and for future reference: Polish alphabetical sorting is standardized by PN-80/N-01223 (issued by the Polish Committee for Standardization). Some of its rules:

1. Alphabetical order must follow the Polish alphabet with the letters q, v, x added.
2. Non-Polish diacritical characters are ignored, e.g.: Hašek < Hass.
2a. It is also allowed to ignore Polish diacritical characters (although nobody seems to apply this rule; Polish diacritical characters are always respected).
3. Spaces and punctuation characters sort before the letters, e.g.: "mur z cegły" < "murawa".
4. A lowercase letter sorts before the corresponding uppercase letter, e.g.: arab < Arab.
5. Numbers (including spelled-out ones) must be sorted according to their numerical value and placed before the letters, e.g.: 1 < 5 < ósmy < trzynaście < 17 < XXI < Agnieszka < Antoni ... (This rule is difficult to implement, let's skip it.)
6. The placement of the Icelandic letter Þ (Thorn) is not regulated, but the Icelandic alphabet places it at the end, after Z. We are encouraged to follow this rule as well, e.g.: X < Y < Z < Þ.

Source: https://pl.wikipedia.org/wiki/Porz%C4%85dek_alfabetyczny
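
To make rules 1 to 4 and 6 above concrete, here is a minimal Python sketch of a sort key that follows them. It is only an illustration and not how glibc or the committed pl_PL locale implements collation; the names POLISH, RANK, base_letters and polish_key are made up for this example. Rule 5 is skipped, and digits are simply treated like punctuation, which is an assumption of this sketch.

```python
import unicodedata

# Rule 1: Polish alphabet with q, v, x added; rule 6: thorn placed after z/ź/ż.
POLISH = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźżþ"
# Rank 0 is reserved for everything that is not one of these letters.
RANK = {ch: i + 1 for i, ch in enumerate(POLISH)}


def base_letters(ch):
    """Return the lowercase letter(s) used for the primary comparison."""
    low = ch.lower()
    if low in RANK:
        return low  # Polish letters keep their diacritics
    # Rule 2: strip combining marks from non-Polish letters (š -> s).
    decomposed = unicodedata.normalize("NFD", low)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def polish_key(s):
    """Hypothetical sort key: primary letter ranks, then a case tie-breaker."""
    primary = []
    case = []
    for ch in s:
        for b in base_letters(ch):
            # Rule 3: spaces and punctuation (and, in this sketch, digits)
            # get rank 0, so they sort before every letter.
            primary.append(RANK.get(b, 0))
        # Rule 4: False < True, so lowercase sorts before uppercase on ties.
        case.append(ch.isupper())
    return (primary, case)


words = ["murawa", "Arab", "Hass", "mur z cegły", "arab", "Hašek"]
print(sorted(words, key=polish_key))
# ['arab', 'Arab', 'Hašek', 'Hass', 'mur z cegły', 'murawa']
```

The case information is kept in a separate list so that it only matters when the letters themselves compare equal, which is the same idea as the multi-level weights used in collation tables.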

Another scientific source says that Polish has two sorting conventions: in dictionaries, spaces and punctuation characters are ignored (letter-by-letter order), whereas in encyclopedias they are not (word-by-word order). With the dictionary convention, people who don't know whether the correct spelling is „na pewno” or „napewno” will still find the word ("na pewno" == "napewno"). With the encyclopedia convention, all monarchs named Jan are grouped together: "Jan III Sobieski" < "Jan XXIII" < "Janina". We can't implement both conventions at once; here we have implemented the word-by-word rule, and it is correct. The same was requested in bug 388.

Source: https://sjp.pwn.pl/poradnia/haslo/porzadek-alfabetyczny-ale-jaki;16226.html
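
The difference between the two conventions can be shown with a small Python sketch. It is a simplification for illustration only: both key functions are hypothetical names, they compare letters by code point after lowercasing (which is adequate for the ASCII examples here but is not correct Polish alphabet order), and they have nothing to do with the actual locale source.

```python
def word_by_word_key(s):
    """Encyclopedia order: spaces and punctuation sort before any letter."""
    return [ord(ch.lower()) if ch.isalnum() else 0 for ch in s]


def letter_by_letter_key(s):
    """Dictionary order: spaces and punctuation are ignored entirely."""
    return [ord(ch.lower()) for ch in s if ch.isalnum()]


names = ["Janina", "Jan XXIII", "Jan III Sobieski"]

print(sorted(names, key=word_by_word_key))
# ['Jan III Sobieski', 'Jan XXIII', 'Janina']

print(sorted(names, key=letter_by_letter_key))
# ['Jan III Sobieski', 'Janina', 'Jan XXIII']

print(letter_by_letter_key("na pewno") == letter_by_letter_key("napewno"))
# True
```

With the word-by-word key the two monarchs stay adjacent, while the letter-by-letter key lets "Janina" slip in between them; conversely, only the letter-by-letter key treats "na pewno" and "napewno" as the same entry.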

One more source saying that non-Polish diacritical characters should be ignored: https://sjp.pwn.pl/poradnia/haslo/porzadek-alfabetyczny;4208.html