Bug 18587 - Minor collate issues in Hungarian locale
Summary: Minor collate issues in Hungarian locale
Status: RESOLVED DUPLICATE of bug 18934
Alias: None
Product: glibc
Classification: Unclassified
Component: localedata
Version: 2.21
Importance: P2 minor
Target Milestone: ---
Assignee: Not yet assigned to anyone
Depends on:
Reported: 2015-06-23 22:05 UTC by Egmont Koblinger
Modified: 2017-03-28 14:36 UTC
CC List: 1 user

See Also:
Last reconfirmed:
fweimer: security-

Fix (1.40 KB, patch)
2015-06-23 22:05 UTC, Egmont Koblinger

Description Egmont Koblinger 2015-06-23 22:05:26 UTC
Created attachment 8385

There are two minor issues with the Hungarian locale when sorting strings that only differ in their case. Please apply the attached patch to fix them.

Issue 1:

Most of the time the lowercase counterpart is sorted before the uppercase one; however, this does not hold for the double consonants: "CS" is currently sorted before "Cs", and the same applies to all the other double consonants (dz, gy, ...; there are 8 of them in total).

To test:

LC_ALL=hu_HU.UTF-8 sort -k 1,1 -s << END
cs 1
cS 2
Cs 3
CS 4
END

Expected output: according to the numbers. Current output: in the order 1 2 4 3.

The fix copies the pattern found at the only triple consonant "dzs", by using the new <MIN-MIN> or <CAP-CAP> instead of <MIN> or <CAP> to explicitly denote the case of both of the codepoints in the compound letter. This also makes the file's layout more nicely tabulated and easier to read.

Issue 2:

When the only triple letter "dzs" is pronounced long, it is spelled "ddzs". However, due to obvious typos that use <CAP-x-y> instead of <MIN-x-y> (I may have introduced this mistake myself a long time ago; I can't remember), the case of the second "d" is ignored rather than lowercase being sorted before uppercase.

To test:

LC_ALL=hu_HU.UTF-8 sort -k 1,1 -s << END
DDzs 2
Ddzs 1
DDzs 3
END

Expected output: according to the numbers. Actual output: unchanged order, proving that they all compare equal.
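
Both checks above can be wrapped in one small script. This is a minimal sketch and not part of the attached patch: sorted_labels is a hypothetical helper name, and the script assumes a POSIX shell plus a sort that supports -s (as GNU coreutils does). It prints the numeric labels in sorted order. Going by this report, a fixed locale should print 1 2 3 4 and 1 2 3, while an affected one prints 1 2 4 3 and 2 1 3.

```shell
#!/bin/sh
# Hypothetical helper (not from the patch): sort stdin on the first field
# under the locale given as $1, then print the second field of every line
# on a single line, separated by spaces.
sorted_labels() {
    LC_ALL="$1" sort -k 1,1 -s |
        awk '{ printf "%s%s", sep, $2; sep = " " } END { print "" }'
}

# Only run the Hungarian checks when the locale is actually installed;
# glibc systems usually list it as hu_HU.utf8.
if locale -a 2>/dev/null | grep -Eqi '^hu_HU\.utf-?8$'; then
    printf 'Issue 1: '
    printf '%s\n' 'cs 1' 'cS 2' 'Cs 3' 'CS 4' | sorted_labels hu_HU.UTF-8
    printf 'Issue 2: '
    printf '%s\n' 'DDzs 2' 'Ddzs 1' 'DDzs 3' | sorted_labels hu_HU.UTF-8
else
    echo 'hu_HU.UTF-8 is not installed; skipping the checks' >&2
fi
```

The -s flag matters here: it disables sort's last-resort byte comparison, so lines whose keys compare equal under the locale keep their input order instead of being silently reordered.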

On a slightly related note: the new version of the Hungarian spelling rules is planned to be released this September [1], replacing the current 30-year-old version. The old version's section on alphabetical sorting doesn't say what to do when only the case differs. Allegedly the new version will specify that lowercase is sorted first, followed by uppercase: [2] -> "arany, Arany", which is what the current locale already implements, apart from these bugs. So this patch also prepares for the new rules.

[1] http://mta.hu/mta_hirei/szeptemberben-jelenik-meg-a-magyar-helyesiras-szabalyai-tizenkettedik-kiadasa-136386/
[2] http://www.nyest.hu/hirek/mi-ujsag-a-helyesirasban
Comment 1 Egmont Koblinger 2015-09-08 08:38:43 UTC
I discovered other bugs as well, and created a patch that not only addresses all of them but also adds extensive test coverage. I didn't want to pollute this bug by squeezing new issues into it, so I decided to create a new one.

Let's mark this bug as obsoleted by bug 18934.

*** This bug has been marked as a duplicate of bug 18934 ***
Comment 2 cvs-commit@gcc.gnu.org 2017-03-28 14:36:39 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  ea1898dded26316e2e73adfb409224e864ffaa8b (commit)
      from  78c05814320cdc3377347f8e5fdbaa7cf5abf5b5 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------

commit ea1898dded26316e2e73adfb409224e864ffaa8b
Author: Egmont Koblinger <egmont@gmail.com>
Date:   Wed Mar 22 21:27:30 2017 -0400

    localedata: hu_HU: fix multiple sorting bugs (bug 18934)
    Fix the incorrect sorting order of a digraph and its geminated variant,
    regression introduced by a faulty fix to bug 13547 in commit
    Fix two inconsistencies in sorting unusual capitalization of digraphs
    (bug #18587).
    Enable DIACRIT_FORWARD to work around bug #17750.
    Sort foreign accents after the Hungarian ones.
    Add extensive unittests containing all the examples from The Rules of
    Hungarian Orthography and many more, including explanatory comments.


Summary of changes:
 NEWS                     |    4 +
 localedata/ChangeLog     |    7 +
 localedata/Makefile      |    4 +-
 localedata/hu_HU.in      |  560 ++++++++++++++++++++++++++++++++++++++++++++++
 localedata/locales/hu_HU |  286 ++++++++++++------------
 5 files changed, 716 insertions(+), 145 deletions(-)
 create mode 100644 localedata/hu_HU.in