Bug 13547

Summary: Different strings collate as equal in Hungarian
Product: glibc
Reporter: Egmont Koblinger <egmont>
Component: localedata
Assignee: GNU C Library Locale Maintainers <libc-locales>
Status: RESOLVED FIXED
Severity: normal
CC: drepper.fsp
Priority: P2
Flags: fweimer: security-
Version: 2.14
Target Milestone: ---
See Also: https://sourceware.org/bugzilla/show_bug.cgi?id=18934
Host: Target:
Build: Last reconfirmed:
Attachments: collate fix for Hungarian
collate fix for Hungarian

Description Egmont Koblinger 2012-01-03 00:17:49 UTC
Created attachment 6139
collate fix for Hungarian

Please apply the attached patch to the Hungarian locale definition.

Using the current definition, certain distinct strings collate as equal, e.g. strcoll("ccs", "cscs") returns zero. This confuses programs such as sort (the relative order of such lines is undefined and may vary from run to run) and uniq (different lines are reported as equal).
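
A quick way to check the behaviour on a given system is sketched below. It is a minimal sketch, not part of the patch: the helper name hungarian_strcoll and the locale name "hu_HU.UTF-8" are assumptions, and the result depends on the installed glibc version and on whether that locale is available at all.

```python
import locale


def hungarian_strcoll(a, b, name="hu_HU.UTF-8"):
    """Compare a and b under the given collation locale.

    Returns the strcoll() result, or None if the locale is not installed.
    The locale name is an assumption; adjust it to what `locale -a` lists.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return None
    try:
        return locale.strcoll(a, b)
    finally:
        # Restore the previous collation locale for the rest of the process.
        locale.setlocale(locale.LC_COLLATE, saved)


# A result of 0 means the two different strings collate as equal.
print(hungarian_strcoll("ccs", "cscs"))
```

No expected output is shown because it varies by system: an affected locale definition reports 0, a fixed one reports a nonzero value, and None means the locale is missing.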

The given patch addresses this problem and makes them collate as different, without modifying the actual sorting order of valid Hungarian words.

The problem in more detail:

Hungarian has compound letters (digraphs) such as "cs", comparable to "sh" in English. When such a letter is pronounced long, it is written in the shorthand form "ccs" (only the first letter is doubled) rather than "cscs".

Currently "ccs" is tokenized as <cs><cs>, which is correct, but "cscs" (not used in valid Hungarian words, but it might occur in text files anyway) is also tokenized as <cs><cs>, hence they collate as equal.

The solution is to tokenize "ccs" as <c_or_cs><cs>, and reorder the tokens like <a> <b> <c> <c_or_cs> <cs> <d> ...
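
The tokenization described above can be illustrated with a toy model. This is only a sketch of the idea, not glibc's collation code: the function tokenize, its patched flag, and the DIGRAPHS list are hypothetical names introduced for illustration.

```python
# Toy model of the tokenization described above; not glibc's actual code.
# Longest digraph first, so that "dzs" is tried before "dz".
DIGRAPHS = ["dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs"]


def geminated(digraph):
    # Shorthand for a long digraph: only the first letter is doubled.
    return digraph[0] + digraph


def tokenize(word, patched):
    """Split word into collation tokens.

    patched=False models the old definition ("ccs" -> cs, cs).
    patched=True models the proposed one ("ccs" -> c_or_cs, cs).
    """
    tokens = []
    i = 0
    while i < len(word):
        for d in DIGRAPHS:
            g = geminated(d)
            if word.startswith(g, i):
                first = f"{d[0]}_or_{d}" if patched else d
                tokens += [first, d]
                i += len(g)
                break
            if word.startswith(d, i):
                tokens.append(d)
                i += len(d)
                break
        else:
            # No digraph starts here: a plain single letter.
            tokens.append(word[i])
            i += 1
    return tokens


print(tokenize("ccs", patched=False), tokenize("cscs", patched=False))
# ['cs', 'cs'] ['cs', 'cs']
print(tokenize("ccs", patched=True), tokenize("cscs", patched=True))
# ['c_or_cs', 'cs'] ['cs', 'cs']
```

With the old scheme both inputs yield the same token sequence, which is why they compare as equal; with the proposed scheme the sequences differ in the first token.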

The problem was originally discovered at http://hup.hu/node/110267 (forum in Hungarian).

Comment 1 Egmont Koblinger 2012-01-03 00:28:36 UTC
Created attachment 6140
collate fix for Hungarian

Comment 2 Ulrich Drepper 2012-01-07 16:05:07 UTC
I added the patch.

Comment 3 Egmont Koblinger 2015-09-08 08:35:49 UTC
Please note that the patch applied here was incorrect. It fixed a corner case but broke a more general one.

By tokenizing "ssz" as <s_or_sz><sz> rather than <sz><sz>, and ordering the tokens as <s> < <s_or_sz> < <sz>, the corner case when the only difference in the two words is "ssz" vs. "szsz" is fixed.

However, sorting of e.g. "kasza" <k><a><sz><a> vs. "kassza" <k><a><s_or_sz><sz><a> became broken. The correct ordering would be "kasza" < "kassza" (since it's actually <k><a><sz><sz><a>), but with the current solution they're ordered backwards (due to <s_or_sz> preceding <sz>).
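
The regression can be reproduced with a small model of the token ranks. This is a sketch for illustration only: the RANK table and the key function are hypothetical, and only the relative order <s> < <s_or_sz> < <sz> is taken from the description above.

```python
# Token ranks restricted to the tokens needed for this example.
# Only s < s_or_sz < sz comes from the comment; the rest is illustrative.
RANK = {"a": 0, "k": 1, "s": 2, "s_or_sz": 3, "sz": 4}


def key(tokens):
    # Collation key: the list of token ranks, compared lexicographically.
    return [RANK[t] for t in tokens]


kasza = ["k", "a", "sz", "a"]
kassza_patched = ["k", "a", "s_or_sz", "sz", "a"]  # after the 2012 patch
kassza_intended = ["k", "a", "sz", "sz", "a"]      # what it really means

print(key(kasza) < key(kassza_intended))  # True: the correct order
print(key(kasza) < key(kassza_patched))   # False: ordered backwards
```

The third token decides the comparison: <sz> in "kasza" is ranked after <s_or_sz> in "kassza", so the shorter word sorts last.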

The solution is to tokenize both "ssz" and "szsz" as <sz><sz> (as we did before), but to apply something weaker on top of them, along the lines of a "fake accent" (SINGLE-OR-COMPOUND vs. COMPOUND), so that the two are distinguished only at a later, lower-priority comparison level.
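
A two-level key shows how such a weaker distinction could work. This is a minimal sketch under stated assumptions, not the fix from bug 18934: the names COMPOUND, SINGLE_OR_COMPOUND, RANK and key are hypothetical, and which of the two marks sorts first is chosen arbitrarily here.

```python
# Secondary "fake accent" marks. Their relative order is an assumption;
# plain letters also carry the mark 0 in this toy model.
COMPOUND, SINGLE_OR_COMPOUND = 0, 1

RANK = {"a": 0, "k": 1, "s": 2, "sz": 3}


def key(tokens):
    """tokens is a list of (letter, mark) pairs.

    The primary level (letters) is compared first; the secondary level
    (marks) only matters when the primary levels are identical.
    """
    primary = [RANK[letter] for letter, _ in tokens]
    secondary = [mark for _, mark in tokens]
    return (primary, secondary)


ssz = [("sz", SINGLE_OR_COMPOUND), ("sz", COMPOUND)]
szsz = [("sz", COMPOUND), ("sz", COMPOUND)]
kasza = [("k", 0), ("a", 0), ("sz", COMPOUND), ("a", 0)]
kassza = [("k", 0), ("a", 0), ("sz", SINGLE_OR_COMPOUND),
          ("sz", COMPOUND), ("a", 0)]

print(key(ssz) != key(szsz))      # True: no longer collate as equal
print(key(kasza) < key(kassza))   # True: decided at the primary level
```

Because both spellings share the primary key <sz><sz>, ordinary words keep their order, and the mark acts only as a tie-breaker.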

Let's leave this bug closed. A fix is available in bug 18934.

Comment 4 Sourceware Commits 2017-03-28 14:36:39 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  ea1898dded26316e2e73adfb409224e864ffaa8b (commit)
      from  78c05814320cdc3377347f8e5fdbaa7cf5abf5b5 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=ea1898dded26316e2e73adfb409224e864ffaa8b

commit ea1898dded26316e2e73adfb409224e864ffaa8b
Author: Egmont Koblinger <egmont@gmail.com>
Date:   Wed Mar 22 21:27:30 2017 -0400

    localedata: hu_HU: fix multiple sorting bugs (bug 18934)
    
    Fix the incorrect sorting order of a digraph and its geminated variant,
    regression introduced by a faulty fix to bug 13547 in commit
    b008d4c85619a753e441d7f473ba8af0db400bd6.
    
    Fix two inconsistencies in sorting unusual capitalization of digraphs
    (bug #18587).
    
    Enable DIACRIT_FORWARD to work around bug #17750.
    
    Sort foreign accents after the Hungarian ones.
    
    Add extensive unittests containing all the examples from The Rules of
    Hungarian Orthography and many more, including explanatory comments.

-----------------------------------------------------------------------

Summary of changes:
 NEWS                     |    4 +
 localedata/ChangeLog     |    7 +
 localedata/Makefile      |    4 +-
 localedata/hu_HU.in      |  560 ++++++++++++++++++++++++++++++++++++++++++++++
 localedata/locales/hu_HU |  286 ++++++++++++------------
 5 files changed, 716 insertions(+), 145 deletions(-)
 create mode 100644 localedata/hu_HU.in