Bug 1430 - [2.4/2.5 regression] worse collation for hu_HU
Summary: [2.4/2.5 regression] worse collation for hu_HU
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: localedata
Version: 2.4
Importance: P2 critical
Target Milestone: ---
Assignee: GNU C Library Locale Maintainers
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-10-06 17:45 UTC by Egmont Koblinger
Modified: 2016-05-20 19:55 UTC

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments
dictionary (218.95 KB, image/jpeg)
2006-05-03 14:05 UTC, Egmont Koblinger
phonebook (845.03 KB, image/jpeg)
2006-05-03 14:07 UTC, Egmont Koblinger

Description Egmont Koblinger 2005-10-06 17:45:07 UTC
Please revert libc/localedata/locales/hu_HU revision 1.18, "Better collation".
It is not better, it is worse.

According to the Hungarian rules, aacute, eacute, iacute, oacute and uacute
must be treated the same as their unaccented counterparts; likewise, vowels with
diaeresis should be treated the same as their counterparts with double acutes.
In other words:
a = á < e = é < i = í < o = ó < ö = ő < u = ú < ü = ű

For example, the following is a correct alphabetical order:
ablak
állat
apa
áru
az

Vowels in the same equivalence class only make a difference if they are the
only letters that differ, e.g.:
Eger
egér
éger
eget
éget
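
The two-level rule described above can be sketched as a sort key. This is only an illustration, not the glibc locale definition: the names `BASE_ORDER`, `FOLD` and `hu_sort_key` are made up for this example, case is ignored, and the Hungarian digraphs (cs, gy, sz, ...) are not handled. The first level compares letters with each accented or long vowel folded into its equivalence class; the second level is consulted only when the first level ties.

```python
# Simplified alphabet: single letters only, digraphs are NOT handled (assumption).
BASE_ORDER = "abcdefghijklmnoöpqrstuüvwxyz"

# Each accented/long vowel and the vowel it shares an equivalence class with.
FOLD = {"á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u", "ő": "ö", "ű": "ü"}


def hu_sort_key(word):
    """Return a (primary, secondary) key for the simplified rule."""
    primary = []
    secondary = []
    for ch in word.lower():  # case is ignored in this sketch
        base = FOLD.get(ch, ch)
        # Raises ValueError for characters outside the simplified alphabet.
        primary.append(BASE_ORDER.index(base))
        # 1 marks an accented/long vowel; it only matters when primaries tie.
        secondary.append(1 if ch in FOLD else 0)
    return (tuple(primary), tuple(secondary))


words = ["éget", "az", "áru", "Eger", "apa", "éger", "állat", "eget", "ablak", "egér"]
print(sorted(words, key=hu_sort_key))
# ['ablak', 'állat', 'apa', 'áru', 'az', 'Eger', 'egér', 'éger', 'eget', 'éget']
```

With this key, "állat" sorts between "ablak" and "apa" because á and a tie at the first level, while "Eger", "egér" and "éger" are separated only by the second level.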

This was perfectly implemented in the previous version, as well as mentioned
in some comment lines within this file (which comment is still there although
it doesn't correspond to what's implemented right now).

I don't know who suggested the modifications in 1.18 or why, but they were
surely wrong. If needed, I can scan some pages of dictionaries or phone books
and upload them to prove these sorting rules.

If someone just happens to prefer sorting this way, then he is of course
absolutely free to create an own locale for himself, or set LC_COLLATE=C or
something similar, but there's hardly any place for that work in glibc. Glibc
should follow the national rules, and r1.18 was a move against it.


Ulrich, If I recall correctly, some years ago it was you to whom I sent the
hu_HU sorting rules which fixed some bugs. Then you asked me to manually
sort a lot of words you had previously received from some other Hungarian guy
and test whether glibc sorts it in the same order. Then glibc with those
Hungarian collating rules passed that test, but the new rules would obviously
fail on them. Do you happen to still have that file? (I don't think I have
it, but I'll take a look.)

I guess it would be a really wise move to put such kind of sorted files into
glibc's source and add a sorting test case for them.
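
A test of that kind could look roughly like the following sketch. It is not an existing glibc test; the helper name `first_misordered_pair` is hypothetical. It takes a word list that is already in the expected order and reports the first adjacent pair that a given key function would put the other way round.

```python
def first_misordered_pair(words, key):
    """Return the first adjacent pair that is out of order under `key`, or None."""
    for prev, cur in zip(words, words[1:]):
        if key(prev) > key(cur):
            return (prev, cur)
    return None


# Hand-sorted sample taken from the report above.
expected = ["ablak", "állat", "apa", "áru", "az"]

# Plain code-point order fails on this list, because "á" sorts after "z".
print(first_misordered_pair(expected, key=lambda w: w))
# ('állat', 'apa')
```

To check an installed locale instead, one could call `locale.setlocale(locale.LC_COLLATE, "hu_HU.UTF-8")` and pass `locale.strxfrm` as the key, assuming that locale is available on the system.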


Ps1: a and á, as well as e and é, are different sounds, so it's often argued
whether it's logical to put them in the same group; this is a tradition rather
than a logical decision. On the other hand, i and í, o and ó, ö and ő, u and ú, and
finally ü and ű are the same sounds, with the latter ones pronounced longer.
Crosswords and similar stuff treat a and á, and e and é differently, while the
other pairs are interchangeable there. But alphabetical sorting uses different
rules.

Ps2: All the words above in the examples are real Hungarian words.
Comment 1 Ulrich Drepper 2005-10-14 20:24:59 UTC
Discuss this with the other reporter and get back with the result.  I have no
reason to believe anyone over somebody else.
Comment 2 Egmont Koblinger 2005-10-17 07:57:14 UTC
Who is the other reporter? Please give me some contact info, I couldn't find
such an entry in this bugzilla.
Comment 3 Jakub Jelinek 2005-10-17 08:01:12 UTC
2005-07-26  Ulrich Drepper  <drepper@redhat.com>

        * locales/hu_HU: Better collation.
        Patch by Gyuro Lehel <lehel@freemail.hu>.
Comment 4 Egmont Koblinger 2005-10-18 11:06:20 UTC
I received a reply from Lehel. He writes (in Hungarian) that he doesn't want to
create an account in bugzilla because he receives twice as much spam since he
registered in redhat bugzilla. On the other hand he asked me to copy/paste
this text here:

Well, I do not argue the point, it was just the customers at my old job
who did not really like this kind of sorting. Maybe the solution could
be to add a locale that contains the alphabetical sorting and let the
users choose their preferred one.
Comment 5 Ulrich Drepper 2005-10-18 14:24:16 UTC
> Maybe the solution could be to add a locale that contains the alphabetical
> sorting and let the users choose their preferred one.

No, creating variant locales is not an option.  There is one and only one locale.
Comment 6 Egmont Koblinger 2005-10-18 14:29:48 UTC
> No, creating variant locales is not an option.

I perfectly agree, I also answered him this. (If there are 2 choices then
in a few minutes there'll be request for about 2^N choices where N keeps on
growing forever...).
Comment 7 Ulrich Drepper 2006-04-25 18:27:56 UTC
No response in 6+ months.  Closing.
Comment 8 Egmont Koblinger 2006-04-25 18:50:40 UTC
No response to what? Sorry, but I think that _I_'ve been waiting 6+ months for
_you_ to fix this bug.

I told you that Lehel agreed in private mail that he was wrong and I am right,
unfortunately I couldn't get him to comment here in bugzilla so I cannot prove
this, but I hope you do not think I'm lying; and it's not my fault that he is
not as co-operative as he should be.

In the original report I said "If needed, I can scan some pages of
dictionaries..." It's not easy for me to find access to a scanner but I am
happily willing to do this _if_ I know that it's needed to get this bug fixed.
But I still don't know if that would make you happy, you haven't replied
anything like "yes, scanning those pages would be cool".

I'll be back shortly with some scanned pages. If that's not enough then please,
please let me know what to do to prove I'm right.
Comment 9 Egmont Koblinger 2006-05-03 14:05:08 UTC
Created attachment 999
dictionary

A random page scanned from a Hungarian-German dictionary. Words beginning with
e and é appear in mixed order.
Comment 10 Egmont Koblinger 2006-05-03 14:07:16 UTC
Created attachment 1000
phonebook

The page where ö and ő start, scanned from a quite recent phonebook.
Comment 11 Egmont Koblinger 2006-05-03 14:13:10 UTC
I modified the scanned pictures due to potential privacy or legal problems.
I can send the unmodified versions in private e-mail, if required.

If you need any other proof, please let me know.
Comment 12 Egmont Koblinger 2006-11-16 11:52:02 UTC
No response in 6+ months.

Last time you closed this bug with this justification. Now _you_ haven't
replied in half a year, so please let me increase the severity (as requested in
the help pages of this bugzilla -- though I admit this is not a critical bug at
all, but somehow I'd like to draw your attention to it, and anyway your docs
say I should do this).

It is a regression anyway (now already present in 2 consecutive official
releases), and I see no reason why it couldn't be fixed quickly. I hope that
regression bugs are handled with higher priority (as is the case with many
other software projects).

In the meantime I also changed the summary according to the docs, HTH too.
Comment 13 Ulrich Drepper 2007-02-18 04:43:34 UTC
I reverted the patch.