[PATCH][BZ 18934] hu_HU: Fix multiple sorting bugs.

Wed Feb 1 01:56:00 GMT 2017

Hi Carlos,

> Thank you for your patience. Perhaps the best way to restart this conversation
> is to cover what, if any, review, the changes have received and reference old
> discussions about them.

I've described or linked all the individual bugs from the "meta"
Bugzilla bug #18934. You can also look up this thread in the mailing
list archive, although I doubt there is much additional information
there; I tried to make sure every important piece of information is
present in Bugzilla.

> * How does this compare to CLDR?

Unfortunately I have no information whatsoever about CLDR's Hungarian
collation implementation.

As much as it would be great to make sure both versions are correct
and pass this unittest, unfortunately CLDR is outside my area of
interest. I hope they'll notice our unittests and adjust their
implementation accordingly if required. Or we could just file an
"FYI-bug" against them to get them to look into it.

(When I originally created this patch, you could probably have
convinced me to take a look at CLDR too. Nowadays (i.e. for the next
couple of years) I have extremely limited free time due to personal
reasons. I pretty much just want to close out the pending issues (like
this one). I don't have time to pick up any nontrivial new task.)

> * Does the regression test pass?

What exactly do you mean by "regression tests"? There was no unittest
for hu_HU previously; the newly created one obviously passes during a
"make tests". I suspect you're referring to something else; if so
please clarify.

> * What kind of consequences might this have on existing programs?

There's nothing brand new and no "big" change in the collation
order. From the users' point of view, it's really a few "small" fixes
for a few rare corner cases.

(Let me give an interesting example. After 30 years, a new standard
for the Hungarian grammar rules was released in Sep 2015. The previous
one did not specify the collation order of the uppercase and lowercase
counterparts of the same letter. The new one does. By coincidence,
however, we did not need to change anything: the old implementation
just happens to be the one that the standard now specifies. Had it
been the other way around, it would probably have been a noticeable
"big" change.)

So the fixes only concern somewhat more special cases, and affect
only a tiny subset of actual words or artificially made-up
strings.

I recommend that you sort the unittest file with the old locale
definition (take care to remove the comments and trailing spaces if
you do it "manually" with "sort" rather than with glibc's "make
tests") and see the diff. Especially in the first part (the examples
from the official rules, rather than my tests which focus on the
corner cases) you won't see much difference.
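
For the "manual" route, a minimal sketch in Python could look like the
following. It is only an illustration, not part of the patch. The
helper names (clean_lines, sort_diff, diff_against_locale) are made up
for this example, the "#" comment marker is an assumption about the
test file's syntax, and "hu_HU.UTF-8" is assumed to be whichever
compiled hu_HU definition (old or new) is installed on the system.

```python
import difflib
import locale


def clean_lines(raw_lines, comment_char="#"):
    """Drop comments and trailing whitespace; skip lines left empty.

    The comment marker is an assumption; adjust it to the test file.
    """
    cleaned = []
    for line in raw_lines:
        text = line.split(comment_char, 1)[0].rstrip()
        if text:
            cleaned.append(text)
    return cleaned


def sort_diff(expected, key=None):
    """Re-sort the expected order with `key`; return a unified diff."""
    resorted = sorted(expected, key=key)
    return list(
        difflib.unified_diff(
            expected, resorted, "expected", "resorted", lineterm=""
        )
    )


def diff_against_locale(path, locale_name="hu_HU.UTF-8"):
    """Compare the file's order with the installed locale's collation."""
    # Raises locale.Error if the named locale is not installed.
    locale.setlocale(locale.LC_COLLATE, locale_name)
    with open(path, encoding="utf-8") as handle:
        expected = clean_lines(handle)
    return sort_diff(expected, key=locale.strxfrm)
```

Calling diff_against_locale() with the path of the unittest file
returns an empty list when the installed locale already sorts the file
in the expected order, and the differing lines otherwise.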

> * Can you find a Hungarian speaker to review and validate your changes?

I'm the person who contributed the last perhaps 5 or 6 (maybe even
more) changes to the locale file, some of them improving the
collation, some touching other parts. I also admit in the "meta" bug
that one of the changes did introduce a regression that I did not
notice then; I've fixed it now. None of those previous changes were
backed up by any tests. The new ones are, and my having introduced a
regression was a huge motivation for creating these tests.

As linked from the meta bug, someone introduced a change that broke
many locales, including Hungarian. I really doubt he was asked to get
his work peer reviewed. In fact, this is still an open issue nobody
cares about!!!

I remember that many, many years ago some random Hungarian guy came
along and submitted a patch to the collation definition, which got
accepted. It turned out that he had implemented his personal favorite
rather than the standard. Then I had to prove, by scanning pages from
dictionaries, that his sorting order was wrong in order to get it
reverted. (I'm too lazy to look up pointers, sorry.)

I'm pondering... seems to me that in your project if someone comes
along and just changes something without explanation, you accept it;
but if he gives quite a lot of proof about his work's quality then you
ask for even more??

Please take a look at the comments of the new unittest. It links to
the official online version of the Hungarian grammar rules, gives a
short summary about each collation rule (because Hungarian is such a
weird language that you don't have much chance to understand what
Google/Bing Translate says), and copies all the examples from there.
You're free to verify that I've copied them correctly. On top of that,
I've added a whole lot more, which are also explained in comments. Note
that the basic collation rules are explained in the locale definition
file itself; I haven't changed them and they're consistent with what I
say in the unittests.

I'm sorry but I don't know any Hungarian guy who has any insight into
these locale definitions to do a peer review (other than the one who
implemented his personal favorite - I wouldn't trust him).

Seriously, please take a close look at the "meta" bug, the individual
bugs linked from there, and the new unittest itself and the comments
within. Please let me know if they are not convincing enough. I'm not
claiming that the new version is 100% guaranteed to be bug-free
(although I sure hope so). I'm saying it's obviously significantly
better than the previous one, and the unittests provide solid
ground for improving it further without introducing a
regression, should there still be any bugs to fix.

Thanks a lot,
egmont