This is the mail archive of the libc-locales@sourceware.org mailing list for the GNU libc locales project.


Re: [PATCH][BZ 18934] hu_HU: Fix multiple sorting bugs.

From: Egmont Koblinger <egmont at gmail dot com>
To: Luis Javier Merino <ninjalj at gmail dot com>
Cc: "Carlos O'Donell" <carlos at redhat dot com>, libc-locales <libc-locales at sourceware dot org>
Date: Thu, 2 Feb 2017 01:04:03 +0100
Subject: Re: [PATCH][BZ 18934] hu_HU: Fix multiple sorting bugs.
Authentication-results: sourceware.org; auth=none
References: <CAGWcZkLbhdJWRZLDKHXrHf2875pKLushYJon7YusGu=zhpO7mQ@mail.gmail.com> <CAGWcZkLsGUcfmw6X4VT7sWZX5juh5WFkJe=ChV+K2myjDmbuEA@mail.gmail.com> <20160421061349.GM5369@vapier.lan> <CAGWcZkK=UXGDEG6moxcyG9PJz4D=V=kVR6G1u=uhSFqgu+m+oA@mail.gmail.com> <CAGWcZkLyq8XJ5utRbZ6A58BhpdZdhrAi7m-TGa_W367ymKo+4g@mail.gmail.com> <bdc2fe9f-6700-99ae-aae9-87be54b8911e@redhat.com> <CAGWcZkJ_m26UxF=+P7U-Kdw+6msvTE_e=TQNOt-_F1zihjheAQ@mail.gmail.com> <CABjvSdgNJRDUNBOExm9=Sgyydre4jLyV9hdp+6Z-kom-y9jKOw@mail.gmail.com>

Hi Luis, others,

TLDR:

Nice investigation from someone not speaking our language, thumbs up :)

Your observations are all correct. I'm extending them with more
examples and explanation. Note that my patch does not change the
intent of how tokenization should happen; basically whatever you show is
what is meant to be implemented now. There are just bugs in its
implementation which I'm fixing (and, as you point out, there'll still
remain bugs due to not adding a dictionary for the ambiguous cases).

The differences between ICU and my version: Yup, they are not
specified in the standard, and are about artificially made up strings
that don't occur in Hungarian text. I give some rationale why I chose
the way I chose, but ICU is not wrong at all either.


Long version:

> I did some investigation of Hungarian collation for a code golf at
> http://codegolf.stackexchange.com/a/75599/267
>
> Hungarian has digraphs and trigraphs (cs, dz, dzs, gy, ly, ny, sz, ty,
> zs). It also has geminated (long) consonants, which are represented by
> writing the consonant twice. In the case of digraphs/trigraphs, they
> can be written in a long (duplicate the whole digraph/trigraph) and
> short form (duplicate only the first consonant of the
> digraph/trigraph).

This is absolutely correct. (I'm also happy to learn the proper
terminology from you.)

Note: It's not up to the writer to freely choose between the two
forms. The long form must be used at compound word boundaries, e.g.
(see both in your stackexchange page and in my unittests) "nagy" [big]
+ "gyakorlat" [exercise] becomes "nagygyakorlat". The shorthand form
must be used otherwise, e.g. "naggyal" [with big], the agglutinative
suffix ("gyal" [with] in this particular case) does not count as a
separate word to form a compound word with.

> Not all occurrences of the consonants in a digraph/trigraph represent
> a digraph/trigraph, e.g: in házszám zs doesn't represent a digraph,
> but sz does. This means you need a dictionary or similar to get a
> (nearly) fully correct collation. IIRC, LibreOffice uses libhnj, which
> uses rules derived from a dictionary.

Again, this is correct. Combinations such as "zsz" require knowing the
language to tell whether it's z+sz or zs+z. Someone not speaking the
language would have about a 50-50 chance of guessing it right.

Even simple digraphs are ambiguous and require knowing the language,
e.g. the words "pácsó" or "malacsült" are compound words with the
boundary between c and s, so that is not a cs digraph.

Another interesting ambiguous case is "ssz", this can stand for s+sz
or sz+sz. Example: "karosszék" [armchair] is "kar" [arm] -> "karos"
[something with an arm] + "szék" [chair], hence it's s+sz. For
"karosszéria" [car's body/chassis] one could think it's coming from
"karos" + "széria" [series], but this doesn't make any sense. It's
probably coming from Italian "carrozzeria", hence it's sz+sz.
(Obviously the pronunciation is also different in these two cases.)

The current implementation is eager: it always tries to combine as many
glyphs as possible to form a short or long digraph or trigraph. As
such, it erroneously tokenizes "házszám" as h+á+zs+z+á+m instead of
h+á+z+sz+á+m, "pácsó" as p+á+cs+ó instead of p+á+c+s+ó, etc. This has
been the case (except for bugs maybe, so let's rather say this has
been the clear intent) probably even before I first touched the locale
definition, and is still the intent.
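
A minimal Python sketch of the eager tokenization described above. It
is not the glibc locale definition: the table construction, the
longest-match loop and the names MULTIGRAPHS, TABLE and tokenize are
assumptions made for illustration, and it expects precomposed (NFC)
accented letters.

```python
# Hypothetical sketch of eager (longest-match) tokenization; not glibc's code.
MULTIGRAPHS = ["dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs"]


def build_table():
    """Map each written form to the list of letters it stands for."""
    table = {}
    for g in MULTIGRAPHS:
        table[g] = [g]             # plain digraph/trigraph, e.g. "sz"
        table[g + g] = [g, g]      # long geminated form, e.g. "szsz"
        table[g[0] + g] = [g, g]   # short geminated form, e.g. "ssz"
    return table


TABLE = build_table()
MAX_LEN = max(len(form) for form in TABLE)  # longest form is "dzsdzs"


def tokenize(word):
    """Greedily consume the longest known written form at each position."""
    word = word.lower()
    tokens, i = [], 0
    while i < len(word):
        for n in range(MAX_LEN, 0, -1):
            chunk = word[i:i + n]
            if chunk in TABLE:
                tokens.extend(TABLE[chunk])
                i += n
                break
        else:
            tokens.append(word[i])  # ordinary single letter
            i += 1
    return tokens


print(tokenize("házszám"))   # ['h', 'á', 'zs', 'z', 'á', 'm']
print(tokenize("pácsó"))     # ['p', 'á', 'cs', 'ó']
print(tokenize("naggyal"))   # ['n', 'a', 'gy', 'gy', 'a', 'l']
```

The first two outputs are the mistokenizations mentioned above; a
dictionary of exceptions would be needed to get z+sz and c+s instead.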

I have no plans to add a dictionary of exception words to the glibc
locale definition, nor to analyze the frequencies (e.g. probably "zsz"
more often stands for z+sz than zs+z, yet glibc goes for the latter).
The current rules are good enough in the sense that they mistokenize
only a tiny, almost negligible subset of words, and even when they do,
the chance of this resulting in swapping order with another word is
even much smaller.

> These are the differences I noticed between Egmont's testsuite and ICU:
>
>  - Egmont collates the short forms before the full forms (ssz < szsz,
> ..., zzs < zszs ), ICU collates the long forms before the short forms
> starting at L3 Case and Variants (szsz <3 ssz, ..., zszs <3 zzs ). I
> don't think that is specified in the grammar rules, but I can't read
> Hungarian.

(I'm not sure what ICU is or how it relates to CLDR. Never mind.)

You are correct that this is not specified in the orthography rules.
This is probably because there are no actual words that do make sense
with both ways of spelling.

It would cause problems if e.g. someone invented a new word
"karosszéria" meaning [series with arms]. It could also have caused
problems for a year after the release of the 12th version of the rules
with "ész" [mind] + "szerű" [-like, -ish]: under the previous standard
this was spelled "ésszerű" [rational, reasonable] ("szerű" used to be
considered an agglutinative suffix), but it is now spelled "észszerű"
("szerű" is now considered a standalone word, so this has become a
compound word). For a year both the old and the new versions of the
standard were valid. That year has already elapsed, making the
previous spelling incorrect.

In an earlier version of glibc, ssz and szsz used to collate equally,
causing problems for some users and some software. See bug 13547, with
a further pointer to where I found this problem being reported. The
reporter ran sort and uniq on files that contained lines similar to
ZZSZSSPKKPKP and found that uniq removed some unique lines. (This is
where my fix unfortunately introduced a regression which I'm also
fixing now.)
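
A small Python sketch of why equal collation of ssz and szsz can make a
uniq-like step drop distinct lines. The key function old_key is a
hypothetical stand-in for the earlier behavior, not what glibc actually
computed, and the sample lines are made up.

```python
from itertools import groupby


def old_key(line):
    # Hypothetical stand-in: the long form "szsz" compares equal to "ssz".
    return line.lower().replace("szsz", "ssz")


def fixed_key(line):
    # Same key, with the raw text as a tie-break so distinct lines differ.
    return (old_key(line), line)


def uniq(lines, key):
    """Keep the first line of each run of adjacent lines with equal keys."""
    return [next(group) for _, group in groupby(lines, key=key)]


lines = ["XSSZY", "XSZSZY", "XSSZY"]

print(uniq(sorted(lines, key=old_key), old_key))      # ['XSSZY']
print(uniq(sorted(lines, key=fixed_key), fixed_key))  # ['XSSZY', 'XSZSZY']
```

With the first key the distinct line XSZSZY disappears; any tie-break
that makes different strings compare unequal avoids this.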

I decided on ssz < szsz along this reasoning: I was thinking that as
per "karosszék" vs. "karosszéria" above, if you can't tell for sure
whether it's s+sz or sz+sz, let's sort in between the two that are
known for sure. To prove my point, let's make up two compound words:
"kés" [knife] + "szerű" and "kész" [ready] + "szerű". The correct
tokenization is obviously s+sz and sz+sz respectively, and hence this
is the required alphabetical ordering. My new version tokenizes both
as sz+sz, yet a weaker property ends up sorting them correctly. It
breaks, however, as soon as you continue the first word with another
agglutinative suffix. ICU sorts them incorrectly right away. Since
we're already in the gray zone of ambiguous, easily mistokenizable
words and rare, artificially constructed examples, I cannot say that
ICU's approach is wrong at all.
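
The reasoning above can be sketched in Python as a two-level sort key:
letters first, written form (short before long) as a weaker tie-break.
This is an assumption-laden illustration, not the locale definition:
ALPHABET treats accented vowels as fully distinct letters (real
collation does not), sort_key and FORMS are made-up names, and
"késszerűek" (adding a plural suffix) is an artificial example.

```python
# Hypothetical two-level key: letters first, written form as tie-break.
MULTIGRAPHS = ["dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs"]
ALPHABET = ("a á b c cs d dz dzs e é f g gy h i í j k l ly m n ny o ó ö ő "
            "p q r s sz t ty u ú ü ű v w x y z zs").split()
RANK = {letter: i for i, letter in enumerate(ALPHABET)}

# written form -> (letters it stands for, form flag)
FORMS = {}
for g in MULTIGRAPHS:
    FORMS[g] = ([g], None)          # not geminated
    FORMS[g[0] + g] = ([g, g], 0)   # short form, e.g. "ssz", sorts first
    FORMS[g + g] = ([g, g], 1)      # long form, e.g. "szsz", sorts second


def sort_key(word):
    """Return (letter ranks, short/long flags) using eager tokenization."""
    word = word.lower()
    primary, forms, i = [], [], 0
    while i < len(word):
        for n in range(6, 0, -1):   # 6 = length of the longest form
            hit = FORMS.get(word[i:i + n])
            if hit:
                letters, form = hit
                primary += [RANK[letter] for letter in letters]
                if form is not None:
                    forms.append(form)
                i += n
                break
        else:
            # Unknown characters sort after the alphabet in this sketch.
            primary.append(RANK.get(word[i], len(ALPHABET)))
            i += 1
    return (primary, forms)


# Both words tokenize as k+é+sz+sz+e+r+ű; the tie-break orders them.
print(sorted(["készszerű", "késszerű"], key=sort_key))
# ['késszerű', 'készszerű']

# A longer first word is decided by the letters alone, so it sorts last.
print(sorted(["késszerűek", "készszerű"], key=sort_key))
# ['készszerű', 'késszerűek']
```

The second result shows the breakage: the kés-based word should still
come first, but the tie-break is never consulted once the letter
sequences differ.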

>  - ICU treats weirdly capitalized groups as
> non-contractions/non-digraphs/non-trigraphs, e.g: ccS <3 CcS <3 cCs <3
> cCS <3 CCs <3 cS <3 cs <3 Cs <3 CS <3 ccs <3 Ccs <3 CCS

Yet again an absolutely forced corner case that does not happen with real words.

The official rules [akh12 - link below] bullet points 3 and 8 show
that in all-uppercase context all the letters of digraphs/trigraphs
become uppercase. Abbreviations and similar constructs are detailed in
276-289. With very few exceptions, the examples (in bold) are either
all lowercase, or an initial uppercase followed by all lowercase, or
all uppercase (at least up to the hyphen which attaches agglutinative
suffixes). The few exceptions are units (e.g. kB, kWh), ÉNy DNy
[northwest, southwest], that's pretty much it. So "weirdly capitalized
groups", as you say, really don't matter in practice.

I see the rationale in what ICU does, but it also raises some questions.
E.g. no Hungarian word starts with a geminated (long) consonant. So then
shouldn't "Ccs" be tokenized as C+c+s? Or C+cs? Whereas "CCS" and
"ccs" should still be tokenized as CS+CS or cs+cs because those can
easily appear in the middle of words.

Again it's a gray zone, I went for the one that's simpler, technically
cleaner, provides a nicer structure in the definition file as well as
the tests etc, but I can't say the other approach is wrong.

Again, it's not specified in the standard, is of marginal (if any)
practical use, and I did not conceptually change glibc's behavior,
just fixed bugs/inconsistencies in its previous implementations.


Note: You've covered the collation of consonants, not the vowels.
That's another (simpler, almost unambiguous) story.


Thanks a lot,
egmont

[akh12] http://helyesiras.mta.hu/helyesiras/default/akh12

Follow-Ups:
- Re: [PATCH][BZ 18934] hu_HU: Fix multiple sorting bugs.
  - From: Carlos O'Donell

References:
- Re: [PATCH][BZ 18934] hu_HU: Fix multiple sorting bugs.
  - From: Egmont Koblinger
- Re: [PATCH][BZ 18934] hu_HU: Fix multiple sorting bugs.
  - From: Carlos O'Donell
- Re: [PATCH][BZ 18934] hu_HU: Fix multiple sorting bugs.
  - From: Egmont Koblinger
- Re: [PATCH][BZ 18934] hu_HU: Fix multiple sorting bugs.
  - From: Luis Javier Merino
