This is the mail archive of the
glibc-bugs@sourceware.org
mailing list for the glibc project.
[Bug localedata/21547] Tibetan script collation broken (Dzongkha and Tibetan)
- From: "maiku.fabian at gmail dot com" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sourceware dot org
- Date: Mon, 15 Jan 2018 14:35:05 +0000
- Subject: [Bug localedata/21547] Tibetan script collation broken (Dzongkha and Tibetan)
- Auto-submitted: auto-generated
- References: <bug-21547-131@http.sourceware.org/bugzilla/>
https://sourceware.org/bugzilla/show_bug.cgi?id=21547
--- Comment #7 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Elie Roux from comment #6)
> Hello Fabian,
>
> Thanks a lot for your thorough review, that's appreciated!
>
> I have to say I don't really understand the second part, why would line 30
> causes གཉ to be sorted after གཉྫ ? can you elaborate a little bit?
I am not sure why this happens either. But it seems to happen.
I tested like this:
First, the input test file to be sorted, kept very short so that it contains
only the strings in question:
mfabian@taka:/local/mfabian/src/glibc (locales *$%)
$ cat localedata/dz_BT.UTF-8.in.mini
གཉ
གཉྫ
mfabian@taka:/local/mfabian/src/glibc (locales *$%)
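To make clear what the two test strings look like at the code point level, here is a minimal Python 3 sketch using only the standard library. The names SHORT, LONG and describe are made up for illustration and are not part of any glibc or ICU tooling. It only shows that the longer string is the shorter one plus one trailing character, so plain code point order puts the shorter string first:

```python
import unicodedata

# The two strings from dz_BT.UTF-8.in.mini, copied verbatim.
SHORT = "གཉ"
LONG = "གཉྫ"


def describe(s):
    """Return one 'U+XXXX NAME' entry per code point in s."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in s]


for label, text in (("short", SHORT), ("long", LONG)):
    print(label, describe(text))

# The longer string starts with the shorter one, so an untailored
# code point sort keeps the shorter string first.
print(LONG.startswith(SHORT))
print(sorted([LONG, SHORT]) == [SHORT, LONG])
```

This says nothing about what a tailored collator does with these strings; it is only the baseline against which the results below can be compared.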
And I use a very short rule file, at first containing only &གཉ<གཉྫ, with
the other rule commented out:
mfabian@taka:/local/mfabian/src/glibc (locales *$%)
$ cat rules-mini.txt
&གཉ<གཉྫ
#&ཉ<<ྋྙ<གཉ<མཉ<རྙ=ཪྙ<སྙ<བརྙ=བཪྙ<བསྙ
mfabian@taka:/local/mfabian/src/glibc (locales *$%)
Now I sort using my small test program ~/bin/icu-collation-test.py
(I’ll attach it in the next comment):
mfabian@taka:/local/mfabian/src/glibc (locales *$%)
$ ~/bin/icu-collation-test.py -r rules-mini.txt -i localedata/dz_BT.UTF-8.in.mini -o /tmp/dz_BT.UTF-8.out
mfabian@taka:/local/mfabian/src/glibc (locales *$%)
And check the result:
mfabian@taka:/local/mfabian/src/glibc (locales *$%)
$ diff -u /local/mfabian/src/glibc/localedata/dz_BT.UTF-8.in.mini /tmp/dz_BT.UTF-8.out
mfabian@taka:/local/mfabian/src/glibc (locales *$%)
No difference between input and output, i.e. གཉ is still before གཉྫ
in dz_BT.UTF-8.out.
Now I remove the comment in front of the second line in rules-mini.txt:
mfabian@taka:/local/mfabian/src/glibc (locales *$%)
$ cat rules-mini.txt
&གཉ<གཉྫ
&ཉ<<ྋྙ<གཉ<མཉ<རྙ=ཪྙ<སྙ<བརྙ=བཪྙ<བསྙ
mfabian@taka:/local/mfabian/src/glibc (locales *$%)
And sort again:
mfabian@taka:/local/mfabian/src/glibc (locales *$%)
$ ~/bin/icu-collation-test.py -r rules-mini.txt -i localedata/dz_BT.UTF-8.in.mini -o /tmp/dz_BT.UTF-8.out
mfabian@taka:/local/mfabian/src/glibc (locales *$%)
Checking the result:
mfabian@taka:/local/mfabian/src/glibc (locales *$%)
$ diff -u /local/mfabian/src/glibc/localedata/dz_BT.UTF-8.in.mini /tmp/dz_BT.UTF-8.out
--- /local/mfabian/src/glibc/localedata/dz_BT.UTF-8.in.mini 2018-01-15 15:21:59.357477414 +0100
+++ /tmp/dz_BT.UTF-8.out 2018-01-15 15:26:12.266632745 +0100
@@ -1,2 +1,2 @@
-གཉ
གཉྫ
+གཉ
mfabian@taka:/local/mfabian/src/glibc (locales *$%)
$
Now the order is reversed, གཉ comes after གཉྫ.
The same happened to me while I was implementing the rules for glibc
and testing the sorting with glibc. I found this very confusing and thought
I might have made a mistake implementing the rules in the glibc
way. But then I tested with the above small Python3 program using ICU
and found that it behaves the same way.
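For anyone who wants to repeat the experiment, here is a minimal sketch of what such an ICU-based test could look like. It is not the attached icu-collation-test.py; the function names strip_comments, sort_lines and run are hypothetical. It assumes the third-party PyICU package is installed and uses its RuleBasedCollator and getSortKey calls. Dropping lines that start with '#' before handing the rules to ICU is also an assumption about how the commented-out rule is handled:

```python
def strip_comments(rule_text):
    """Drop blank lines and lines starting with '#' from a rule file's text."""
    kept = [
        line
        for line in rule_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    return "\n".join(kept)


def sort_lines(lines, rules):
    """Sort lines with an ICU collator tailored by the given rule string."""
    import icu  # PyICU (third-party); imported here so the helper above works without it

    collator = icu.RuleBasedCollator(rules)
    return sorted(lines, key=collator.getSortKey)


def run(rules_path, input_path, output_path):
    """Read rules and input, sort the input, and write the sorted lines."""
    with open(rules_path, encoding="utf-8") as f:
        rules = strip_comments(f.read())
    with open(input_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    with open(output_path, "w", encoding="utf-8") as f:
        for line in sort_lines(lines, rules):
            f.write(line + "\n")
```

Calling run("rules-mini.txt", "localedata/dz_BT.UTF-8.in.mini", "/tmp/dz_BT.UTF-8.out") once with the second rule commented out and once with it active, followed by diff -u on input and output, would correspond to the two runs shown above.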