This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.

[Bug localedata/21547] Tibetan script collation broken (Dzongkha and Tibetan)

From: "maiku.fabian at gmail dot com" <sourceware-bugzilla at sourceware dot org>
To: glibc-bugs at sourceware dot org
Date: Mon, 15 Jan 2018 14:35:05 +0000
Subject: [Bug localedata/21547] Tibetan script collation broken (Dzongkha and Tibetan)
Auto-submitted: auto-generated
References: <bug-21547-131@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=21547

--- Comment #7 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Elie Roux from comment #6)
> Hello Fabian,
> 
> Thanks a lot for your thorough review, that's appreciated!
> 
> I have to say I don't really understand the second part, why would line 30
> causes གཉ to be sorted after གཉྫ ? can you elaborate a little bit?

I am not sure why this happens either. But it seems to happen.
I tested like this:

First, the input test file to be sorted, kept very short so that it contains
only the strings in question:

    mfabian@taka:/local/mfabian/src/glibc (locales *$%)
    $ cat localedata/dz_BT.UTF-8.in.mini
    གཉ
    གཉྫ
    mfabian@taka:/local/mfabian/src/glibc (locales *$%)

Next, I use a very short rule file, at first containing only the rule &གཉ<གཉྫ,
with the other rule commented out:

    mfabian@taka:/local/mfabian/src/glibc (locales *$%)
    $ cat rules-mini.txt
    &གཉ<གཉྫ
    #&ཉ<<ྋྙ<གཉ<མཉ<རྙ=ཪྙ<སྙ<བརྙ=བཪྙ<བསྙ
    mfabian@taka:/local/mfabian/src/glibc (locales *$%)

Now I sort using my small test program ~/bin/icu-collation-test.py
(I’ll attach it in the next comment):

    mfabian@taka:/local/mfabian/src/glibc (locales *$%)
    $ ~/bin/icu-collation-test.py -r rules-mini.txt -i localedata/dz_BT.UTF-8.in.mini -o /tmp/dz_BT.UTF-8.out
    mfabian@taka:/local/mfabian/src/glibc (locales *$%)
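
For readers who do not have the attachment at hand, a test program of this kind could look roughly like the sketch below. This is not the attached icu-collation-test.py, only a minimal approximation. It assumes the third-party PyICU package is installed, and the function names (sort_lines, icu_sort_key, main) are made up for illustration. It also assumes that the rule file can be handed to ICU unchanged, relying on ICU's rule syntax to treat a leading # as a comment.

```python
import sys


def sort_lines(lines, key):
    """Return the lines sorted by the given sort-key function."""
    return sorted(lines, key=key)


def icu_sort_key(rules):
    """Build a sort-key function from ICU tailoring rules.

    Assumption: the third-party PyICU package is available.
    """
    import icu  # imported lazily so the rest of the file works without PyICU

    collator = icu.RuleBasedCollator(rules)
    return collator.getSortKey


def main(rules_path, in_path, out_path):
    # Read the tailoring rules as one string; comment handling is left to ICU.
    with open(rules_path, encoding="utf-8") as f:
        rules = f.read()
    # Read the strings to be sorted, one per line.
    with open(in_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    # Write the sorted strings, one per line.
    with open(out_path, "w", encoding="utf-8") as f:
        for line in sort_lines(lines, icu_sort_key(rules)):
            f.write(line + "\n")


if __name__ == "__main__" and len(sys.argv) == 4:
    # Hypothetical usage: script.py RULES INPUT OUTPUT
    main(*sys.argv[1:])
```

Unlike the real script, this sketch takes three positional arguments instead of the -r, -i and -o options.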

And check the result:

    mfabian@taka:/local/mfabian/src/glibc (locales *$%)
    $ diff -u /local/mfabian/src/glibc/localedata/dz_BT.UTF-8.in.mini /tmp/dz_BT.UTF-8.out
    mfabian@taka:/local/mfabian/src/glibc (locales *$%)

No difference between input and output, i.e. གཉ still comes before གཉྫ
in dz_BT.UTF-8.out.

Now I remove the comment marker in front of the second line in rules-mini.txt:

    mfabian@taka:/local/mfabian/src/glibc (locales *$%)
    $ cat rules-mini.txt
    &གཉ<གཉྫ
    &ཉ<<ྋྙ<གཉ<མཉ<རྙ=ཪྙ<སྙ<བརྙ=བཪྙ<བསྙ
    mfabian@taka:/local/mfabian/src/glibc (locales *$%)

And sort again:

    mfabian@taka:/local/mfabian/src/glibc (locales *$%)
    $ ~/bin/icu-collation-test.py -r rules-mini.txt -i localedata/dz_BT.UTF-8.in.mini -o /tmp/dz_BT.UTF-8.out
    mfabian@taka:/local/mfabian/src/glibc (locales *$%)

Checking the result:

    mfabian@taka:/local/mfabian/src/glibc (locales *$%)
    $ diff -u /local/mfabian/src/glibc/localedata/dz_BT.UTF-8.in.mini /tmp/dz_BT.UTF-8.out
    --- /local/mfabian/src/glibc/localedata/dz_BT.UTF-8.in.mini 2018-01-15 15:21:59.357477414 +0100
    +++ /tmp/dz_BT.UTF-8.out    2018-01-15 15:26:12.266632745 +0100
    @@ -1,2 +1,2 @@
    -གཉ
     གཉྫ
    +གཉ
    mfabian@taka:/local/mfabian/src/glibc (locales *$%)
    $

Now the order is reversed, གཉ comes after གཉྫ.

The same happened to me while I was implementing the rules for glibc
and testing the sort order with glibc. I found this very confusing and
thought I might have made a mistake implementing the rules the glibc
way. But then I tested with the small Python 3 program above, which
uses ICU, and found that it behaves the same way.
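
The glibc side of such a comparison can be sketched in a few lines of Python as well, since locale.strxfrm calls into the C library's collation. This is only an illustration: the helper name sort_with_locale is made up, and sorting with dz_BT.UTF-8 assumes that this locale has been compiled and installed from the localedata sources under test.

```python
import locale


def sort_with_locale(lines, locale_name):
    """Sort lines using the C library's collation for locale_name."""
    previous = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
    try:
        # Raises locale.Error if the locale is not installed.
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(lines, key=locale.strxfrm)
    finally:
        # Restore the previous collation setting.
        locale.setlocale(locale.LC_COLLATE, previous)
```

Calling sort_with_locale(lines, "dz_BT.UTF-8") on the two test strings would then show which order the installed glibc locale produces, for comparison with the ICU result.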
