Bug 22898

Summary: Some Chinese characters cannot be sorted by adding sorting rules to LC_COLLATE
Product: glibc
Component: locale
Version: 2.27
Status: NEW
Severity: normal
Priority: P2
Target Milestone: ---
Reporter: Mike FABIAN <maiku.fabian>
Assignee: Not yet assigned to anyone <unassigned>
CC: carlos, codonell
Flags: fweimer: security-
Host:
Target:
Build:
Last reconfirmed:
Attachments: 0001-Test-patch-to-show-that-some-Chinese-characters-cann.patch

Description Mike FABIAN 2018-02-26 13:07:54 UTC
Created attachment 10854 [details]
0001-Test-patch-to-show-that-some-Chinese-characters-cann.patch

Some Chinese characters cannot be sorted by adding collation rules to LC_COLLATE.

For example:

𫡅 U+2B845

cannot be sorted but

𠮞 U+20B9E

can be sorted.

The attached patch demonstrates this problem.
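
For reference, here is a small Python sketch (not part of the attached patch; `describe` is a hypothetical helper name) that prints the plane and UTF-8 byte sequence of the two characters. Both lie in plane 2, so being outside the BMP does not by itself distinguish the character that sorts from the one that does not.

```python
def describe(cp):
    """Return basic encoding facts about a Unicode code point."""
    ch = chr(cp)
    return {
        "code_point": f"U+{cp:04X}",
        # Plane number: 0 is the BMP, 2 is the Supplementary Ideographic Plane.
        "plane": cp >> 16,
        # UTF-8 bytes as they appear in the en_GB.UTF-8.in test file.
        "utf8": " ".join(f"{b:02X}" for b in ch.encode("utf-8")),
    }


if __name__ == "__main__":
    for cp in (0x20B9E, 0x2B845):
        print(describe(cp))
```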
Comment 1 Mike FABIAN 2018-02-26 13:13:49 UTC
    diff --git a/localedata/en_GB.UTF-8.in b/localedata/en_GB.UTF-8.in
    new file mode 100644
    index 0000000000..b365767bac
    --- /dev/null
    +++ b/localedata/en_GB.UTF-8.in
    @@ -0,0 +1,10 @@
    +a
    +A
    +ĉ
    +Ĉ
    +𠮞 ; <U00020B9E>
    +𫡅 ; <U0002B845>

So the test file expects U+2B845 to be sorted at this position.

    +b
    +B
    +c
    +C
    diff --git a/localedata/locales/en_GB b/localedata/locales/en_GB
    index 5b895574ac..e114a3a440 100644
    --- a/localedata/locales/en_GB
    +++ b/localedata/locales/en_GB
    @@ -60,6 +60,19 @@ END LC_CTYPE
     LC_COLLATE
     % Copy the template from ISO/IEC 14651
     copy "iso14651_t1"
    +
    +collating-symbol <ccirc>
    +
    +reorder-after <AFTER-A>
    +<ccirc>
    +
    +<U0108> <ccirc>;<BASE>;<CAP>;<U0108>
    +<U0109> <ccirc>;<BASE>;<MIN>;<U0109>
    +<U00020B9E> <ccirc>;<BASE>;<CAP>;<U00020B9E>
    +<U0002B845> <ccirc>;<BASE>;<CAP>;<U0002B845>

Here we have a rule to sort U+2B845 like the collating symbol <ccirc>, which is reordered
after the Latin letter a.

    +
    +reorder-end
    +
     END LC_COLLATE

     LC_MONETARY

But when running "make check" one gets:

    $ grep ^FAIL tests.sum 
    FAIL: localedata/sort-test

And the test output contains:

en_GB.UTF-8 collate-test FAIL
  --- en_GB.UTF-8.in    2018-02-26 10:53:50.810558237 +0100
  +++ /local/mfabian/src/glibc-build/localedata/en_GB.UTF-8.out 2018-02-26 13:36:16.922398151 +0100
  @@ -1,9 +1,9 @@
  +𫡅 ; <U0002B845>
   a
   A
   ĉ
   Ĉ
   𠮞 ; <U00020B9E>
  -𫡅 ; <U0002B845>
   b
   B
   c

So U+20B9E is sorted as expected but U+2B845 is not. U+2B845 is sorted as if there
were no rules at all for this character, so it ends up before a.
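
The same ordering can be checked outside "make check" with a short script. The following is only a sketch, written in Python rather than using the glibc test harness: it assumes the patched en_GB.UTF-8 locale has been compiled and is visible to the process, and `sort_with_locale` is a hypothetical helper name. It sorts whole lines with strxfrm keys under LC_COLLATE and returns None if the locale cannot be loaded.

```python
import locale


def sort_with_locale(lines, locale_name):
    """Sort lines by the LC_COLLATE rules of locale_name.

    Returns None if the locale is not available on this system.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # strxfrm yields keys whose ordinary comparison matches strcoll.
        return sorted(lines, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


if __name__ == "__main__":
    # The lines from the test file in the patch, deliberately shuffled.
    lines = [
        "c", "B", chr(0x2B845) + " ; <U0002B845>", "a", "\u0108",
        chr(0x20B9E) + " ; <U00020B9E>", "b", "A", "\u0109", "C",
    ]
    result = sort_with_locale(lines, "en_GB.UTF-8")
    if result is None:
        print("en_GB.UTF-8 is not available")
    else:
        print("\n".join(result))
```

With the rules from the patch in effect, the line for U+2B845 should come right after the line for U+20B9E; if it is printed first, the bug is reproduced.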
Comment 2 Carlos O'Donell 2018-02-27 16:41:24 UTC
(In reply to Mike FABIAN from comment #0)
> Created attachment 10854 [details]
> 0001-Test-patch-to-show-that-some-Chinese-characters-cann.patch
> 
> Some Chinese characters cannot be sorted by adding collation rules to
> LC_COLLATE.
> 
> For example:
> 
> 𫡅 U+2B845
> 
> cannot be sorted but
> 
> 𠮞 U+20B9E
> 
> can be sorted.
> 
> The attached patch demonstrates this problem.

In the C.UTF-8 work I've found at least 3 more instances like this. Something is wrong either with the parser or with the input the parser expects. I will have to debug this along with the other failures in CJK symbols I've seen when I expand C.UTF-8 to the full code point set.