Created attachment 10854
0001-Test-patch-to-show-that-some-Chinese-characters-cann.patch

Some Chinese characters cannot be sorted by adding collation rules to LC_COLLATE.

For example:

𫡅 U+2B845

cannot be sorted but

𠮞 U+20B9E

can be sorted.

The attached patch demonstrates this problem.
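For reference, the two characters can be inspected with a short Python sketch. This is not part of the attached patch, and the helper name `describe` is made up for illustration. It only shows that both code points lie in the same Unicode plane and need the same number of UTF-8 bytes and UTF-16 code units, so none of these properties separates the character that sorts from the one that does not.

```python
def describe(cp):
    """Return (plane, UTF-8 byte length, UTF-16 code unit count) for a code point."""
    ch = chr(cp)
    plane = cp >> 16                                # 0 = BMP, 2 = Supplementary Ideographic Plane
    utf8_len = len(ch.encode("utf-8"))              # number of bytes in UTF-8
    utf16_units = len(ch.encode("utf-16-le")) // 2  # 2 means a surrogate pair
    return plane, utf8_len, utf16_units


# The two characters from this report.
for cp in (0x20B9E, 0x2B845):
    print(f"U+{cp:05X}", describe(cp))
```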
diff --git a/localedata/en_GB.UTF-8.in b/localedata/en_GB.UTF-8.in
new file mode 100644
index 0000000000..b365767bac
--- /dev/null
+++ b/localedata/en_GB.UTF-8.in
@@ -0,0 +1,10 @@
+a
+A
+ĉ
+Ĉ
+𠮞 ; <U00020B9E>
+𫡅 ; <U0002B845>

So the test file expects U+2B845 to be sorted at this position.

+b
+B
+c
+C
diff --git a/localedata/locales/en_GB b/localedata/locales/en_GB
index 5b895574ac..e114a3a440 100644
--- a/localedata/locales/en_GB
+++ b/localedata/locales/en_GB
@@ -60,6 +60,19 @@ END LC_CTYPE
 LC_COLLATE
 % Copy the template from ISO/IEC 14651
 copy "iso14651_t1"
+
+collating-symbol <ccirc>
+
+reorder-after <AFTER-A>
+<ccirc>
+
+<U0108> <ccirc>;<BASE>;<CAP>;<U0108>
+<U0109> <ccirc>;<BASE>;<MIN>;<U0109>
+<U00020B9E> <ccirc>;<BASE>;<CAP>;<U00020B9E>
+<U0002B845> <ccirc>;<BASE>;<CAP>;<U0002B845>

Here we have a rule to sort U+2B845 like the collating symbol <ccirc>, which is reordered after the Latin letter a.

+
+reorder-end
+
 END LC_COLLATE
 
 LC_MONETARY

But when running "make check" one gets:

$ grep ^FAIL tests.sum
FAIL: localedata/sort-test

And the test output contains:

en_GB.UTF-8 collate-test FAIL
--- en_GB.UTF-8.in 2018-02-26 10:53:50.810558237 +0100
+++ /local/mfabian/src/glibc-build/localedata/en_GB.UTF-8.out 2018-02-26 13:36:16.922398151 +0100
@@ -1,9 +1,9 @@
+𫡅 ; <U0002B845>
 a
 A
 ĉ
 Ĉ
 𠮞 ; <U00020B9E>
-𫡅 ; <U0002B845>
 b
 B
 c

So U+20B9E is sorted as expected but U+2B845 is not. U+2B845 is sorted as if there were no rules at all for this character. Therefore, it ends up before a.
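The same check can be approximated outside the glibc build tree. The following is a minimal Python sketch, not the actual collate-test program: the names `first_misplaced`, `check_locale` and `EXPECTED` are made up for illustration, and it assumes that a locale compiled from the patched en_GB source is installed under the name passed to `check_locale`.

```python
import locale


def first_misplaced(expected, key=None):
    """Sort `expected` with `key` and return (index, wanted, got) for the
    first position where the result differs from the given order, or None
    if the given order is already the sorted order."""
    actual = sorted(expected, key=key)
    for index, (want, got) in enumerate(zip(expected, actual)):
        if want != got:
            return index, want, got
    return None


def check_locale(name, expected):
    """Check `expected` against the collation rules of locale `name`.
    Raises locale.Error if that locale is not installed."""
    locale.setlocale(locale.LC_COLLATE, name)
    return first_misplaced(expected, key=locale.strxfrm)


# The lines of en_GB.UTF-8.in from the patch, in the order the test expects.
EXPECTED = [
    "a", "A", "ĉ", "Ĉ",
    "\U00020B9E ; <U00020B9E>",
    "\U0002B845 ; <U0002B845>",
    "b", "B", "c", "C",
]

if __name__ == "__main__":
    try:
        print(check_locale("en_GB.UTF-8", EXPECTED))
    except locale.Error:
        print("locale en_GB.UTF-8 is not installed")
```

A result of None means the lines are already in collation order. Any other result names the first line that is out of place.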
(In reply to Mike FABIAN from comment #0)
> Created attachment 10854
> 0001-Test-patch-to-show-that-some-Chinese-characters-cann.patch
>
> Some Chinese characters cannot be sorted by adding collation rules to
> LC_COLLATE.
>
> For example:
>
> 𫡅 U+2B845
>
> cannot be sorted but
>
> 𠮞 U+20B9E
>
> can be sorted.
>
> The attached patch demonstrates this problem.

In the C.UTF-8 work I've found at least 3 more instances like this.

Something is wrong with the parser or with the input expected by the parser. I will have to debug this along with the other failures in CJK symbols I've seen when I expand C.UTF-8 to the full code point set.