|Summary:||'sort -u' will erase some Chinese characters|
|Product:||glibc||Reporter:||An Yang <an.euroford>|
|Component:||localedata||Assignee:||GNU C Library Locale Maintainers <libc-locales>|
|Severity:||critical||CC:||arthur200126, bluebat, maiku.fabian|
|Attachments:||example characters in CJK extension A.|
Description An Yang 2011-08-06 17:21:16 UTC
Hi,

According to glibc/localedata/locales/zh_CN and iso14651_t1_pinyin (or iso14651_t1), glibc supports only Unicode 3.0. The current version of Unicode is 6.0, which extends the CJK Unified Ideographs with Extensions A/B/C/D, and Extension A is included in GB18030:2005 (the charset standard for the China locale). So glibc should at least sort all Chinese characters in CJK Unified Ideographs and Extension A (U+3400-U+4DBF).

The visible effect shows up in sort -u. If you execute

sort -u examples_CJK_extensionA.txt

(see attachment), you will get only one Chinese character, "㑗".

Regards,
An Yang
Comment 1 An Yang 2011-08-06 17:24:33 UTC
Created attachment 5880: example characters in CJK extension A.
Comment 2 An Yang 2011-08-07 17:42:44 UTC
I'm not sure whether this bug has any relationship with charmaps; it may or may not. But the value of LC_COLLATE in zh_CN is:

% ISO 14651 collation sequence
LC_COLLATE
copy "iso14651_t1_pinyin"
END LC_COLLATE

I'm sure something is wrong in this table. None of the erased Chinese characters has a record in iso14651_t1_pinyin, but they are included in CJK Unified Ideographs/ExtA/B/C/D.
Comment 3 An Yang 2011-08-08 16:54:28 UTC
There are 25496 Chinese characters in iso14651_t1_pinyin, most of them distributed over CJK Unified Ideographs and CJK Unified Ideographs Extension A. But there are 27552 Chinese characters in CJK Unified Ideographs and Extension A, so more than 2000 Chinese characters without pinyin were lost. My suggestion is simply to add the lost characters at the end of iso14651_t1_pinyin, in Unicode order. Could you give me any feedback?
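To make the suggestion in comment 3 concrete, here is a minimal Python sketch that lists the code points in a given range that have no entry in a collation table. It assumes that each collated character appears at the start of a line as <UXXXX>. The function names and the sample text are hypothetical and are not taken from the real iso14651_t1_pinyin file.

```python
import re

# Assumption: a table entry starts its line with a symbol such as <U3400>.
ENTRY = re.compile(r"^[ \t]*<U([0-9A-Fa-f]{4,8})>", re.MULTILINE)


def defined_code_points(table_text):
    """Return the set of code points that start an entry line."""
    return {int(m.group(1), 16) for m in ENTRY.finditer(table_text)}


def missing_code_points(table_text, ranges):
    """List code points in the inclusive ranges that have no entry."""
    defined = defined_code_points(table_text)
    return [cp
            for low, high in ranges
            for cp in range(low, high + 1)
            if cp not in defined]


# Hypothetical three-line table fragment, for illustration only.
sample = (
    "<U3400> <U3400>;IGNORE;IGNORE;IGNORE\n"
    "% <U3401> is only mentioned in a comment\n"
    "<U3402> <U3402>;IGNORE;IGNORE;IGNORE\n"
)

missing = missing_code_points(sample, [(0x3400, 0x3403)])
print(["U+%04X" % cp for cp in missing])  # ['U+3401', 'U+3403']
```

Running the same function over the real table with the range U+3400-U+4DBF from the description would give the list of characters to append, in Unicode order.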
Comment 4 Wei-Lun Chao 2014-05-07 08:19:11 UTC
Confirmed, as reported in BZ#15616. BZ#16905 takes another approach, but it is untested.
Comment 6 Mingye Wang 2017-01-22 23:56:17 UTC
This bug is seen not only with extA characters, but also with simple punctuation and/or kana.

$ printf '%s\n' ， 。 ： ￥ あ か ア カ a b c , . : $ | LC_COLLATE=zh_CN.UTF-8 sort -u
,
:
.
$
，
a
b
c

(uniq does the same thing.)

It seems that glibc is just eating away anything not on that list. (What kind of equivalence assumption is that?)
Comment 7 Mike FABIAN 2017-07-20 08:01:58 UTC
(In reply to Mingye Wang from comment #6)
> This bug is not only seen with extA characters, but also seen with simple
> punctuations and/or kanas.
>
> $ printf '%s\n' ， 。 ： ￥ あ か ア カ a b c , . : $ | LC_COLLATE=zh_CN.UTF-8 sort -u
> ,
> :
> .
> $
> ，
> a
> b
> c
>
> (uniq does the same thing.)
>
> It seems that glibc is just eating away anything not on that list. (What
> kind of equivalence assumption is that?)

This is caused by the collation symbol UNDEFINED not working correctly, see:
https://sourceware.org/bugzilla/show_bug.cgi?id=18978
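As an illustration only (this is not glibc's implementation), the following Python sketch models the behaviour described in this thread: if every character without an entry in the collation table gets the same weight, all such lines compare equal, and a sort that keeps one line per group of equal lines keeps only the first of them. The weights and the names toy_strcoll and sort_unique are made up, and the position of the undefined weight was chosen only to mirror the output in comment #6.

```python
from functools import cmp_to_key

# Hypothetical collation table: only these characters have their own weight.
KNOWN_WEIGHTS = {",": 0, ":": 1, ".": 2, "$": 3, "a": 5, "b": 6, "c": 7}
# Assumption: every character missing from the table shares this one weight.
UNDEFINED_WEIGHT = 4


def toy_strcoll(a, b):
    """Compare two strings by per-character weight; return -1, 0 or 1."""
    wa = [KNOWN_WEIGHTS.get(ch, UNDEFINED_WEIGHT) for ch in a]
    wb = [KNOWN_WEIGHTS.get(ch, UNDEFINED_WEIGHT) for ch in b]
    return (wa > wb) - (wa < wb)


def sort_unique(lines, compare):
    """Stable sort, then keep the first line of each run of equal lines."""
    ordered = sorted(lines, key=cmp_to_key(compare))
    kept = []
    for line in ordered:
        if not kept or compare(kept[-1], line) != 0:
            kept.append(line)
    return kept


lines = ["，", "。", "：", "￥", "あ", "か", "ア", "カ",
         "a", "b", "c", ",", ".", ":", "$"]
print(sort_unique(lines, toy_strcoll))
# [',', ':', '.', '$', '，', 'a', 'b', 'c']
```

In this model the eight lines that are missing from the table collapse into the single line "，", simply because it comes first in the input.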