Summary: | 'sort -u' will erase some Chinese characters | ||
---|---|---|---|
Product: | glibc | Reporter: | An Yang <an.euroford> |
Component: | localedata | Assignee: | GNU C Library Locale Maintainers <libc-locales> |
Status: | NEW --- | ||
Severity: | critical | CC: | arthur200126, bluebat, maiku.fabian |
Priority: | P2 | Flags: | fweimer: security- |
Version: | unspecified | ||
Target Milestone: | --- | ||
Host: | Target: | ||
Build: | Last reconfirmed: | ||
Attachments: | example characters in CJK extension A. |
Description
An Yang
2011-08-06 17:21:16 UTC
Created attachment 5880
example characters in CJK extension A.
I'm not sure whether this bug has any relationship with charmaps; it may or may not. But the value of LC_COLLATE in zh_CN is:

    % ISO 14651 collation sequence
    LC_COLLATE
    copy "iso14651_t1_pinyin"
    END LC_COLLATE

I'm sure something is wrong in this table. None of the erased Chinese characters has a record in iso14651_t1_pinyin, but they are all included in CJK Unified Ideographs/ExtA/B/C/D.

There are 25496 Chinese characters in iso14651_t1_pinyin, most of them in CJK Unified Ideographs and CJK Unified Ideographs Extension A. But there are 27552 Chinese characters in CJK Unified Ideographs and Extension A, so more than 2000 Chinese characters without pinyin were lost.

So my suggestion is to simply add the lost characters at the end of iso14651_t1_pinyin, in Unicode order.

Could you give me any feedback?

as BZ#15616 report confirmed. BZ#16905 is another approach but untested.

This bug is not only seen with ExtA characters, but also with simple punctuation and/or kana.

    $ printf '%s\n' , 。 : ¥ あ か ア カ a b c , . : $ | LC_COLLATE=zh_CN.UTF-8 sort -u
    ,
    :
    .
    $
    ,
    a
    b
    c

(uniq does the same thing.)

It seems that glibc is just eating away anything not on that list. (What kind of equivalence assumption is that?)

(In reply to Mingye Wang from comment #6)
> This bug is not only seen with ExtA characters, but also with simple
> punctuation and/or kana.
>
> $ printf '%s\n' , 。 : ¥ あ か ア カ a b c , . : $ | LC_COLLATE=zh_CN.UTF-8 sort -u
> ,
> :
> .
> $
> ,
> a
> b
> c
>
> (uniq does the same thing.)
>
> It seems that glibc is just eating away anything not on that list. (What
> kind of equivalence assumption is that?)

This is caused by the collation symbol UNDEFINED not working correctly, see:
https://sourceware.org/bugzilla/show_bug.cgi?id=18978
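For anyone who wants to check whether their own input is affected, here is a minimal sketch. It counts how many lines survive `sort -u` under a given locale, so the result for the C locale (plain byte comparison) can be compared with the result for zh_CN.UTF-8. The helper name `count_unique` is made up for illustration. The sketch sets LC_ALL rather than LC_COLLATE so that no other locale variable overrides it, and it assumes the zh_CN.UTF-8 locale is installed; if it is not, sort will most likely fall back to C collation and the two counts will match.

```shell
#!/bin/sh
# count_unique LOCALE
# Hypothetical helper: reads lines on stdin and prints how many of them
# survive `sort -u` when collation is done under LOCALE.
count_unique() {
    LC_ALL="$1" sort -u | wc -l | tr -d ' '
}

# Seven distinct lines. Under C they compare byte by byte,
# so all seven are kept.
printf '%s\n' あ か ア カ a b c | count_unique C

# Same input under the locale from this report. A smaller number here
# means sort treated some distinct lines as equal and dropped them.
printf '%s\n' あ か ア カ a b c | count_unique zh_CN.UTF-8
```

If the second number is smaller than the first, the input contains characters that the locale's collation table treats as equal, which matches the behaviour described in the comments above.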