Bug 13063

Summary:	'sort -u' will erase some Chinese characters
Product:	glibc	Reporter:	An Yang <an.euroford>
Component:	localedata	Assignee:	GNU C Library Locale Maintainers <libc-locales>
Status:	NEW ---
Severity:	critical	CC:	arthur200126, bluebat, maiku.fabian
Priority:	P2	Flags:	fweimer: security-
Version:	unspecified
Target Milestone:	---
Host:		Target:
Build:		Last reconfirmed:
Attachments:	example characters in CJK extension A.

Description An Yang 2011-08-06 17:21:16 UTC

Hi,

According to glibc/localedata/locales/zh_CN and iso14651_t1_pinyin (or
iso14651_t1), glibc only supports Unicode 3.0.

The current version of Unicode is 6.0. It extends CJK UNIFIED IDEOGRAPHS with
extensions A/B/C/D, and extension A is included in GB18030:2005 (the Chinese
national charset standard).

So, at the very least, glibc should sort all Chinese characters in CJK UNIFIED IDEOGRAPHS and EXTENSION A (U+3400-U+4DBF).

The visible effect shows up in sort -u.
If you run sort -u examples_CJK_extensionA.txt (see attachment), you
will get only one Chinese character, "㑗".
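
One rough way to quantify the effect is to count how many lines survive sort -u under the affected locale and under the C locale, which compares raw bytes and so never merges distinct lines. The sketch below is only an illustration: the helper name unique_count is hypothetical, and it assumes the zh_CN.UTF-8 locale is installed (if it is not, the comparison is not meaningful).

```shell
# Print the number of lines of file $2 that survive `sort -u` under locale $1.
# LC_ALL is used so that any LC_ALL already set in the environment cannot
# override the collation locale being tested.
unique_count() {
    LC_ALL=$1 sort -u -- "$2" | wc -l | tr -d ' '
}

# Usage (file name taken from the attachment; locale availability is assumed):
#   unique_count zh_CN.UTF-8 examples_CJK_extensionA.txt
#   unique_count C           examples_CJK_extensionA.txt
```

If the two counts differ, the locale is treating distinct characters as duplicates.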


Regards,
An Yang

Comment 1 An Yang 2011-08-06 17:24:33 UTC

Created attachment 5880 [details]
example characters in CJK extension A.

Comment 2 An Yang 2011-08-07 17:42:44 UTC

I'm not sure whether this bug has anything to do with charmaps; it may or may not.
But the value of LC_COLLATE in zh_CN is:

% ISO 14651 collation sequence
LC_COLLATE
copy "iso14651_t1_pinyin"
END LC_COLLATE

I'm sure something is wrong in this table.

None of the erased Chinese characters has an entry in iso14651_t1_pinyin, but they are all included in CJK Unified Ideographs/ExtA/B/C/D.

Comment 3 An Yang 2011-08-08 16:54:28 UTC

There are 25496 Chinese characters in iso14651_t1_pinyin, most of them spread across CJK Unified Ideographs and CJK Unified Ideographs Extension A.

But there are 27552 Chinese characters in CJK Unified Ideographs and Extension A, so more than 2000 Chinese characters without pinyin are missing.

So my suggestion is simply to add the missing characters at the end of iso14651_t1_pinyin, in Unicode order.
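
As a rough sketch of how the missing characters could be listed in Unicode order: the helper below prints every code point in a range that has no entry in a collation table. It rests on assumptions not stated in this report: that each entry starts a line with a `<UXXXX>` symbol, and that only BMP ranges (four hex digits) are checked. The function name missing_symbols is hypothetical.

```shell
# Print every <UXXXX> symbol in the decimal code point range $2..$3 that
# does not start a line of the collation table file $1.
# Assumption: table entries begin with "<UXXXX>" at the start of a line.
# Only meaningful for BMP ranges, where symbols use four hex digits.
missing_symbols() {
    awk -v first="$2" -v last="$3" '
        # Remember the leading <U....> symbol of every entry line.
        match($0, /^<U[0-9A-F]+>/) { seen[substr($0, 1, RLENGTH)] = 1 }
        END {
            # Walk the range in code point order and report the gaps.
            for (cp = first + 0; cp <= last + 0; cp++) {
                sym = sprintf("<U%04X>", cp)
                if (!(sym in seen)) print sym
            }
        }' "$1"
}

# Usage for Extension A, U+3400..U+4DBF (13312..19903 in decimal):
#   missing_symbols iso14651_t1_pinyin 13312 19903
```

The output is only a list of symbols; the weights for each appended line would still have to follow whatever syntax the existing entries in the table use.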

Could you give me any feedback?

Comment 4 Wei-Lun Chao 2014-05-07 08:19:11 UTC

Confirmed, as reported in BZ#15616.
BZ#16905 is another approach, but it is untested.

Comment 5 Wei-Lun Chao 2014-11-19 04:13:12 UTC

Tested with the patch from bug 17563; the test passes.

Comment 6 Mingye Wang 2017-01-22 23:56:17 UTC

This bug is seen not only with extA characters, but also with simple punctuation and/or kana.

$ printf '%s\n' ， 。 ： ￥ あ か ア カ a b c , . : $ | LC_COLLATE=zh_CN.UTF-8 sort -u
,
:
.
$
，
a
b
c

(uniq does the same thing.)

It seems that glibc is just eating away anything not on that list. (What kind of equivalence assumption is that?)
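
To check a single pair of characters rather than a whole list, something like the following sketch could be used. It only wraps the same sort -u pipeline shown above; the helper name collates_equal is hypothetical, and the zh_CN.UTF-8 locale is assumed to be installed.

```shell
# Succeed (exit status 0) if strings $2 and $3 collate as equal under
# locale $1, that is, if `sort -u` keeps only one of the two lines.
# LC_ALL is used so that an LC_ALL already set in the environment cannot
# override the collation locale being tested.
collates_equal() {
    n=$(printf '%s\n' "$2" "$3" | LC_ALL=$1 sort -u | wc -l | tr -d ' ')
    [ "$n" -eq 1 ]
}

# Usage (locale availability is assumed):
#   collates_equal zh_CN.UTF-8 ， あ && echo "treated as duplicates"
```

In the C locale the check succeeds only for byte-identical strings, which makes it a convenient baseline.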

Comment 7 Mike FABIAN 2017-07-20 08:01:58 UTC

(In reply to Mingye Wang from comment #6)
> This bug is not only seen with extA characters, but also seen with simple
> punctuations and/or kanas. 
> 
> $ printf '%s\n' ， 。 ： ￥ あ か ア カ a b c , . : $ | LC_COLLATE=zh_CN.UTF-8 sort -u
> ,
> :
> .
> $
> ，
> a
> b
> c
> 
> (uniq does the same thing.)
> 
> It seems that glibc is just eating away anything not on that list. (What
> kind of equivalence assumption is that?)

This is caused by the collation symbol UNDEFINED not working correctly,
see:

https://sourceware.org/bugzilla/show_bug.cgi?id=18978