13063 – 'sort -u' will erase some Chinese characters

Summary: 'sort -u' will erase some Chinese characters

Status:	NEW

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	localedata
Version:	unspecified

Importance:	P2 critical
Target Milestone:	---
Assignee:	GNU C Library Locale Maintainers

URL:
Keywords:

Depends on:
Blocks:

Reported:	2011-08-06 17:21 UTC by An Yang
Modified:	2017-07-20 08:01 UTC
CC List:	3 users

See Also:
Host:
Target:
Build:
Last reconfirmed:

Flags:	fweimer: security-

Attachments
example characters in CJK extension A. (1.48 KB, text/plain) 2011-08-06 17:24 UTC, An Yang

Description An Yang 2011-08-06 17:21:16 UTC

Hi,

According to glibc/localedata/locales/zh_CN and iso14651_t1_pinyin (or
iso14651_t1), glibc only supports Unicode 3.0.

The current version of Unicode is 6.0. It extends CJK UNIFIED IDEOGRAPHS with
extensions A/B/C/D, and extension A is included in GB18030:2005 (the charset
standard for the China locale).

So, at a minimum, glibc should sort all Chinese characters in CJK UNIFIED IDEOGRAPHS and EXTENSION A (U+3400-U+4DBF).

The practical effect shows up with sort -u.
If you run sort -u examples_CJK_extensionA.txt (see attachment), you
will get only one Chinese character, "㑗".
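
One quick check that the input itself is intact is to count the distinct lines using byte comparison instead of locale collation. This is only a sketch: the helper name count_unique_bytewise is made up for illustration, and the sample characters are arbitrary picks from CJK Extension A rather than the contents of the attachment. With LC_ALL=C, sort compares raw bytes, so no collation table is consulted.

```shell
# Hypothetical helper (name invented for this example): count the distinct
# lines on stdin using plain byte comparison.
count_unique_bytewise() {
    # LC_ALL=C makes sort compare raw bytes, bypassing locale collation.
    LC_ALL=C sort -u | wc -l | tr -d ' '
}

# Three different CJK Extension A characters, one of them repeated.
printf '%s\n' 㐀 㐁 㑗 㐀 | count_unique_bytewise   # prints 3
```

If this count is larger than the number of lines that sort -u prints under zh_CN.UTF-8, the difference comes from collation and not from the file.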


Regards,
An Yang

Comment 1 An Yang 2011-08-06 17:24:33 UTC

Created attachment 5880
example characters in CJK extension A.

Comment 2 An Yang 2011-08-07 17:42:44 UTC

I'm not sure whether this bug has any relationship with charmaps; it may or may not.
But the value of LC_COLLATE in zh_CN is:

% ISO 14651 collation sequence
LC_COLLATE
copy "iso14651_t1_pinyin"
END LC_COLLATE

I'm sure something is wrong in this table.

None of the erased Chinese characters has an entry in iso14651_t1_pinyin, but they are all included in CJK Unified Ideographs/ExtA/B/C/D.

Comment 3 An Yang 2011-08-08 16:54:28 UTC

There are 25496 Chinese characters in iso14651_t1_pinyin, most of them distributed over CJK Unified Ideographs and CJK Unified Ideographs Extension A.

But there are 27552 Chinese characters in CJK Unified Ideographs and Extension A, so more than 2000 Chinese characters without pinyin are missing from the table.

So my suggestion is simply to add the missing characters at the end of iso14651_t1_pinyin, in Unicode order.
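
The suggestion above could be sketched roughly as follows. The assumptions here do not come from the glibc sources: the code points already present in the table are assumed to have been extracted beforehand into one <UXXXX> symbol per line (uppercase hex, four digits), and the helper name missing_in_range is invented for this example. The function prints the symbols in a range that are absent from that list, in code point order. Turning them into actual collation entries is left out.

```shell
# Hypothetical helper: print every <UXXXX> symbol between two code points
# that does NOT appear on stdin.
#   $1, $2: first and last code point, 4 hex digits, no prefix (e.g. 3400 4DBF)
#   stdin:  symbols already present in the table, one <UXXXX> per line
#           (assumed to have been extracted beforehand)
missing_in_range() {
    present=$(mktemp)
    LC_ALL=C sort -u > "$present"
    cp=$(( 0x$1 ))
    last=$(( 0x$2 ))
    # Generate the full range in order, then keep only the lines that are
    # not in the "present" list. Byte order equals numeric order here
    # because every symbol has the same width and uses uppercase hex.
    while [ "$cp" -le "$last" ]; do
        printf '<U%04X>\n' "$cp"
        cp=$(( cp + 1 ))
    done | LC_ALL=C comm -23 - "$present"
    rm -f "$present"
}

printf '%s\n' '<U3400>' '<U3402>' | missing_in_range 3400 3403
# prints <U3401> and <U3403>, one per line
```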

Could you give me any feedback?

Comment 4 Wei-Lun Chao 2014-05-07 08:19:11 UTC

Confirmed, as reported in BZ#15616.
BZ#16905 is another approach, but it is untested.

Comment 5 Wei-Lun Chao 2014-11-19 04:13:12 UTC

Tested with the patch from bug 17563, and it passes.

Comment 6 Mingye Wang 2017-01-22 23:56:17 UTC

This bug is seen not only with extA characters, but also with simple punctuation and/or kana.

$ printf '%s\n' ， 。 ： ￥ あ か ア カ a b c , . : $ | LC_COLLATE=zh_CN.UTF-8 sort -u
,
:
.
$
，
a
b
c

(uniq does the same thing.)

It seems that glibc is just eating away anything not on that list. (What kind of equivalence assumption is that?)
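
To see exactly which lines are being eaten, the output of sort -u under the locale can be compared against a bytewise sort -u. A minimal sketch, assuming GNU sort and comm; the helper name dropped_by_locale is invented for this example. Note that GNU sort -u skips the last-resort byte comparison that plain sort uses to break ties, so lines that collate as equal are collapsed into one.

```shell
# Hypothetical helper: list the lines that "sort -u" drops under a locale
# but that survive a plain bytewise "sort -u".
#   $1: locale name (e.g. zh_CN.UTF-8), $2: input file
dropped_by_locale() {
    all=$(mktemp)
    kept=$(mktemp)
    LC_ALL=C sort -u "$2" > "$all"
    # Re-sort the locale's output bytewise so comm sees a matching order.
    LC_ALL="$1" sort -u "$2" | LC_ALL=C sort > "$kept"
    # Lines present in the bytewise result but missing from the locale result.
    LC_ALL=C comm -23 "$all" "$kept"
    rm -f "$all" "$kept"
}

sample=$(mktemp)
printf '%s\n' ， 。 ： あ か a b c > "$sample"
dropped_by_locale zh_CN.UTF-8 "$sample"
rm -f "$sample"
```

The output depends on the installed locale data, so none is shown here. Under the C locale the function prints nothing.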

Comment 7 Mike FABIAN 2017-07-20 08:01:58 UTC

(In reply to Mingye Wang from comment #6)
> This bug is not only seen with extA characters, but also seen with simple
> punctuations and/or kanas. 
> 
> $ printf '%s\n' ， 。 ： ￥ あ か ア カ a b c , . : $ | LC_COLLATE=zh_CN.UTF-8 sort
> -u
> ,
> :
> .
> $
> ，
> a
> b
> c
> 
> (uniq does the same thing.)
> 
> It seems that glibc is just eating away anything not on that list. (What
> kind of equivalence assumption is that?)

This is caused by the collation symbol UNDEFINED not working correctly,
see:

https://sourceware.org/bugzilla/show_bug.cgi?id=18978