Bug 11837 - GB18030-2005 is not supported!
Summary: GB18030-2005 is not supported!
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: localedata
Version: unspecified
Importance: P2 critical
Target Milestone: 2.16
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-07-24 13:01 UTC by Chia-Pao Kuo
Modified: 2014-07-01 12:41 UTC
CC List: 7 users

See Also:
Host:
Target:
Build:
Last reconfirmed:
carlos_odonell: review+
fweimer: security-


Description Chia-Pao Kuo 2010-07-24 13:01:09 UTC
GB18030 is a character encoding standard in China. The latest version of GB18030 is
GB18030-2005, which contains Unicode CJK Ext-B (for instance U+20340, GB18030
bytes 95 32 D5 38). However, the GB18030 map of glibc is still GB18030-2000.
Comment 1 Ulrich Drepper 2011-05-09 23:14:45 UTC
Well, and where is the data?
Comment 2 Ulrich Drepper 2011-05-17 05:43:14 UTC
I've checked in a patch.
Comment 3 Andreas Schwab 2011-06-14 12:47:04 UTC
That doesn't appear to work.

$ printf "\xf0\xa0\xb3\x90\n" | iconv -t gb18030
iconv: illegal input sequence at position 0
Comment 4 Ulrich Drepper 2011-07-07 03:55:35 UTC
(In reply to comment #3)
> That doesn't appear to work.
> 
> $ printf "\xf0\xa0\xb3\x90\n" | iconv -t gb18030
> iconv: illegal input sequence at position 0

That's expected.  Previous mappings were wrong.  The official GB18030 mapping doesn't define a mapping for U+20CD0.
Comment 5 Andreas Schwab 2011-07-07 06:48:30 UTC
GB18030 defines a mapping for *every* Unicode character, even the unassigned/reserved ones.
Comment 6 Ulrich Drepper 2011-07-08 16:47:13 UTC
(In reply to comment #5)
> GB18030 defines a mapping for *every* Unicode character, even the
> unassigned/reserved ones.

It says how they would be mapped.  But this is not what converters are supposed to do.  The only official mappings available don't do that.
Comment 7 Andreas Schwab 2011-07-11 07:16:22 UTC
GB18030 is defined to map every Unicode character.
Comment 8 Rich Felker 2011-07-16 00:44:36 UTC
GB18030 is defined to map not just every Unicode *character*, but every *Unicode Scalar Value*. That means every number in the ranges 0x0000-0xD7FF and 0xE000-0x10FFFF is mapped. This property is what makes it a true UTF and not merely a legacy DBCS.

Mr. Drepper, if you claim GB18030 should not successfully map unassigned codepoints, what about the converters between UTF-8, UTF-16, and UTF-32? Should they also reject unassigned codepoints? Despite being horribly ugly and having all the harmful properties of legacy DBCS, GB18030 is a UTF and should be treated the same as other UTFs.
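
Editorial illustration (not part of the original comment): a minimal sketch of the algorithmic rule for the supplementary planes, assuming the commonly documented layout in which U+10000..U+10FFFF map linearly onto four-byte sequences starting at 90 30 81 30. The function name `gb18030_supplementary` is hypothetical and this is not glibc's implementation; it only shows why no per-character table is needed for these code points, assigned or not.

```python
def gb18030_supplementary(cp: int) -> bytes:
    """Encode a code point in U+10000..U+10FFFF as four GB18030 bytes.

    Assumption: the supplementary planes are mapped linearly, with
    U+10000 at 90 30 81 30. BMP code points need a table plus ranges
    and are deliberately not handled here.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points are handled")
    linear = cp - 0x10000
    # Four-byte sequences act as a mixed-radix number: bytes 2 and 4
    # take 10 values (0x30..0x39), byte 3 takes 126 values (0x81..0xFE).
    linear, b4 = divmod(linear, 10)
    linear, b3 = divmod(linear, 126)
    b1, b2 = divmod(linear, 10)
    return bytes([0x90 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


if __name__ == "__main__":
    # U+20CD0 is the character from comment 3 (UTF-8: f0 a0 b3 90).
    print(gb18030_supplementary(0x20CD0).hex(" "))  # → 95 34 ce 36
```

Under this rule the input from comment 3 has a well-defined four-byte encoding, which is the behaviour being requested in this bug.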
Comment 9 An Yang 2011-08-06 17:16:22 UTC
The system can convert or display all of the Chinese characters in Unicode 6.0 CJK Ext-A/B/C/D.

But glibc has a bug related to pinyin sorting: it cannot sort any characters in CJK Ext-A/B/C/D; it just drops all of them.

I'll file a new bug.
Comment 10 Ulrich Drepper 2011-10-29 17:18:38 UTC
Stop reopening this.  The canonical source for the conversion does exactly what the glibc code does.  Anything else does not have any value and only creates problems.
Comment 11 An Yang 2011-10-31 03:50:55 UTC
Hi Ulrich Drepper,

Take it easy.

I'm sure something is wrong in Fedora/RHEL and any other Linux distribution that uses glibc. Please see http://sourceware.org/bugzilla/show_bug.cgi?id=13063 and comment there.
Comment 12 Jun Huang 2011-11-21 14:37:19 UTC
Is it possible to roll back commit ee30c380b8f7c9253c87103c58c5201268d30181 "Update GB18030 to 2005 version"? Or maybe consider cherry-picking commit 2a57bd797c9a0f9d79436b8960019506c28c5889 "Repair GB18030 charmap" and commit 3d828a61cdc5ccd5e907e880cff45130169a543e "Fix more bugs in GB18030 charmap"? At the very least we need a working version.
Comment 13 Li Yang 2011-11-25 17:45:22 UTC
In fact, it worked well before this change was committed.
Comment 14 Bruno Haible 2012-01-26 16:25:49 UTC
As a result of this mess, openSUSE 12.1 is now shipping with yet another
GB18030 converter: the one by Anthony Fok <anthony@thizlinux.com>, 2002.
And it is broken as well: It cannot convert the character
U+C50B HANGUL SYLLABLE SSEUH to GB18030:

$ printf '\x00\x00\xc5\x0B' | LC_ALL=C /usr/bin/iconv -f UCS-4BE -t GB18030 | od -t x1 | head -n 1
/usr/bin/iconv: illegal input sequence at position 0
0000000

Expected output:

$ printf '\x00\x00\xc5\x0B' | LC_ALL=C /usr/bin/iconv -f UCS-4BE -t GB18030 | od -t x1 | head -n 1
0000000 83 32 da 36
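
Editorial illustration (not part of the original comment): a hedged sketch showing how the expected bytes above can be derived by arithmetic. It assumes, based on the commonly published GB18030 range table, that U+9FA6..U+D7FF form one linear four-byte range starting at 82 35 8F 33; the constants and function names are hypothetical and do not come from glibc.

```python
# Assumed range (from the commonly published GB18030 range table):
# U+9FA6..U+D7FF map linearly, starting at bytes 82 35 8F 33.
RANGE_FIRST_CP = 0x9FA6
RANGE_LAST_CP = 0xD7FF
RANGE_FIRST_BYTES = bytes.fromhex("82358f33")


def to_linear(b: bytes) -> int:
    """Turn a four-byte GB18030 sequence into its linear index."""
    return (((b[0] - 0x81) * 10 + (b[1] - 0x30)) * 126
            + (b[2] - 0x81)) * 10 + (b[3] - 0x30)


def from_linear(n: int) -> bytes:
    """Inverse of to_linear: linear index back to four bytes."""
    n, b4 = divmod(n, 10)
    n, b3 = divmod(n, 126)
    b1, b2 = divmod(n, 10)
    return bytes([0x81 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


def encode_in_range(cp: int) -> bytes:
    """Encode a code point that lies inside the assumed range."""
    if not RANGE_FIRST_CP <= cp <= RANGE_LAST_CP:
        raise ValueError("code point outside the assumed range")
    return from_linear(to_linear(RANGE_FIRST_BYTES) + (cp - RANGE_FIRST_CP))


if __name__ == "__main__":
    # U+C50B HANGUL SYLLABLE SSEUH, the character from this comment.
    print(encode_in_range(0xC50B).hex(" "))  # → 83 32 da 36
```

The result agrees with the expected `od` output, so a converter that implements the range rule has no reason to reject this character.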
Comment 15 Rich Felker 2012-01-26 17:33:45 UTC
> That's expected.  Previous mappings were wrong.  The official GB18030 mapping
> doesn't define a mapping for U+20CD0.

This is false. The official GB18030 defines a mapping for every Unicode Scalar Value, as it is a UTF. Why do you refuse the simple, standards-conformant fix that would make all of these issues go away?
Comment 16 Carlos O'Donell 2012-05-09 12:34:11 UTC
OK, we want to get this fixed for 2.16. Setting milestone.

Andreas, could you please post your patch to libc-alpha again? We'll do a quick review and then check it in as incremental progress. I'd like to see 2.16 have better support for GB18030.
Comment 17 Rich Felker 2012-05-09 14:38:30 UTC
This is yet another issue that Drepper refused to fix correctly. Can we please finally get WORKING support for GB18030 that treats it as a full UTF as specified by the standard and not just a mapping of assigned characters? This should also ensure that GB18030 support never needs fixes/maintenance again in the future (the whole point of being a UTF is that it's future-ready).
Comment 18 Carlos O'Donell 2012-05-10 20:23:28 UTC
I've reviewed Andreas' patch for this and I haven't found any problems, so it looks like we'll get this fixed for 2.16.
Comment 19 Andreas Schwab 2012-05-11 17:27:20 UTC
Fixed in 60cc4a1.