Summary: | GB18030-2005 is not supported! | ||
---|---|---|---|
Product: | glibc | Reporter: | Chia-Pao Kuo <byvoid.kcp> |
Component: | localedata | Assignee: | Not yet assigned to anyone <unassigned> |
Status: | RESOLVED FIXED | ||
Severity: | critical | CC: | an.euroford, bruno, bugdal, carlos_odonell, drepper.fsp, glibc-bugs, liyangdal |
Priority: | P2 | Flags: | carlos_odonell: review+, fweimer: security- |
Version: | unspecified | ||
Target Milestone: | 2.16 | ||
Host: | Target: | ||
Build: | Last reconfirmed: |
Description
Chia-Pao Kuo
2010-07-24 13:01:09 UTC
Well, and where is the data?

I've checked in a patch.

That doesn't appear to work.
$ printf "\xf0\xa0\xb3\x90\n" | iconv -t gb18030
iconv: illegal input sequence at position 0

(In reply to comment #3)
> That doesn't appear to work.
>
> $ printf "\xf0\xa0\xb3\x90\n" | iconv -t gb18030
> iconv: illegal input sequence at position 0
That's expected. Previous mappings were wrong. The official GB18030 mapping doesn't define a mapping for U+20CD0.

GB18030 defines a mapping for *every* Unicode character, even the unassigned/reserved ones.

(In reply to comment #5)
> GB18030 defines a mapping for *every* Unicode character, even the
> unassigned/reserved ones.
It says how they would be mapped. But this is not what converters are supposed to do. The only official mappings available don't do that.

GB18030 is defined to map every Unicode character.

GB18030 is defined to map not just every Unicode *character*, but every *Unicode Scalar Value*. That means every number in the ranges 0x0000-0xD7FF and 0xE000-0x10FFFF is mapped. This property is what makes it a true UTF and not merely a legacy DBCS. Mr. Drepper, if you claim GB18030 should not successfully map unassigned code points, what about the converters between UTF-8, UTF-16, and UTF-32? Should they also reject unassigned code points? Despite being horribly ugly and having all the harmful properties of a legacy DBCS, GB18030 is a UTF and should be treated the same as the other UTFs.

The system can convert and display all of the Chinese characters in Unicode 6.0 CJK Ext-A/B/C/D. But glibc has a bug related to pinyin sorting: it cannot sort any characters in CJK Ext-A/B/C/D, it just drops all of them. I'll file a new bug.

Stop reopening this. The canonical source for the conversion does exactly what the glibc code does. Anything else does not have any value and only creates problems.

Hi Ulrich Drepper, take it easy.
I'm sure something is wrong in Fedora/RHEL and any other Linux distribution that uses glibc. Please see http://sourceware.org/bugzilla/show_bug.cgi?id=13063 and comment there.

Is it possible to roll back commit ee30c380b8f7c9253c87103c58c5201268d30181 "Update GB18030 to 2005 version"? Or maybe consider cherry-picking commit 2a57bd797c9a0f9d79436b8960019506c28c5889 "Repair GB18030 charmap" and commit 3d828a61cdc5ccd5e907e880cff45130169a543e "Fix more bugs in GB18030 charmap"? At the least we need a workable version. In fact, it worked well before this change was committed.

As a result of this mess, openSUSE 12.1 is now shipping with yet another GB18030 converter: the one by Anthony Fok <anthony@thizlinux.com>, 2002. And it is broken as well. It cannot convert the character U+C50B HANGUL SYLLABLE SSEUH to GB18030:
$ printf '\x00\x00\xc5\x0B' | LC_ALL=C /usr/bin/iconv -f UCS-4BE -t GB18030 | od -t x1 | head -n 1
/usr/bin/iconv: illegal input sequence at position 0
0000000
Expected output:
$ printf '\x00\x00\xc5\x0B' | LC_ALL=C /usr/bin/iconv -f UCS-4BE -t GB18030 | od -t x1 | head -n 1
0000000 83 32 da 36

> That's expected. Previous mappings were wrong. The official GB18030 mapping
> doesn't define a mapping for U+20CD0.
This is false. The official GB18030 defines a mapping for every Unicode Scalar Value, as it is a UTF. Why do you refuse the simple, standards-conformant fix that would make all of these issues go away?
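
For reference, the four-byte arithmetic behind the two byte sequences discussed in this thread can be sketched in Python. This is only an illustration, not glibc's converter. The helper names are hypothetical, and the byte ranges (81-FE, 30-39, 81-FE, 30-39) and the anchor "U+10000 maps to 90 30 81 30" are assumptions taken from the commonly published description of GB18030, not from this report.

```python
# Sketch of GB18030 four-byte arithmetic (hypothetical helper names).
# Assumption: four-byte codes use the byte ranges 81-FE, 30-39, 81-FE, 30-39,
# and U+10000 corresponds to the four-byte code 90 30 81 30.

SUPPLEMENTARY_BASE = 0x10000


def linear_index(code: bytes) -> int:
    """Return the position of a four-byte code, counting from 81 30 81 30."""
    b1, b2, b3, b4 = code
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def four_byte_code(index: int) -> bytes:
    """Inverse of linear_index: turn a position back into four bytes."""
    index, b4 = divmod(index, 10)
    index, b3 = divmod(index, 126)
    b1, b2 = divmod(index, 10)
    return bytes([0x81 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


def encode_supplementary(code_point: int) -> bytes:
    """Map a code point in U+10000..U+10FFFF linearly onto four-byte codes."""
    if not SUPPLEMENTARY_BASE <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    base_index = linear_index(bytes([0x90, 0x30, 0x81, 0x30]))
    return four_byte_code(base_index + code_point - SUPPLEMENTARY_BASE)


if __name__ == "__main__":
    # The UTF-8 input from the failing iconv command decodes to U+20CD0.
    utf8_input = b"\xf0\xa0\xb3\x90"
    code_point = ord(utf8_input.decode("utf-8"))
    print(f"U+{code_point:04X} ->", encode_supplementary(code_point).hex(" "))
```

Under these assumptions the script prints `U+20CD0 -> 95 34 ce 36`. Every supplementary-plane code point gets a four-byte code from the arithmetic alone, whether or not a character is assigned to it. The same `linear_index` helper places the expected bytes `83 32 da 36` from the U+C50B example at position 28616 in the four-byte space.
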
OK, we want to get this fixed for 2.16. Setting milestone.

Andreas, could you please post your patch to libc-alpha again? We'll have a quick review and then check it in as incremental progress. I'd like to see 2.16 have better support for GB18030.

This is yet another issue that Drepper refused to fix correctly. Can we please finally get WORKING support for GB18030 that treats it as a full UTF as specified by the standard and not just a mapping of assigned characters? This should also ensure that GB18030 support never needs fixes/maintenance again in the future (the whole point of being a UTF is that it's future-ready).

I've reviewed Andreas' patch for this and I haven't found any problems, so it looks like we'll get this fixed for 2.16.

Fixed in 60cc4a1.

*** Bug 260998 has been marked as a duplicate of this bug. ***