This is the mail archive of the
mailing list for the glibc project.
Re: New GB18030 gconv module contributed by ThizLinux Laboratory
- From: Anthony Fok <anthony at thizlinux dot com>
- To: Ulrich Drepper <drepper at redhat dot com>
- Cc: libc-alpha at sources dot redhat dot com, Kevin Lau <kevin at thizlinux dot com>, Fai <fai at thizlinux dot com>, Sunny Gu <sunnygu at thizgroup dot com>, James Su <suzhe at gnuchina dot org>, yshao at redhat dot com, Markus Scherer <markus dot scherer at us dot ibm dot com>, Bruno Haible <haible at ilog dot fr>
- Date: Fri, 18 Jan 2002 03:22:40 +0800
- Subject: Re: New GB18030 gconv module contributed by ThizLinux Laboratory
- References: <20020116074546.GA17279@sunrise> <firstname.lastname@example.org> <20020117100203.GB23149@sunrise> <email@example.com>
On Thu, Jan 17, 2002 at 09:08:08AM -0800, Ulrich Drepper wrote:
> The problems you see are almost certainly stemming from the fact that
> the tables you use are containing invalid code positions. If a
> character is not defined in Unicode/ISO 10646 the converter must not
> accept it. The GB18030 standard might already define what happens if
> these code positions appear in a source but this can only mean that
> they are prepared for the time when these code positions are defined.
As mentioned in the previous message, "undefined" != "invalid".
Or, in the terminology of the Unicode Consortium,
"unassigned" != "illegal". Unicode Technical Report #22, Character Mapping
Markup Language (CharMapML), has more information on this.
Besides explaining the difference between illegal and unassigned, with
examples, the author also notes that:
Especially because unassigned characters may actually come from a more
recent version of the character encoding, it is often important to
preserve round-trip mappings if possible.
> Yu Shao had all these positions defined in his first version and I
> assume all the provided test files were accepted. I had to send him
> back and redo everything so that only characters which appear in
> are accepted. All other characters are invalid.
All of U+0000..U+D7FF, U+E000..U+FFFE are legal (i.e. valid), whether
they are currently assigned or not.
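
To make the legal/unassigned distinction concrete, here is a minimal Python sketch. It is only an illustration, not code from gb18030.c or any converter discussed in this thread, and the function names are hypothetical. It uses the wider ranges U+0000..U+D7FF and U+E000..U+10FFFF, and it treats the general category "Cn" in Python's bundled Unicode database as "unassigned" (that category also covers noncharacters).

```python
import unicodedata


def is_legal(cp: int) -> bool:
    """True for Unicode scalar values: everything except surrogates
    and values beyond U+10FFFF. This never changes between versions."""
    return 0x0000 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def is_assigned(cp: int) -> bool:
    """True if this Python build's Unicode database assigns a character
    to cp. The answer depends on the Unicode version, unlike is_legal()."""
    return unicodedata.category(chr(cp)) != "Cn"


def classify(cp: int) -> str:
    """Return 'illegal', 'unassigned', or 'assigned' for a code point."""
    if not is_legal(cp):
        return "illegal"
    return "assigned" if is_assigned(cp) else "unassigned"
```

The point of the split is that is_legal() is a fixed property of the encoding, whereas is_assigned() is a moving target that grows with every Unicode release.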
Also, look at gb-18030-2000.xml, which is the _official_ mapping data of
Unicode<->GB18030, prepared by Markus Scherer and other Unicode Consortium
> Look at your converter. If it does anything different it needs to be
> fixed. And once this is done I hope it does the same as the converter
> which I added yesterday.
Think of it this way. In a sense, GB18030 is the Chinese equivalent of
UTF-8. UTF-8 is designed to preserve ASCII compatibility, whereas
GB18030 is designed to preserve GB2312/GBK compatibility. They are
functionally equivalent, one for China, and one for the world.
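
The comparison with UTF-8 can be illustrated for the supplementary planes, where GB18030 maps every code point from U+10000 to U+10FFFF to a four-byte sequence by plain arithmetic, with no table lookup and no notion of whether the code point is assigned. The following Python sketch shows that arithmetic; the function name is hypothetical and the code is an illustration rather than an excerpt from any glibc module.

```python
def gb18030_supplementary(cp: int) -> bytes:
    """Encode a supplementary-plane code point as four GB18030 bytes.

    Four-byte GB18030 sequences have the byte ranges
    0x81..0xFE, 0x30..0x39, 0x81..0xFE, 0x30..0x39, so they form a
    mixed-radix number with radices (126, 10, 126, 10).
    U+10000 corresponds to the sequence 0x90 0x30 0x81 0x30.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    # Linear index of 0x90308130 is (0x90 - 0x81) * 10 * 126 * 10 = 189000.
    linear = cp - 0x10000 + 189000
    linear, b4 = divmod(linear, 10)
    linear, b3 = divmod(linear, 126)
    b1, b2 = divmod(linear, 10)
    return bytes([0x81 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])
```

For example, gb18030_supplementary(0x10000) yields the bytes 90 30 81 30, and the last code point, U+10FFFF, yields E3 32 9A 35.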
Given that, for a text file containing U+0000..U+D7FF and U+E000..U+10FFFF,
    iconv -f ucs4 -t utf8 all-legal-unicode-codepoints.txt
completes without error, I don't see any reason why
    iconv -f ucs4 -t gb18030 all-legal-unicode-codepoints.txt
should not complete without error as well.
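
A test file like the one named above could be produced with a short Python sketch such as the following. This is one possible way to build it, not necessarily how the original file was made. It assumes that iconv's "ucs4" input is a sequence of big-endian 32-bit values with no byte order mark; the function names are hypothetical.

```python
import struct


def all_legal_codepoints_ucs4() -> bytes:
    """Return every code point in U+0000..U+D7FF and U+E000..U+10FFFF
    as consecutive big-endian 32-bit integers (assumed UCS-4 layout)."""
    ranges = (range(0x0000, 0xD800), range(0xE000, 0x110000))
    return b"".join(struct.pack(">I", cp) for r in ranges for cp in r)


def write_test_file(path: str = "all-legal-unicode-codepoints.txt") -> None:
    """Write the test data to path (about 4.4 MB)."""
    with open(path, "wb") as f:
        f.write(all_legal_codepoints_ucs4())
```

Calling write_test_file() creates the input for the two iconv commands above. The surrogate range U+D800..U+DFFF is skipped on purpose, since those values are not legal in UCS-4 text.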
Determining whether a Unicode codepoint is assigned or not is not the
job of gb18030.c. Besides, glibc's GB18030 module should follow the
internationally recognized official standards agreed upon by both the
Unicode Consortium and the Chinese standards committee.
Anthony Fok Tung-Ling
ThizLinux Laboratory <firstname.lastname@example.org> http://www.thizlinux.com/
Debian Chinese Project <email@example.com> http://www.debian.org/intl/zh/
Come visit Our Lady of Victory Camp! http://www.olvc.ab.ca/