This is the mail archive of the
libc-alpha@sources.redhat.com
mailing list for the glibc project.
Re: New GB18030 gconv module for glibc (from ThizLinux Laboratory)
On Fri, Jan 18, 2002 at 10:39:47AM +1000, Yu Shao wrote:
> The new gb18030.c will report "illegal input sequence" with 0x33ff,
> which version of gb18030.c are you using, I think you were confused by
> yourself.
No, this is wrong. As I have said, in the Unicode standard, U+33FF
is "legal" but "unassigned". If gb18030.c says it is "illegal", it is
glibc's bug.
> Actually, there are some problems with the standard test file like
> user3.txt. The current user3.txt has a17f, a27f .... a77f, those are all
> invalid according to the GB18030 standard itself.
No, no, no. Those are "legal" (i.e. valid) but "unassigned". Calling
these characters "invalid" shows a lack of understanding of GB18030 and
Unicode, and mapping them to "illegal" (i.e. "\x00\x00") is a bug.
What this means is that the _first_ version you submitted to Ulrich,
in which you defined mappings for characters like 0x33FF, was likely
to be correct.
And, as I am sure you understood from reading the GB18030 Standard,
GB18030 is intended to be the PRC's version of "UTF", functionally
similar to UTF-8.
Again, I ask, if "iconv -f ucs4 -t utf8" works for all codepoints
(U+0000..U+D7FF, U+E000..U+10FFFF), why can't
"iconv -f ucs4 -t gb18030" do the same? This is _inconsistent_.
If you insist on mapping U+33FF to _illegal_ (which is wrong), then I
will only rest my case if "iconv -f ucs4 -t utf8" also returns "illegal
input sequence" for U+33FF and other unassigned characters. At least
that way the behaviour would be consistent.
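To illustrate the difference between "unassigned" and "illegal", here
is a small Python sketch. It is only an illustration: Python's built-in
codecs stand in for glibc's iconv, and the helper names is_scalar_value
and can_encode are made up for this example. The point is that whether
UTF-8 accepts a codepoint depends only on the range it falls in, not on
whether a character has been assigned to it.

```python
def is_scalar_value(cp: int) -> bool:
    """True for U+0000..U+D7FF and U+E000..U+10FFFF (hypothetical helper)."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def can_encode(cp: int, codec: str = "utf-8") -> bool:
    """Report whether Python's codec accepts the codepoint (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        return False  # outside the Unicode codespace altogether
    try:
        chr(cp).encode(codec)
        return True
    except UnicodeEncodeError:
        return False  # e.g. surrogates U+D800..U+DFFF under UTF-8


if __name__ == "__main__":
    # U+33FF was unassigned when this thread was written; U+D800 is a surrogate.
    for cp in (0x33FF, 0xD800, 0x10FFFF):
        print(f"U+{cp:04X} scalar={is_scalar_value(cp)} utf-8={can_encode(cp)}")
```

The same check can be run against another codec by passing its name as
the second argument, which makes it easy to compare two converters over
the same codepoints.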
Now, I couldn't care less whether your gb18030.c or my gb18030.c is
used, as long as the final gb18030.c does the _right_ thing, which is
to be able to map the _entire_ range of valid Unicode/ISO-10646-1
codepoints (U+0000..U+D7FF, U+E000..U+10FFFF), as intended by the
GB18030 Standard.
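As a sketch of how GB18030 can cover the whole range: to my
understanding, the supplementary planes (U+10000..U+10FFFF) are mapped
linearly onto four-byte sequences starting at 0x90 0x30 0x81 0x30, with
no per-character table needed. The function name below is hypothetical
and this is not the code from either gb18030.c; it only shows the
arithmetic.

```python
def gb18030_supplementary(cp: int) -> bytes:
    """Four-byte GB18030 form of a supplementary-plane codepoint (sketch)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("expected a codepoint in U+10000..U+10FFFF")
    index = cp - 0x10000
    # Bytes 2 and 4 take 10 values (0x30..0x39); byte 3 takes 126 (0x81..0xFE).
    index, b4 = divmod(index, 10)
    index, b3 = divmod(index, 126)
    b1, b2 = divmod(index, 10)
    return bytes([0x90 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


if __name__ == "__main__":
    for cp in (0x10000, 0x10FFFF):
        print(f"U+{cp:X} -> {gb18030_supplementary(cp).hex(' ')}")
```

Because the mapping is pure arithmetic, every codepoint in that range
gets a byte sequence whether or not a character is assigned to it.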
And please, Markus Scherer, through IBM and the Unicode Consortium, is
actively involved with the GB18030 standard as well as with Unicode /
ISO-10646-1. He and Dirk Meyer are two of the most authoritative sources
on GB18030 outside of Mainland China. Please consider for a moment that
he may be right.
If we cannot agree, perhaps we can ask a member on the GB18030 standards
committee to tell us who is correct?
Best regards,
Anthony
--
Anthony Fok Tung-Ling
ThizLinux Laboratory <anthony@thizlinux.com> http://www.thizlinux.com/
Debian Chinese Project <foka@debian.org> http://www.debian.org/intl/zh/
Come visit Our Lady of Victory Camp! http://www.olvc.ab.ca/