This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.



Re: New GB18030 gconv module for glibc (from ThizLinux Laboratory)


On Fri, Jan 18, 2002 at 10:39:47AM +1000, Yu Shao wrote:
> The new gb18030.c will report "illegal input sequence" with 0x33ff, 
> which version of gb18030.c are you using, I think you were confused by 
> yourself.

No, this is wrong.  As I have said, in the Unicode standard, U+33FF
is "legal" but "unassigned".  If gb18030.c says it is "illegal", it is
glibc's bug.

> Actually, there are some problems with the standard test file like 
> user3.txt. The current user3.txt has a17f, a27f .... a77f, those are all 
> invalid according to the GB18030 standard itself.

No, no no.  Those are "legal=valid" and "unassigned".  Calling these
characters "invalid" shows a lack of understanding of GB18030 and Unicode,
and mapping these characters to "illegal" (i.e. "\x00\x00") is a bug.

What this means is that the _first_ version you submitted to Ulrich,
in which you defined mappings for characters like 0x33FF, was likely
to be correct.

And, as you must have understood from reading the GB18030 Standard,
GB18030 is intended to be the PRC's version of "UTF", functionally
similar to UTF-8.
Again, I ask, if "iconv -f ucs4 -t utf8" works for all codepoints
(U+0000..U+D7FF, U+E000..U+10FFFF), why can't
"iconv -f ucs4 -t gb18030" do the same?  This is _inconsistent_.
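
To make the consistency argument concrete, here is a minimal sketch.
It is not glibc code: it assumes Python 3 and its built-in "utf-8" and
"gb18030" codecs, and the helper names are made up for illustration.
It checks that sample codepoints, including U+33FF, survive a round
trip through both encodings.

```python
# Sketch only: uses Python's built-in codecs, not glibc's gconv modules.

def is_scalar_value(cp):
    """True for U+0000..U+D7FF and U+E000..U+10FFFF (no surrogates)."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF

def round_trips(cp, codec):
    """Encode one codepoint with `codec`, decode it, and compare."""
    ch = chr(cp)
    return ch.encode(codec).decode(codec) == ch

# Hypothetical sample set: ASCII, the disputed U+33FF, a common Han
# character, and the first and last supplementary-plane codepoints.
SAMPLES = [0x0041, 0x33FF, 0x4E2D, 0x10000, 0x10FFFF]

for cp in SAMPLES:
    assert is_scalar_value(cp)
    print(f"U+{cp:04X}",
          "utf-8:", round_trips(cp, "utf-8"),
          "gb18030:", round_trips(cp, "gb18030"))
```

A converter that treats "unassigned" as "illegal" would fail this kind
of check for one encoding but not the other.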

If you insist on mapping U+33FF to _illegal_ (which is wrong), then I
will only rest my case if "iconv -f ucs4 -t utf8" also returns "illegal
input sequence" for U+33FF and other unassigned characters.  At least
that way it would be consistent.

Now, I couldn't care less whether your gb18030.c or my gb18030.c is used,
as long as the final gb18030.c does the _right_ thing, which is to be
able to map the _entire_ range of valid Unicode/ISO-10646-1 codepoints
(U+0000..U+D7FF, U+E000..U+10FFFF), as intended by the GB18030
Standard.
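
As an illustration of why full coverage is feasible, the supplementary
planes (U+10000..U+10FFFF) need no table at all: their four-byte
GB18030 codes follow linearly from 0x90 0x30 0x81 0x30.  The sketch
below is an assumption-laden example, not taken from either gb18030.c;
the function and constant names are hypothetical.

```python
# Sketch only: algorithmic four-byte GB18030 codes for U+10000..U+10FFFF.
# Four-byte codes count like a mixed-radix number: byte 1 is 0x81..0xFE,
# byte 2 is 0x30..0x39, byte 3 is 0x81..0xFE, byte 4 is 0x30..0x39.

# Linear index of 0x90 0x30 0x81 0x30, the code for U+10000:
# ((0x90 - 0x81) * 10 + 0) * 126 * 10 = 189000
SUPPLEMENTARY_BASE = 189000

def gb18030_supplementary(cp):
    """Return the four GB18030 bytes for a supplementary-plane codepoint."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("codepoint outside U+10000..U+10FFFF")
    index = SUPPLEMENTARY_BASE + (cp - 0x10000)
    index, b4 = divmod(index, 10)    # last byte: 10 possible values
    index, b3 = divmod(index, 126)   # third byte: 126 possible values
    b1, b2 = divmod(index, 10)       # second byte: 10 possible values
    return bytes([0x81 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])

print(gb18030_supplementary(0x10000).hex())
print(gb18030_supplementary(0x10FFFF).hex())
```

The BMP still needs a table for the two-byte codes, with the remaining
codepoints filled in by four-byte ranges.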

And please, Markus Scherer, through IBM and the Unicode Consortium, is
actively involved with the GB18030 standard as well as with Unicode /
ISO-10646-1.  He and Dirk Meyer are two of the most authoritative sources
on GB18030 outside of Mainland China.  Just think for a moment that he
may just be right.

If we cannot agree, perhaps we can ask a member on the GB18030 standards
committee to tell us who is correct?

Best regards,

Anthony

-- 
Anthony Fok Tung-Ling
ThizLinux Laboratory   <anthony@thizlinux.com> http://www.thizlinux.com/
Debian Chinese Project <foka@debian.org>       http://www.debian.org/intl/zh/
Come visit Our Lady of Victory Camp!           http://www.olvc.ab.ca/

