This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.



Re: New GB18030 gconv module for glibc (from ThizLinux Laboratory)


Anthony Fok wrote:

>Hello Yu Shao,
>
>On Thu, Jan 17, 2002 at 10:35:12PM +1000, Yu Shao wrote:
>
>>>Not sure if it is a problem with /usr/bin/iconv or GB18030.so:
>>>when I tried your module, both old and new, on the Chinese sample
>>>documents:
>>>
>>>	$ iconv -f gb18030 -t ucs2 four.txt
>>>	iconv: illegal input sequence at position 32
>>>	$ iconv -f gb18030 -t ucs2 wei.txt
>>>	iconv: illegal input sequence at position 0
>>>	$ iconv -f gb18030 -t ucs2 zang.txt
>>>	iconv: illegal input sequence at position 0
>>>	$ iconv -f gb18030 -t ucs2 wei.txt
>>>	iconv: illegal input sequence at position 0
>>>	$ iconv -f gb18030 -t ucs2 yi.txt
>>>	iconv: illegal input sequence at position 0
>>>
>>>If the first line is trimmed, the illegal sequence appears at 27420 for
>>>four.txt, etc.  It appears to me that your tables only cover the bare
>>>minimum required by the Chinese Standards Committee, but this is not
>>>quite right.  GB18030 is supposed to be like UTF-8: it is an encoding
>>>that covers the entire repertoire of ISO-10646-1 while remaining
>>>compatible with GB2312 and GBK.  It should be able to convert to and from
>>>all Unicode codepoints, i.e. U+0000..U+D7FF, U+E000..U+FFFF,
>>>U+10000..U+10FFFF.
>>>
>>The character at position 32 of four.txt is 0x8139EE38, whose
>>Unicode value is 0x33FF. If you look at the Unicode table, 0x33FF is an
>>undefined, invalid value. Actually, the same is true of the other four
>>test files. Converting a GB code like 0x8139EE38 to a nonexistent Unicode
>>value really means nothing.
>>
>
>I beg to differ.  An "undefined" value is not "invalid".  An undefined
>value today does not mean it won't be defined in the future.  By your
>reasoning, converting a file containing U+33FF from UCS2 to UTF-8
>should also fail, and yet it doesn't, and for very good reason.
>Besides, if U+33FF were invalid, why on earth would there be a
>GB+8139EE38 that maps to U+33FF?  You mean the Chinese GB18030 standard
>committee intentionally put an invalid code in their _standard_ test
>files?
>
The new gb18030.c will report "illegal input sequence" for 0x33FF.
Which version of gb18030.c are you using? I think you have confused
yourself.
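
For reference, the GB+8139EE38 to U+33FF mapping discussed above can be
checked independently of glibc. This is only a minimal sketch, assuming
Python 3 and its built-in "gb18030" codec rather than the gconv module
under review; the helper name four_byte_linear_index is made up for
illustration.

```python
def four_byte_linear_index(seq: bytes) -> int:
    """Return the linear index of a GB18030 four-byte sequence.

    Hypothetical helper for illustration. Byte ranges are
    0x81-0xFE, 0x30-0x39, 0x81-0xFE, 0x30-0x39.
    """
    if len(seq) != 4:
        raise ValueError("expected exactly four bytes")
    b1, b2, b3, b4 = seq
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a well-formed four-byte sequence")
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


seq = bytes.fromhex("8139EE38")

# Decode with Python's codec (an assumption: this is not glibc's iconv).
ch = seq.decode("gb18030")
print(f"GB+{seq.hex().upper()} -> U+{ord(ch):04X}")

# The reverse direction should give back the same four bytes.
print(ch.encode("gb18030") == seq)

# Position of this sequence in the four-byte code space.
print(four_byte_linear_index(seq))
```

Whether a converter should accept a code point that is unassigned in a
given Unicode version is the policy question argued in this thread; the
sketch only shows what one other implementation does.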


Actually, there are some problems with the standard test files, such as
user3.txt. The current user3.txt contains a17f, a27f .... a77f, all of
which are invalid according to the GB18030 standard itself.
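
The point about a17f .... a77f follows from the two-byte structure of
GB18030: the trail byte must lie in 0x40-0x7E or 0x80-0xFE, so 0x7F is
never a valid second byte. A rough sketch of that check in Python; the
function name is_valid_two_byte is hypothetical, and it tests only the
byte ranges, not whether a character is assigned at that position.

```python
def is_valid_two_byte(lead: int, trail: int) -> bool:
    """Structural check for a GB18030 two-byte sequence.

    Lead byte: 0x81-0xFE. Trail byte: 0x40-0x7E or 0x80-0xFE.
    Hypothetical helper; it does not consult any mapping table.
    """
    lead_ok = 0x81 <= lead <= 0xFE
    trail_ok = 0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFE
    return lead_ok and trail_ok


# The sequences mentioned for user3.txt: a17f, a27f .... a77f.
for lead in range(0xA1, 0xA8):
    print(f"{lead:02x}7f", is_valid_two_byte(lead, 0x7F))
```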



-- 
Yu Shao
Red Hat Asia-Pacific
+61 7 3872 4835
Legal:   http://apac.redhat.com/disclaimer



