This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.



Re: New GB18030 gconv module for glibc (from ThizLinux Laboratory)


I am sorry for writing another certainly offensive email -

It is not unusual to transport, process, and map unassigned codes in 
various charsets including Unicode.

U+33ff is not "incorrect input"; the fact that it is not mentioned in
UnicodeData.txt only means that it has no assigned character and
therefore has only default properties. A converter for UTF-8, SCSU, or
GB 18030 must convert it. For the UTFs, this is even a Unicode
conformance requirement.
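
As a minimal illustration of that point (not taken from glibc or ICU), the
following Python sketch round-trips code points through UTF-8 and GB 18030
using Python's built-in codecs. The helper name round_trips is hypothetical,
and U+0378 is included as a second sample on the assumption that it is
unassigned in the Unicode version bundled with the interpreter. Neither
codec needs to know whether a code point is assigned in order to convert it.

```python
def round_trips(code_point: int, encoding: str) -> bool:
    """Return True if the code point survives encode followed by decode."""
    text = chr(code_point)
    # encode() raises UnicodeEncodeError if the codec cannot map the code point.
    encoded = text.encode(encoding)
    return encoded.decode(encoding) == text


# U+33FF is the code point under discussion; U+0378 is assumed to be unassigned.
for cp in (0x33FF, 0x0378):
    for enc in ("utf-8", "gb18030"):
        print(f"U+{cp:04X} via {enc}: {round_trips(cp, enc)}")
```

A converter that rejected either code point would make the round trip fail
with an exception instead of returning the original text.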

In fact, U+33ff (like every other unassigned or non-character code point)
does have a Unicode general category value: "Cn". Such code points are not
mentioned in UnicodeData.txt because the definition of the Unicode
character database says that they are not listed there.

You can see this in the description of the general categories in 
http://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.html :
For general category Cn it says "Other, Not Assigned (no characters in the 
file have this property)". This means that everything that is not in 
UnicodeData.txt has Cn.
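
The rule "not in the file means Cn" can be sketched as a lookup with a
default. This is only an illustration: the names SAMPLE_UNICODE_DATA,
load_categories, and general_category are hypothetical, the sample holds just
two records, and the First/Last range pairs used in the real UnicodeData.txt
are not handled here.

```python
# Two sample records in UnicodeData.txt format: fields are separated by ';',
# field 0 is the code point in hex, field 2 is the general category.
SAMPLE_UNICODE_DATA = """\
0020;SPACE;Zs;0;WS;;;;;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
"""


def load_categories(data: str) -> dict[int, str]:
    """Map each listed code point to its general category."""
    table = {}
    for line in data.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        table[int(fields[0], 16)] = fields[2]
    return table


def general_category(code_point: int, table: dict[int, str]) -> str:
    # A code point absent from the file is "Cn" (not assigned), not invalid.
    return table.get(code_point, "Cn")


table = load_categories(SAMPLE_UNICODE_DATA)
print(general_category(0x0041, table))  # Lu
print(general_category(0x33FF, table))  # Cn (not present in the sample)
```

With the two-record sample, every other code point reports Cn; that reflects
the sample, not the real database.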

Another Unicode conformance requirement says that you must pass through 
(and not throw away or corrupt) code points that you don't know anything 
about if you purport to not modify the contents of the text.

Non-Unicode examples for handling unassigned code points include most East 
Asian charsets with their areas for user-defined characters, private use, 
"reserved" etc. They are frequently mapped, e.g. in GBK<->Unicode to/from 
parts of the Unicode private-use areas.


My understanding is that a GB 18030 converter that does not handle 
unassigned-but-legal codes will not pass certification (but I am also not 
an expert in certification).

markus


Markus Scherer  IBM GCoC-Unicode/ICU  San José, CA
markus.scherer@us.ibm.com (also for SameTime)





Ulrich Drepper <drepper@redhat.com>
Sent by: drepper@myware.mynet
01/17/2002 06:03 PM
Please respond to drepper

 
        To:     Anthony Fok <anthony@thizlinux.com>
        cc:     Yu Shao <yshao@redhat.com>, libc-alpha@sources.redhat.com, 
kevin@thizlinux.com, fai@thizlinux.com, sunnygu@thizgroup.com, 
suzhe@gnuchina.org, Markus Scherer/Cupertino/IBM@IBMUS, Bruno Haible 
<haible@ilog.fr>
        Subject:        Re: New GB18030 gconv module for glibc (from ThizLinux Laboratory)

 

Anthony Fok <anthony@thizlinux.com> writes:

> No, this is wrong.  As I have said, in the Unicode standard, U+33FF
> is "legal" but "unassigned".  If gb18030.c says it is "illegal", it is
> glibc's bug.

No.  This is an incorrect input.  Period.  There is no discussion
about it.  I've already said that no character which is not in the
current UnicodeData list must be converted.

-- 
---------------.                          ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Red Hat          `--' drepper at redhat.com   `------------------------



