This is the mail archive of the mailing list for the glibc project.


Re: New GB18030 gconv module for glibc (from ThizLinux Laboratory)

I agree with what Anthony said about mapping code points: Even if they do 
not have assigned characters, their mappings are defined. This is true for 
all Unicode code points except _single_ surrogate code points.
Mapping _from_ GB 18030 may sometimes result in "unassigned" handling 
because some 4-byte GB 18030 sequences are defined but do not have 
mappings to Unicode.
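
To make the point above concrete, here is a minimal sketch of the four-byte arithmetic for supplementary code points. It assumes the linear layout of the published mapping tables, in which U+10000 corresponds to the sequence 0x90 0x30 0x81 0x30. The function and constant names are made up for illustration and are not part of glibc, ICU, or the standard.

```python
# Sketch only: hypothetical helper names, not a glibc or ICU API.
# A well-formed four-byte sequence has lead/third bytes 0x81..0xFE
# and second/fourth bytes 0x30..0x39.

FOUR_BYTE_TOTAL = 126 * 10 * 126 * 10  # number of well-formed four-byte sequences


def linear(b1, b2, b3, b4):
    """Return the linear index of a four-byte sequence."""
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


# Assumption: U+10000 is assigned the sequence 0x90 0x30 0x81 0x30.
SUPP_BASE = linear(0x90, 0x30, 0x81, 0x30)


def supplementary_to_gb18030(cp):
    """Map any supplementary code point, assigned or not, to four bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    n = SUPP_BASE + (cp - 0x10000)
    n, b4 = divmod(n, 10)
    n, b3 = divmod(n, 126)
    b1, b2 = divmod(n, 10)
    return bytes([0x81 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


# Four-byte sequences that sort after the one for U+10FFFF are well-formed
# but have no Unicode counterpart, so a converter has to treat them as
# "unassigned".
LAST_MAPPED = linear(*supplementary_to_gb18030(0x10FFFF))
UNMAPPED_TAIL = FOUR_BYTE_TOTAL - (LAST_MAPPED + 1)
```

The mapping depends only on the numeric value of the code point, which is why it is defined even where no character has been assigned.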

Dirk's and my publications on this are based on a printed version of the GB 
18030 standard from 2000 (plus the published electronic mapping tables) 
and on following discussions about the standard as closely as possible. (I 
do not read/speak Chinese, but Dirk does; our companies had Chinese 
representatives that were in frequent discussion with the Chinese 
standards agency.)

Note that the supplementary Unicode code points U+10000..U+10ffff were 
_designated_ in Unicode 2.0 (1996), with the pseudo-assignment of 
128*1024-4 of those code points (U+f0000..U+ffffd and U+100000..U+10fffd) 
as a Private-Use Area.
Unicode 3.1 did not invent this supplementary range but was "only" the 
first Unicode version that assigned "real" characters to such code points 
(and assigned more than 40,000 of them).
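
As a quick check of the numbers in this paragraph, the following sketch counts the code points in the two supplementary private-use ranges quoted above. The variable names are illustrative only.

```python
# Sketch: count the supplementary private-use code points.
# The two ranges are taken from the text; each plane's last two
# code points (xFFFE and xFFFF) are excluded from the ranges.
pua_ranges = [(0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

pua_count = sum(last - first + 1 for first, last in pua_ranges)

# Total size of the supplementary range U+10000..U+10FFFF, for comparison.
supplementary_count = 0x10FFFF - 0x10000 + 1

print(pua_count, supplementary_count)
```

The private-use count agrees with 128*1024-4, since two planes of 64*1024 code points each lose their last two code points.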

Note also that formally GB 18030 defines mappings to ISO 10646, not 
Unicode. One of the differences is the publication schedule. Supplementary 
character assignments were published only in December 2001 with ISO 10646 
part 2, which synchronized with Unicode 3.1 several months after its 


Markus Scherer  IBM GCoC-Unicode/ICU  San José, CA (also for SameTime)
