This is the mail archive of the
mailing list for the glibc project.
Re: New GB18030 gconv module for glibc (from ThizLinux Laboratory)
- From: ha shao <hashao at linuxstar dot hypermart dot net>
- To: Roger So <roger dot so at sw-linux dot com>
- Cc: Ulrich Drepper <drepper at redhat dot com>,Markus Scherer <markus dot scherer at us dot ibm dot com>,Anthony Fok <anthony at thizlinux dot com>, fai at thizlinux dot com,Bruno Haible <haible at ilog dot fr>, kevin at thizlinux dot com,libc-alpha at sources dot redhat dot com, sunnygu at thizgroup dot com,suzhe at gnuchina dot org, Yu Shao <yshao at redhat dot com>
- Date: Sun, 20 Jan 2002 21:03:42 +0800
- Subject: Re: New GB18030 gconv module for glibc (from ThizLinux Laboratory)
- References: <OFA37FFE2D.E724FA8B-ON88256B44.006732B8@raleigh.ibm.com> <firstname.lastname@example.org> <1011456885.1311.22.camel@foobar>
- Reply-to: hashao at hashao dot hypermart dot net
On Sun, Jan 20, 2002 at 12:14:45AM +0800, Roger So wrote:
> On Fri, 2002-01-18 at 05:41, Ulrich Drepper wrote:
> > "Markus Scherer" <email@example.com> writes:
> > > I agree with what Anthony said about mapping code points: Even if they do
> > > not have assigned characters,
> > It is completely irrelevant what you think. The converters convert
> > from the external charset to the internal private charset. The latter
> > is defined in a way which disallows any non-Unicode position. What
> > you do with your own code I don't care; but stay out of discussions
> > like this when they are related to glibc.
> May I ask where does it say that the "converters convert from the
> external charset to the internal private charset"? In fact, in "The GNU
Are we talking about the iconv interface or the conversion
between multibyte characters and *wide characters* (wchar_t)?
I think the two should be treated differently, as explained below.
For the iconv interface, it is reasonable to ask that all of the
code space defined by the Unicode standard be available, whether
or not a given code point has been assigned a character yet. Of
course it is nonsense to map from an arbitrary charset code to an
unassigned Unicode point.
But GB18030 is not just another arbitrary charset (or coded charset,
whatever). GB18030 is intended to be a bridge from GB2312 to Unicode.
It promises to remain, in a sense, compatible with Unicode in the future.
The unassigned code points in GB18030 are not arbitrarily mapped
to Unicode. The mapping, or the algorithm for the mapping, is
already fixed. There is no need to worry about polluting the Unicode
code space by mapping unassigned GB18030 points to unassigned
Unicode points.
For the wchar_t data type, the encoding is implementation-specific.
Glibc just happened to choose Unicode. So if currently unassigned
Unicode code points are not allowed in glibc's wchar_t
implementation, the check should be done, in my opinion, at the
mbtowc() level. Then even if mbtowc() calls iconv() internally,
iconv() can still make all of the defined Unicode code space
available.
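The layering I have in mind could be sketched roughly as follows. This is only an illustration of where the check would live, not how glibc is actually structured: all three function names are hypothetical, the functions work on already-decoded code points instead of bytes to keep the example short, and toy_is_assigned() is a deliberately tiny stand-in for a real table of assigned characters.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a real "assigned character" lookup (assumption:
   only ASCII and U+4E00..U+9FA5 count as assigned here).  */
static bool
toy_is_assigned (uint32_t cp)
{
  return cp < 0x80 || (cp >= 0x4E00 && cp <= 0x9FA5);
}

/* iconv-level layer (hypothetical): accepts the whole Unicode code
   space, rejecting only values that can never be characters.  */
static int
convert_lenient (uint32_t cp, uint32_t *out)
{
  if (cp > UINT32_C(0x10FFFF))
    return -1;                  /* beyond the Unicode code space */
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return -1;                  /* surrogates are not characters */
  *out = cp;
  return 0;
}

/* mbtowc-level layer (hypothetical): reuses the lenient converter,
   then applies the stricter "must be assigned" policy on top.  */
static int
mbtowc_strict (uint32_t cp, uint32_t *out)
{
  uint32_t tmp;
  if (convert_lenient (cp, &tmp) != 0)
    return -1;
  if (!toy_is_assigned (tmp))
    return -1;                  /* policy of the wchar_t layer only */
  *out = tmp;
  return 0;
}
```

With this split, an update to the table of assigned characters touches only the strict layer, and the converter itself stays unchanged.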
For the official GB18030 certification test, I think the
intention of including unassigned code points (even ones
unassigned in GB18030 itself) is also to make sure that the code
space defined in the GB18030 standard is properly supported, so
that future additions to GB18030 (actually to Unicode) will not
require applications to be updated as well in order to function
correctly. In the future, only font files would need to be updated
if glyph display is needed.
The current situation is just like hardwiring the 27,000+ defined
UniHan characters into an application (iconv in this case).
Any future addition to Unicode requires the application to be
altered. Hardwiring is bad in general, and unnecessary in
this particular case IMHO.
GB18030 is not just another charset but practically another
encoding form of Unicode. With this fact in mind, it seems
reasonable to expose all of the Unicode code space to GB18030
in the iconv() interface.
Because GB18030 will not mess up Unicode's code space, it should
not mess up the wchar_t used in glibc. But since glibc does not
allow unassigned Unicode code points in its wchar_t implementation,
the must-be-assigned criterion can be enforced inside wctomb().