This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.



New GB18030 gconv module contributed by ThizLinux Laboratory


Hello all!

While testing GB18030 support with some sample GB18030 text files
supplied by the Chinese IT Standardization Technical Committee, we at
ThizLinux Laboratory discovered that the GB18030 gconv module was
unable to handle certain ranges, most notably in the User-Defined Area,
but it seems /usr/bin/iconv discovered problems in other ranges too. 
Here is the result with GB18030.so in glibc-2.2.4:

  for i in *.txt; do iconv -f gb18030 -t gb18030 "$i" > /dev/null && echo "Passed: $i" || echo "Failed: $i"; done | less

    Passed: single.txt, double3.txt

    Failed: double1.txt, double2.txt, double4.txt, double5.txt,
	    four.txt, Meng.txt, Yi.txt, Wei.txt, Zang.txt,
	    user1.txt, user2.txt, user3.txt, random.txt

  (Note: Meng = Mongolian, Yi = Yi, Wei = Uyghur, Zang = Tibetan)

Last week, a colleague informed us that AbiWord could not display the
UDA characters in random.txt properly.  Normally, each one should show
as a square box, but instead AbiWord treats them as if they did not
exist.  It turns out that the original gb18030.c did not cover the
UDAs.  However,
for GNU systems to meet the Chinese government's standard, the entire
GB18030 charset, corresponding to U+0000..U+D7FF, U+E000..U+10FFFF
should be supported.  (U+D800..U+DFFF is the surrogate area and is
intentionally left out, otherwise GB18030 would become a weird 8-byte
standard. :-)  GB18030 is actually very similar to UTF-8
in that it is an encoding of ISO-10646-1.  It is designed so that a
one-to-one mapping between GB18030 and ISO-10646-1/Unicode is possible.

Attached is a new gb18030.c to solve the aforementioned problem.  To
make a long story short, in order to add GB18030 support to KDE, we had
to add a GB18030 codec.  Unfortunately, due to licensing issues, we
could not use the original gb18030.c from glibc-2.2.4, and so we rolled
our own using excellent information from the IBM ICU site (Markus
Scherer of IBM) and from Dirk Meyer's GB18030 summary (Adobe).  It sure
took a while, but it was fun!  ;-) A fully working version was
completed on November 26, 2001 (qgb18030codec.cpp).

To solve the problems exhibited in AbiWord, which was really a glibc
issue, we finally ported our GB18030 code to glibc.  Much of the C code
by Sean Chen, Ulrich and Bruno is preserved, but the algorithms and
tables have been replaced with new ones.  The new module has been tested
against the sample files provided by the Chinese IT Standardization
Technical Committee using roundtrip conversion:
   "iconv -f gb18030 -t ucs2 | iconv -f ucs2 -t gb18030"

I also used a quick-and-dirty C program (attached as gen-ucs4-data.c)
to generate U+0000..U+10FFFF (sans surrogate area) on ix86
and piped the output to
   "iconv -f ucs4 -t gb18030 | iconv -f gb18030 -t ucs4"

I have put a whole bunch of GB18030-related stuff on:

    http://people.debian.org/~foka/gb18030/
	gb18030.c
	gen-ucs4-data.c
	gen-glibc-gb18030.pl
	gb-18030-2000.xml
	samples.zip   (sample GB18030 text files for testing
			GB18030-compliance)

The Perl script that generates the source is "gen-glibc-gb18030.pl".
If you want to play with it, make sure you grab gb-18030-2000.xml too.

The current gb18030.c by ThizLinux Laboratory, dated 15 Jan 2002,
when unpacked, has a file size of 590346 bytes and
an md5sum of 5c5c2f03f5d9a16ca0edea066b6145ac.

And don't worry, the GB18030.so actually shrank from around 180K to
around 130K, despite the extended coverage.  ;-)
So, please test and apply.  Perhaps you could add some
optimization, e.g. adding some more __builtin_expect etc. to make it
even more efficient.  It would be wonderful if this new gb18030.c
could be placed in the upcoming glibc-2.2.5 so that glibc becomes truly
GB18030 compliant.

Last but not least, special thanks to Bruno, Ulrich, Sean Chen, Suzhe
(James Su), WANG Shouhua, WU Jian, Dirk Meyer, Markus Scherer et al.
who have pioneered GB18030 development on GNU/Linux systems.

Best regards,

Anthony Fok
System Engineer
ThizLinux Laboratory Ltd.

-- 
Anthony Fok Tung-Ling
ThizLinux Laboratory   <anthony@thizlinux.com> http://www.thizlinux.com/
Debian Chinese Project <foka@debian.org>       http://www.debian.org/intl/zh/
Come visit Our Lady of Victory Camp!           http://www.olvc.ab.ca/

Attachment: gb18030.c.bz2
Description: Binary data

/*
 * This program generates data <U0000>..<UD7FF>, <UE000>..<U10FFFF>
 * in UCS4 format for testing GB18030.so gconv module.
 * Anthony Fok <anthony@thizlinux.com>  Jan 16, 2002
 */

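#include <stdio.h>

int main(void) {
    unsigned long int i;

    for (i = 0; i <= 0x10FFFF; i++) {
	/* Skip the surrogate area */
	if (i >= 0xD800 && i <= 0xDFFF)
	    continue;
	/* Write the code point as four bytes, most significant first.
	   putchar() is used because passing an unsigned long to "%c"
	   is undefined behavior. */
	putchar((int) (i >> 24 & 0xFF));
	putchar((int) (i >> 16 & 0xFF));
	putchar((int) (i >> 8 & 0xFF));
	putchar((int) (i & 0xFF));
    }
    return 0;
}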
