This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.



glibc's iconv


glibc developers,

I am working on a library whose intended purpose is to convert a
dictionary from a proprietary format into an open one.  The end result
should be that a dictionary from www.babylon.com is usable in e.g.
StarDict.  I'm using an English -> Japanese dictionary as a test case
at the moment and I've run into a rather odd problem.  Certain
multibyte characters in my input (which is ShiftJIS-encoded) cause
iconv to return EILSEQ, but these characters appear to be valid
characters in the ShiftJIS encoding.  Out of over 155,000 entries in
my input only 6 exhibit this behavior and they all display correctly
when viewed in Firefox.  I've attached a text file with the cases that
fail; if you open it in Firefox it shows the original ShiftJIS input
as well as the place where conversion failed.

To quote from the libiconv website here:
  http://www.gnu.org/software/libiconv/
..."To solve this mess, the Unicode encoding has been created. It is a
super-encoding of all others and is therefore the default encoding for
new text formats like XML."

My conversion descriptors all have destination charset "utf8", so the
iconv_open call looks like this:
  cd = iconv_open("utf8", "sjis");
which, when combined with the statement from the libiconv website,
makes me think that any character from ShiftJIS would be encodable in
UTF8 but not the other way around.  I found a post to a mailing list
from 1999 which is interesting:

http://mail.nl.linux.org/linux-utf8/1999-11/msg00201.html

I'm using Gentoo Linux with sys-libs/glibc-2.9_p20081201-r2 installed.
Can anyone shed light on why the conversion to utf8 is failing?  My
gut feeling is that the input data is possibly nonstandard, or maybe
it is some subtle variant of ShiftJIS.  If this is the case, is it
possible to patch glibc to support the conversion?
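
For reference, here is a minimal sketch of how the failing position
can be located.  It is not the code from my library: the helper name
convert_or_locate and its parameters are made up for illustration, and
it assumes the whole entry and its UTF-8 output fit in the buffers
passed in.  When iconv() fails, it leaves the input pointer at the
start of the sequence it could not convert, so the offset of the
offending bytes can be read back directly.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (illustration only): convert in[0..in_len) from the
   charset named by `from` to UTF-8.  Returns 0 on success.  Returns -1 on
   failure with errno set; if iconv() itself failed, *bad_off receives the
   offset of the first input byte that was not converted. */
static int convert_or_locate(const char *from, const char *in, size_t in_len,
                             char *out, size_t out_cap,
                             size_t *out_len, size_t *bad_off)
{
    /* iconv_open takes (tocode, fromcode) and returns the descriptor. */
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = (char *)in;   /* iconv() wants char **, but does not modify input */
    char *outp = out;
    size_t in_left = in_len;
    size_t out_left = out_cap;
    int rc = 0;
    int saved_errno = 0;

    if (iconv(cd, &inp, &in_left, &outp, &out_left) == (size_t)-1) {
        saved_errno = errno;              /* EILSEQ, EINVAL or E2BIG */
        rc = -1;
        *bad_off = (size_t)(inp - in);    /* where conversion stopped */
    }

    *out_len = out_cap - out_left;        /* bytes written before stopping */
    iconv_close(cd);
    if (rc != 0)
        errno = saved_errno;              /* iconv_close may have changed it */
    return rc;
}
```

Because the source charset is a parameter, the same entry can be retried
with a different converter name (for example "CP932", if the installed
iconv provides it) to test the guess that the data is a ShiftJIS
variant.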

Thanks,
Mike

Attachment: errors.txt.gz
Description: GNU Zip compressed data

