glibc's iconv

Mike Mohr akihana@gmail.com
Sun Jul 5 10:23:00 GMT 2009


I just discovered Windows-31J/CP932, which resolves 4 of the 6
discrepancies (the Roman numerals).  However, the seemingly illegal
byte sequence 0x81 0x20 (which appears as a bullet in Firefox) is still
failing.  Any suggestions as to which character set I should be using?
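
For reference, the pair 0x81 0x20 fails a purely structural check.  In
Shift_JIS a double-byte character has a lead byte in 0x81-0x9F or
0xE0-0xFC and a trail byte in 0x40-0x7E or 0x80-0xFC, and 0x20 lies
outside both trail ranges.  Below is a minimal sketch of such a check in
C.  The function names are hypothetical, and the code only looks at byte
ranges; it does not say whether a given pair is assigned a character in
any particular Shift_JIS variant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Lead bytes of a double-byte character.  JIS X 0208 itself only needs
   0x81-0x9F and 0xE0-0xEF; 0xF0-0xFC are used by vendor extensions such
   as CP932, so they are accepted here as well. */
static bool sjis_is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Trail bytes of a double-byte character: 0x40-0x7E and 0x80-0xFC. */
static bool sjis_is_trail(unsigned char b)
{
    return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
}

/* Return the offset of the first lead byte that is not followed by a
   valid trail byte, or n if no such sequence exists.  Every non-lead
   byte is treated as a single-byte character (a simplification). */
static size_t sjis_first_invalid(const unsigned char *s, size_t n)
{
    assert(s != NULL || n == 0);

    size_t i = 0;
    while (i < n) {
        if (sjis_is_lead(s[i])) {
            if (i + 1 >= n || !sjis_is_trail(s[i + 1]))
                return i;   /* truncated or malformed pair */
            i += 2;
        } else {
            i += 1;
        }
    }
    return n;
}
```

Running sjis_first_invalid over each dictionary entry before calling
iconv would show whether a failure comes from malformed input or from a
well-formed pair that the chosen converter simply does not map.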

Thanks,
Mike

On Sun, Jul 5, 2009 at 12:05 AM, Mike Mohr <akihana@gmail.com> wrote:
> glibc developers,
>
> I am working on a library whose intended purpose is to convert a
> dictionary from a proprietary format into an open one.  The end result
> should be that a dictionary from www.babylon.com is usable in e.g.
> StarDict.  I'm using an English -> Japanese dictionary as a test case
> at the moment and I've run into a rather odd problem.  Certain
> multibyte characters in my input (which is ShiftJIS-encoded) cause
> iconv to return EILSEQ, but these characters appear to be valid
> characters in the ShiftJIS encoding.  Out of over 155,000 entries in
> my input only 6 exhibit this behavior and they all display correctly
> when viewed in Firefox.  I've attached a text file with the cases that
> fail; if you open it in Firefox it shows the original ShiftJIS input
> as well as the place where conversion failed.
>
> To quote from the libiconv website here:
>  http://www.gnu.org/software/libiconv/
> ..."To solve this mess, the Unicode encoding has been created. It is a
> super-encoding of all others and is therefore the default encoding for
> new text formats like XML."
>
> My conversion descriptors all have destination charset "utf8", so the
> iconv_open call looks like this:
>  cd = iconv_open("utf8", "sjis");
> which, when combined with the statement from your website, makes me
> think that any character from ShiftJIS would be encodable in UTF8 but
> not the other way around.  I found a post to a mailing list from 1999
> which is interesting:
>
> http://mail.nl.linux.org/linux-utf8/1999-11/msg00201.html
>
> I'm using Gentoo Linux with sys-libs/glibc-2.9_p20081201-r2 installed.
>  Can anyone shed light on why the conversion to utf8 is failing?  My
> gut feeling is that the input data is possibly nonstandard, or maybe
> it is some subtle variant of ShiftJIS.  If this is the case, is it
> possible to patch glibc to support the conversion?
>
> Thanks,
> Mike
>
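
The conversion described in the quoted message can be sketched roughly
as follows.  iconv_open takes the destination encoding first and the
source encoding second, and returns the descriptor.  convert_report is a
hypothetical helper written for this illustration, not part of glibc; it
converts one buffer to UTF-8 and reports the input offset at which iconv
stopped, so that an EILSEQ can be traced back to the offending bytes.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Convert in[0..in_len) from the encoding named by `from` to UTF-8.
   Returns 0 on success, -1 if iconv_open rejects the encoding name,
   or the errno value set by iconv (EILSEQ, EINVAL, E2BIG).
   *out_len receives the number of bytes written to out, and *stop_off
   the number of input bytes consumed before iconv stopped. */
static int convert_report(const char *from,
                          const char *in, size_t in_len,
                          char *out, size_t out_cap,
                          size_t *out_len, size_t *stop_off)
{
    assert(from != NULL && in != NULL && out != NULL);
    assert(out_len != NULL && stop_off != NULL);

    iconv_t cd = iconv_open("UTF-8", from);   /* (tocode, fromcode) */
    if (cd == (iconv_t)-1)
        return -1;

    char *ip = (char *)in;    /* iconv does not modify the input bytes */
    char *op = out;
    size_t ileft = in_len;
    size_t oleft = out_cap;
    int rc = 0;

    if (iconv(cd, &ip, &ileft, &op, &oleft) == (size_t)-1)
        rc = errno;           /* save before iconv_close can change it */
    else if (iconv(cd, NULL, NULL, &op, &oleft) == (size_t)-1)
        rc = errno;           /* flush any pending output state */

    *out_len = out_cap - oleft;
    *stop_off = in_len - ileft;

    iconv_close(cd);
    return rc;
}
```

Calling the helper once with "SJIS" and once with "CP932" on the same
entry makes it easy to compare where each converter gives up.  Which
encoding names are accepted depends on the installed iconv; on glibc
systems, iconv -l lists them.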
