This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.


Re: glibc's iconv


I just discovered Windows-31J/CP932, which resolves 4 of the 6
discrepancies (the Roman numerals).  However, the seemingly illegal
character code 0x8120 (which appears as a bullet in Firefox) still
fails to convert.  Any suggestions as to which character set I should
be using?
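
For anyone who wants to reproduce this, here is a minimal sketch of how
the offset of the first rejected byte sequence can be located.  The
helper name first_bad_offset and its return convention are made up for
this example; only iconv_open, iconv and iconv_close are real library
calls.  It relies on the documented behaviour that, on EILSEQ, iconv
leaves the input pointer at the start of the invalid sequence.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of glibc): try to convert `len` bytes of
 * `input` from `fromcode` to UTF-8.
 * Returns -1 if everything converted, the byte offset of the first
 * sequence rejected with EILSEQ, or -2 on any other failure (unknown
 * charset, truncated multibyte sequence at the end of the input, ...).
 */
long first_bad_offset(const char *fromcode, const char *input, size_t len)
{
    /* Note the argument order: destination charset first, then source. */
    iconv_t cd = iconv_open("UTF-8", fromcode);
    if (cd == (iconv_t)-1)
        return -2;

    char *in = (char *)input;   /* iconv never writes through this pointer */
    size_t inleft = len;
    long result = -1;

    while (inleft > 0) {
        char outbuf[256];       /* converted text is discarded */
        char *out = outbuf;
        size_t outleft = sizeof outbuf;

        if (iconv(cd, &in, &inleft, &out, &outleft) != (size_t)-1)
            break;              /* all remaining input was converted */
        if (errno == E2BIG)
            continue;           /* output buffer full: reuse it and go on */
        /* On EILSEQ, `in` points at the start of the offending sequence. */
        result = (errno == EILSEQ) ? (long)(in - input) : -2;
        break;
    }

    iconv_close(cd);
    return result;
}
```

Calling it on the same entry once with "SJIS" and once with "CP932" as
fromcode shows whether a given sequence is rejected by one converter,
both, or neither.  Which charset names are accepted depends on the
installed converters; `iconv -l` lists them.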

Thanks,
Mike

On Sun, Jul 5, 2009 at 12:05 AM, Mike Mohr <akihana@gmail.com> wrote:
> glibc developers,
>
> I am working on a library whose intended purpose is to convert a
> dictionary from a proprietary format into an open one.  The end result
> should be that a dictionary from www.babylon.com is usable in e.g.
> StarDict.  I'm using an English -> Japanese dictionary as a test case
> at the moment and I've run into a rather odd problem.  Certain
> multibyte characters in my input (which is ShiftJIS-encoded) cause
> iconv to return EILSEQ, but these characters appear to be valid
> characters in the ShiftJIS encoding.  Out of over 155,000 entries in
> my input only 6 exhibit this behavior and they all display correctly
> when viewed in Firefox.  I've attached a text file with the cases that
> fail; if you open it in Firefox it shows the original ShiftJIS input
> as well as the place where conversion failed.
>
> To quote from the libiconv website here:
>   http://www.gnu.org/software/libiconv/
> ..."To solve this mess, the Unicode encoding has been created. It is a
> super-encoding of all others and is therefore the default encoding for
> new text formats like XML."
>
> My conversion descriptors all have destination charset "utf8", so the
> iconv_open call looks like this:
>   cd = iconv_open("utf8", "sjis");
> which, when combined with the statement from your website, makes me
> think that any character from ShiftJIS would be encodable in UTF8 but
> not the other way around.  I found a post to a mailing list from 1999
> which is interesting:
>
> http://mail.nl.linux.org/linux-utf8/1999-11/msg00201.html
>
> I'm using Gentoo Linux with sys-libs/glibc-2.9_p20081201-r2 installed.
> Can anyone shed light on why the conversion to utf8 is failing?  My
> gut feeling is that the input data is possibly nonstandard, or maybe
> it is some subtle variant of ShiftJIS.  If this is the case, is it
> possible to patch glibc to support the conversion?
>
> Thanks,
> Mike
>

