This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.
glibc developers,

I am working on a library whose purpose is to convert a dictionary from a proprietary format into an open one. The end result should be that a dictionary from www.babylon.com is usable in e.g. StarDict.

I'm using an English -> Japanese dictionary as a test case at the moment, and I've run into a rather odd problem. Certain multibyte characters in my input (which is ShiftJIS-encoded) cause iconv to fail with EILSEQ, but these characters appear to be valid in the ShiftJIS encoding. Out of over 155,000 entries in my input, only 6 exhibit this behavior, and they all display correctly when viewed in Firefox. I've attached a text file with the cases that fail; if you open it in Firefox, it shows the original ShiftJIS input as well as the place where conversion failed.

To quote from the libiconv website (http://www.gnu.org/software/libiconv/):

"To solve this mess, the Unicode encoding has been created. It is a super-encoding of all others and is therefore the default encoding for new text formats like XML."

My conversion descriptors all have the destination charset "utf8", so the iconv_open call looks like this:

cd = iconv_open("utf8", "sjis");

Combined with the statement from the website, this makes me think that any character from ShiftJIS should be encodable in UTF8, though not necessarily the other way around.

I found a post to a mailing list from 1999 which is interesting: http://mail.nl.linux.org/linux-utf8/1999-11/msg00201.html

I'm using Gentoo Linux with sys-libs/glibc-2.9_p20081201-r2 installed.

Can anyone shed light on why the conversion to utf8 is failing? My gut feeling is that the input data is possibly nonstandard, or that it is some subtle variant of ShiftJIS. If this is the case, is it possible to patch glibc to support the conversion?

Thanks,
Mike
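
To pin down exactly which input byte triggers EILSEQ, a check along the following lines may help. This is only a minimal sketch, not the code from the library described above: the helper name `convert_report` and all of its parameters are hypothetical, and the only library calls assumed are the standard `iconv_open`, `iconv`, and `iconv_close`.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert `inlen` bytes at `in` from charset `from`
 * to charset `to`, writing at most `outcap` bytes to `out`.
 * Returns 0 on success, or the errno value of the failing call
 * (e.g. EILSEQ for an invalid input sequence, E2BIG if `out` is full).
 * *outlen receives the number of bytes written; *in_offset receives the
 * number of input bytes consumed, which on EILSEQ is the offset of the
 * first byte of the rejected sequence. */
static int convert_report(const char *to, const char *from,
                          const char *in, size_t inlen,
                          char *out, size_t outcap,
                          size_t *outlen, size_t *in_offset)
{
    *outlen = 0;
    *in_offset = 0;

    /* iconv_open takes (tocode, fromcode) and returns the descriptor. */
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return errno;

    char *inp = (char *)in;   /* iconv does not modify the input bytes */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;
    int err = 0;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        err = errno;          /* save before iconv_close can change it */

    *outlen = outcap - outleft;
    *in_offset = inlen - inleft;

    iconv_close(cd);
    return err;
}
```

Called with `to` set to "utf8" and `from` set to "sjis", a return value of EILSEQ together with `*in_offset` would identify the bytes to inspect in each failing entry, for example by printing them in hexadecimal and looking them up in the ShiftJIS tables.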
Attachment:
errors.txt.gz
Description: GNU Zip compressed data