The conversion tables for IBM943 and IBM942 are incorrect for iconv. The byte values 1A, 1C and 7F do not round trip to Unicode (UTF-8) and back to these Shift-JIS codepages. Normally Unicode 1A round-trip maps to Shift-JIS 7F, Unicode 7F round-trip maps to Shift-JIS 1C, and Unicode 1C round-trip maps to Shift-JIS 1A. iconv does not have this behavior. For example, iconv behaves as follows: Unicode 1F converts to Shift-JIS 1C, and Shift-JIS 1C converts to Unicode 1A.
If you would like the mapping tables generated from IBM's official repository of coded character sets, I recommend you look at these tables and use them as the basis for iconv:
http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/charset/data/ucm/ibm-942_P12A-1999.ucm
http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/charset/data/ucm/ibm-943_P15A-2003.ucm
For reference, here are other tables that can be used for the same CCSID (coded character set identifier):
http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/charset/data/ucm/ibm-942_P120-1999.ucm
http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/charset/data/ucm/ibm-942_P12A-1998.ucm
http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/charset/data/ucm/ibm-943_P130-1999.ucm
http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/charset/data/ucm/ibm-943_P14A-1998.ucm
http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/charset/data/ucm/ibm-943_P14A-1999.ucm
(Full disclosure) I work for IBM, and I am a part of the ICU project.
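To make the round-trip expectation concrete, here is a minimal sketch in Python of the kind of check being described. The expected table is taken from the three mappings named in the report; the "mismatched" table is a hypothetical illustration of a to-Unicode table that is not the inverse of the from-Unicode table, not a copy of the actual glibc data. The function names are likewise made up for this example.

```python
# Expected round-trip mappings named in the report:
# Unicode code point -> Shift-JIS byte.
EXPECTED_FROM_UNICODE = {0x1A: 0x7F, 0x7F: 0x1C, 0x1C: 0x1A}


def invert(table):
    """Return the inverse of a one-to-one mapping table."""
    return {value: key for key, value in table.items()}


def round_trip_failures(from_unicode, to_unicode):
    """List (code point, byte, code point after converting back) for every
    code point that does not survive Unicode -> codepage -> Unicode."""
    failures = []
    for code_point, byte in sorted(from_unicode.items()):
        back = to_unicode.get(byte)
        if back != code_point:
            failures.append((code_point, byte, back))
    return failures


# A consistent pair of tables: the to-Unicode table is the exact inverse.
consistent = round_trip_failures(EXPECTED_FROM_UNICODE,
                                 invert(EXPECTED_FROM_UNICODE))

# Hypothetical mismatch (an assumption for illustration only): the
# to-Unicode table repeats the from-Unicode direction instead of inverting it.
mismatched = round_trip_failures(EXPECTED_FROM_UNICODE,
                                 dict(EXPECTED_FROM_UNICODE))

for code_point, byte, back in mismatched:
    print(f"U+{code_point:04X} -> \\x{byte:02X} -> U+{back:04X}")
```

With a consistent pair of tables the failure list is empty; with the hypothetical mismatched pair, all three control characters fail to come back unchanged.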
We don't have an ibm942 conversion module, and the ibm943 module was generated by IBM. I have no reason to believe the ICU tables more than those used to generate the module. I'll leave the bug open; maybe the module author will comment. If this doesn't happen I'll close it sometime soon.
(In reply to comment #1) You're right. iconv doesn't have ibm-942. I meant ibm-932. Sorry about that. The ibm-* tables from ICU's charset repository are generated directly from IBM's CDRA. I'm sure that the ibm943 iconv module was also generated from IBM data, but this seems to be a typo in the iconv module. The main issue is not whether \u007F goes to \x7F or \x1C. Both mapping behaviors are considered valid in IBM's CDRA. The problem is that those bytes don't map back to the original Unicode character. You have to run your data through the round-trip conversion three times to get your original data back.
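The "three times" symptom can be illustrated with a small Python sketch that counts how many Unicode -> codepage -> Unicode passes it takes for a character to return to itself. The tables here are assumptions chosen to reproduce that symptom (a to-Unicode table that is not the inverse of the from-Unicode table); they are not extracted from the glibc module, and `round_trips_to_restore` is a hypothetical helper name.

```python
# Unicode code point -> codepage byte, as in the mappings discussed above.
FROM_UNICODE = {0x1A: 0x7F, 0x7F: 0x1C, 0x1C: 0x1A}

# Hypothetical broken table (assumption): same direction as FROM_UNICODE
# rather than its inverse.
BROKEN_TO_UNICODE = dict(FROM_UNICODE)

# Correct table: the exact inverse of FROM_UNICODE.
CORRECT_TO_UNICODE = {byte: cp for cp, byte in FROM_UNICODE.items()}


def round_trips_to_restore(code_point, from_unicode, to_unicode, limit=16):
    """Count the round-trip conversions needed before code_point maps back
    to itself. Returns None if that does not happen within `limit` passes."""
    current = code_point
    for count in range(1, limit + 1):
        current = to_unicode[from_unicode[current]]
        if current == code_point:
            return count
    return None


for cp in sorted(FROM_UNICODE):
    broken = round_trips_to_restore(cp, FROM_UNICODE, BROKEN_TO_UNICODE)
    correct = round_trips_to_restore(cp, FROM_UNICODE, CORRECT_TO_UNICODE)
    print(f"U+{cp:04X}: broken tables need {broken}, correct tables need {correct}")
```

Under these assumed tables, each of the three characters needs three passes with the mismatched pair and a single pass with the inverse pair.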
It's pointless to argue here. Talk to the author of the modules. I'm suspending the bug until that has happened.
(In reply to comment #3) Since I am unfamiliar with the authors of the module, where, or to whom, should I be reporting this problem?
(In reply to comment #4) George Rhoten, you can find the authors of the conversion modules in the glibc source files and ChangeLogs. For both iconvdata/ibm932.c and iconvdata/ibm943.c, it is Masahide Washizawa <washi@jp.ibm.com>.
No reply in almost 4 months. Reopen if you get real information.
I have contacted the writer of this code, and he has created a patch to fix the code.
Quote from Masahide Washizawa, "I have just sent the patch to Ulrich-san who is the glibc maintainer, and he applied it to the glibc tree immediately." So if the patch is applied, then I'm happy.
The patch was applied to CVS. I am closing the bug.