This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.


[Bug libc/12830] ISO-2022-JP-2 maps C1 control characters incorrectly


http://sourceware.org/bugzilla/show_bug.cgi?id=12830

--- Comment #2 from G. Halkes <glibcbugz at ghalkes dot nl> 2011-09-20 13:20:34 UTC ---
Test case, in bash, using the GNU libc iconv (it converts the C1 control
character U+0081):

echo -e -n '\x00\x81' | iconv -f UTF-16BE -t ISO-2022-JP-2 | od -t x1

result:

0000000 1b 2e 41 1b 4e 01
0000006

expected result:

0000000 1b 41
0000002

The standard I base my opinion on is ECMA-35, which can be found at
http://www.ecma-international.org/publications/standards/Ecma-035.htm and
which, according to ECMA itself, is "fully identical with International
Standard ISO/IEC 2022:1994". Unlike the ISO/IEC 2022 specification, however,
the ECMA-35 specification is freely available.

Specifically, section 9 discusses the structure of 7-bit codes, such as
ISO-2022-JP-2. It references section 7.2, which defines G0 through G3 as well
as C0 and C1. In the specification of the graphic sets G0 through G3, it notes
that they use "column numbers" 02 through 07, i.e. byte values between 0x20
and 0x7f. C1 codes are defined to be represented either by column numbers 08
and 09 (byte values 0x80 through 0x9f) or by ESC Fe. The meaning of Fe is
explained in sections 6.4.3 and 13.2; it basically means a byte in the range
0x40 - 0x5f.

In my reading of the standard, changing GL to one of G0 through G3, using any
of the shift mechanisms, has no impact on the control codes in CL (the range
0x00 through 0x1f). Therefore, the generated sequence is incorrect: its final
byte is still interpreted as a control code, so the sequence is essentially
equal to the sequence "01".
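
As a rough illustration of this reading, the generated bytes can be annotated
unit by unit. This is only a sketch: annotate is a hypothetical helper, it
recognizes just the sequences that occur in the test case, and the meaning of
"1b 2e 41" (designating the ISO-8859-1 high part to G2) is taken from RFC 1554
rather than from ECMA-35.

```shell
# Hypothetical helper: label the units in a space-separated hex dump.
# It only knows the sequences that occur in the test case.
annotate() {
    set -- $1                      # split the dump into one argument per byte
    while [ "$#" -gt 0 ]; do
        if [ "$1" = 1b ] && [ "${2:-}" = 2e ] && [ "${3:-}" = 41 ]; then
            # Assumption from RFC 1554: ESC 2/14 4/1 designates Latin-1 to G2.
            echo "1b 2e 41: designate ISO-8859-1 high part to G2"
            shift 3
        elif [ "$1" = 1b ] && [ "${2:-}" = 4e ]; then
            echo "1b 4e: SS2, ESC Fe form of U+008E"
            shift 2
        elif [ "$((0x$1))" -lt 32 ]; then
            # Bytes 0x00-0x1f are in CL and stay control codes after a shift.
            echo "$1: C0 control U+00$1"
            shift
        else
            echo "$1: other byte"
            shift
        fi
    done
}

annotate "1b 2e 41 1b 4e 01"
```

The last line printed for the test case output is "01: C0 control U+0001",
which is the unit I consider wrong.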

Because columns 08 and 09 are not used in a 7-bit code such as ISO-2022-JP-2,
the encoder has to use the ESC Fe construct to represent C1 control codes.
Thus the correct sequence would be "1b 41". This corresponds to how the
control character SS2 (U+008E) from C1 is already encoded in the example
(i.e. "1b 4e"). See also Figure 8 on page 22 for a graphical representation
of the structure of a 7-bit code.
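
The ESC Fe mapping described here can be sketched in a few lines of shell.
This is a minimal illustration under my reading of ECMA-35, not code from
glibc; c1_to_esc_fe is a hypothetical helper name, and it takes the code
point as bare hex digits.

```shell
# Hypothetical helper: print the 7-bit ESC Fe form of a C1 control.
# Argument: the code point as hex digits without a prefix, e.g. 81.
c1_to_esc_fe() {
    cp=$((0x$1))
    if [ "$cp" -lt 128 ] || [ "$cp" -gt 159 ]; then
        echo "not a C1 control: $1" >&2
        return 1
    fi
    # Fe is the C1 byte minus 0x40, so 0x80-0x9f maps onto 0x40-0x5f.
    printf '1b %02x\n' "$((cp - 64))"
}

c1_to_esc_fe 81   # U+0081, the character from the test case
c1_to_esc_fe 8e   # U+008E (SS2)
```

The two calls print "1b 41" and "1b 4e", matching the expected result and the
SS2 encoding already present in the iconv output.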

I hope this sufficiently clarifies my previous report.
