Bug 1124

Summary: iconv incorrectly converts cp1255
Product: glibc Reporter: duv <z9u2k>
Component: libcAssignee: GOTO Masanori <gotom>
Status: RESOLVED INVALID    
Severity: normal CC: glibc-bugs, redhat-bugzilla
Priority: P2 Flags: fweimer: security-
Version: 2.3.5   
Target Milestone: ---   
Host: Target:
Build: Last reconfirmed:

Description duv 2005-07-23 19:58:36 UTC
When converting CP1255 to UTF-8, the last letter in the output buffer is
missing, if inbytesleft is passed without couting the terminating null.

Assuming I have a five letter cp1255 string, the first four letters will be
converted correctly, but the last letter will be missing (though inbytesleft is
zero after iconv).

The problem does not happen if inbytesleft is 6, counting the terminating null
of inbuf.

I've attached a code that reproduces the case.

Doing the same conversion on cp1253, giving 5 as the length, produces the
correct string, so I assume there's a problem with cp1255 specificly.

Simply change CP1255 to CP1253 and watch the last letter appear...

Reproduction code:
#include <stdio.h>
#include <iconv.h>

void tohex(char *str) {
    char *ch;
    for (ch = str ; *ch ; ch++) {
        printf("%02x ", *ch & 0xff);
    }
    printf("\n");
}

int main() {
    // A five-letter hebrew word (cp1255) - null terminated
    char buf[] = { 0xf4, 0xe5, 0xf8, 0xe5, 0xed, 0x00 };

    char outbuf[255];

    char *cp1255 = buf;
    char *utf8 = outbuf;

    size_t bytes_read = 5, bytes_written = 255;
    iconv_t the_iconv = iconv_open("UTF8", "CP1255");
    iconv(the_iconv, &cp1255, &bytes_read, &utf8, &bytes_written);

    printf("buf: ");
    tohex(buf);

    printf("bytes_read: %d\n", bytes_read);
    printf("bytes_written: %d\n", bytes_written);
    printf("outbuf: ");
    tohex(outbuf);

    return 0;
}
Comment 1 Ilya Konstantinov 2005-07-30 14:09:31 UTC
Curiously, ISO-8859-8 also doesn't have this bug. The weird custom code in
CP1255 (unlike the boilerplate code in the rest of the CPs) is probably to blame.
Comment 2 Ilya Konstantinov 2005-07-31 20:44:54 UTC
From a look into the code today, it turns out the problem origins from the fact
that CP-1255 is a stateful encoding. Upon taking a Hebrew letter from input,
it'll buffer it and take the next letter. If the next letter is a Nikud (Hebrew
diacritic mark) one, it'll try to find a Unicode codepoint which can represent
the Letter+Nikud sequence as a single Unicode codepoint and then use it.

In short, it always buffers one letter back. Therefore, unless the string is
null-terminated (thus having the null flush the last character), the last
character will remain in the "state" storage and will never be flushed as a
Unicode codepoint to the output.

It doesn't perform a flush of the state at the end, probably since iconv is a
stream recoder and therefore the next call might provide the continuation of the
stream (e.g. the first part of the stream ends with a Letter and the next part
begins with a Nikud, and they can be joined as a single Unicode codepoint).
There's no way to tell iconv "This is the last chunk of the stream, so flush away".

Solutions? Either:
1. Always flush the last character. (This is not a biggie. We will use the
non-composed form of Unicode codepoint for letter + Unicode codepoint for Nikud
in this case.)
2. Make iconv with NUL-input and non-NUL output perform this flush.

What you say?
Comment 3 duv 2005-07-31 21:09:58 UTC
Is cp1255 the only encoding that's stateful?

I would advise against hiding the flush decision inside iconv() - some users
might need different flush behaviors, and forcing a single one might cause
problems in the future.

Instead, I would suggest a method for the user to decide whether he likes a
flush and in which fashion (but leaving the default behaviour as it is today for
backward-compatibility).

This can be obtained using a function parameter with a default value to iconv(),
or a manipulation function on iconv_t that does that (something like
iconv_flush_policy, iconv_flush_strategy... whatever).

I haven't gone into the iconv code, the coding conventions for the library or
encoding related details, so my ideas might be way unacceptable by this group,
but these are my 2c to give :)
Comment 4 Ilya Konstantinov 2005-07-31 21:19:34 UTC
> I would advise against hiding the flush decision inside iconv() - some users
might need different flush behaviors, and forcing a single one might cause
problems in the future.

Fair enough. iconv(3) manpage doesn't provide an explicit flush method, but it
does have something with similar semantics:
       A different case is when inbuf is NULL or *inbuf is NULL, but outbuf is
       not NULL and *outbuf is not NULL. In  this  case,  the  iconv  function
       attempts  to set cd’s conversion state to the initial state and store a
       corresponding shift sequence at *outbuf.  At most *outbytesleft  bytes,
       starting at *outbuf, will be written.
Except for storing a "shift sequence" (probably something related to Far-Eastern
encodings), this is just what we want to do: to reset the state.

Thus, the solution would be to mandate all iconv users to call iconv(cd, NULL,
NULL, &outbuf, &outlen) once more after they finished their conversion loop,
just to flush out remaining data.

I don't think we can add more functions to iconv at this point. iconv conforms
to certain standards which we better not extend.
Comment 5 Bruno Haible 2005-08-01 12:11:27 UTC
In the original code snippet, there's a call to 
 
     iconv(the_iconv, NULL, NULL, &utf8, &bytes_written); 
 
missing at the end. 
 
Ilya Konstantinov wrote: 
> Thus, the solution would be to mandate all iconv users to call iconv(cd, NULL, 
> NULL, &outbuf, &outlen) once more after they finished their conversion loop, 
> just to flush out remaining data. 
 
Absolutely. That's how iconv() is meant to be used.