When converting CP1255 to UTF-8, the last letter in the output buffer is missing, if inbytesleft is passed without couting the terminating null. Assuming I have a five letter cp1255 string, the first four letters will be converted correctly, but the last letter will be missing (though inbytesleft is zero after iconv). The problem does not happen if inbytesleft is 6, counting the terminating null of inbuf. I've attached a code that reproduces the case. Doing the same conversion on cp1253, giving 5 as the length, produces the correct string, so I assume there's a problem with cp1255 specificly. Simply change CP1255 to CP1253 and watch the last letter appear... Reproduction code: #include <stdio.h> #include <iconv.h> void tohex(char *str) { char *ch; for (ch = str ; *ch ; ch++) { printf("%02x ", *ch & 0xff); } printf("\n"); } int main() { // A five-letter hebrew word (cp1255) - null terminated char buf[] = { 0xf4, 0xe5, 0xf8, 0xe5, 0xed, 0x00 }; char outbuf[255]; char *cp1255 = buf; char *utf8 = outbuf; size_t bytes_read = 5, bytes_written = 255; iconv_t the_iconv = iconv_open("UTF8", "CP1255"); iconv(the_iconv, &cp1255, &bytes_read, &utf8, &bytes_written); printf("buf: "); tohex(buf); printf("bytes_read: %d\n", bytes_read); printf("bytes_written: %d\n", bytes_written); printf("outbuf: "); tohex(outbuf); return 0; }
Curiously, ISO-8859-8 also doesn't have this bug. The weird custom code in CP1255 (unlike the boilerplate code in the rest of the CPs) is probably to blame.
From a look into the code today, it turns out the problem origins from the fact that CP-1255 is a stateful encoding. Upon taking a Hebrew letter from input, it'll buffer it and take the next letter. If the next letter is a Nikud (Hebrew diacritic mark) one, it'll try to find a Unicode codepoint which can represent the Letter+Nikud sequence as a single Unicode codepoint and then use it. In short, it always buffers one letter back. Therefore, unless the string is null-terminated (thus having the null flush the last character), the last character will remain in the "state" storage and will never be flushed as a Unicode codepoint to the output. It doesn't perform a flush of the state at the end, probably since iconv is a stream recoder and therefore the next call might provide the continuation of the stream (e.g. the first part of the stream ends with a Letter and the next part begins with a Nikud, and they can be joined as a single Unicode codepoint). There's no way to tell iconv "This is the last chunk of the stream, so flush away". Solutions? Either: 1. Always flush the last character. (This is not a biggie. We will use the non-composed form of Unicode codepoint for letter + Unicode codepoint for Nikud in this case.) 2. Make iconv with NUL-input and non-NUL output perform this flush. What you say?
Is cp1255 the only encoding that's stateful? I would advise against hiding the flush decision inside iconv() - some users might need different flush behaviors, and forcing a single one might cause problems in the future. Instead, I would suggest a method for the user to decide whether he likes a flush and in which fashion (but leaving the default behaviour as it is today for backward-compatibility). This can be obtained using a function parameter with a default value to iconv(), or a manipulation function on iconv_t that does that (something like iconv_flush_policy, iconv_flush_strategy... whatever). I haven't gone into the iconv code, the coding conventions for the library or encoding related details, so my ideas might be way unacceptable by this group, but these are my 2c to give :)
> I would advise against hiding the flush decision inside iconv() - some users might need different flush behaviors, and forcing a single one might cause problems in the future. Fair enough. iconv(3) manpage doesn't provide an explicit flush method, but it does have something with similar semantics: A different case is when inbuf is NULL or *inbuf is NULL, but outbuf is not NULL and *outbuf is not NULL. In this case, the iconv function attempts to set cd’s conversion state to the initial state and store a corresponding shift sequence at *outbuf. At most *outbytesleft bytes, starting at *outbuf, will be written. Except for storing a "shift sequence" (probably something related to Far-Eastern encodings), this is just what we want to do: to reset the state. Thus, the solution would be to mandate all iconv users to call iconv(cd, NULL, NULL, &outbuf, &outlen) once more after they finished their conversion loop, just to flush out remaining data. I don't think we can add more functions to iconv at this point. iconv conforms to certain standards which we better not extend.
In the original code snippet, there's a call to iconv(the_iconv, NULL, NULL, &utf8, &bytes_written); missing at the end. Ilya Konstantinov wrote: > Thus, the solution would be to mandate all iconv users to call iconv(cd, NULL, > NULL, &outbuf, &outlen) once more after they finished their conversion loop, > just to flush out remaining data. Absolutely. That's how iconv() is meant to be used.