Bug 13541 - iconv //IGNORE charsets are inconsistent about INBUF* state after EILSEQ
Status: NEW
Alias: None
Product: glibc
Classification: Unclassified
Component: locale
Version: 2.14
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
 
Reported: 2011-12-23 01:58 UTC by Edward Z. Yang
Modified: 2017-05-15 10:01 UTC

fweimer: security-


Attachments
HTMLPurifier.standalone (148.52 KB, patch)
2017-02-28 17:07 UTC, Tigran
HTMLPurifier.standalone (147.46 KB, application/x-php)
2017-02-28 17:17 UTC, Tigran

Description Edward Z. Yang 2011-12-23 01:58:34 UTC
The iconv infopage says the following:

    `EILSEQ'
          The conversion stopped because of an invalid byte sequence in
          the input.  After the call, `*INBUF' points at the first byte
          of the invalid byte sequence.

However, this is clearly not the case when the target charset is given with the //IGNORE suffix:

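    #include <iconv.h>
    #include <string.h>
    #include <stdio.h>
    #include <errno.h>
    int main(void) {
        iconv_t i = iconv_open("ascii//IGNORE", "utf-8");
        char inbuf[10000];
        char outbuf[10000];
        char *in = inbuf;
        char *out = outbuf;
        size_t inleft = sizeof inbuf;   /* iconv() takes size_t *, not int * */
        size_t outleft = sizeof outbuf;
        size_t s;
        if (i == (iconv_t)-1) {
            perror("iconv_open");
            return 1;
        }
        memset(inbuf, 0x77, sizeof inbuf);
        /* C2 A2 is U+00A2 (cent sign): valid UTF-8, not representable in ASCII */
        inbuf[0] = (char)0xC2;
        inbuf[1] = (char)0xA2;
        s = iconv(i, &in, &inleft, &out, &outleft);
        printf("s = %ld, errno = %d, in[0] = %x, inleft = %zu\n",
               (long)s, errno, (unsigned char)*in, inleft);
        iconv_close(i);
        return 0;
    }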

Outputs the following:

    s = -1, errno = 84, in[0] = 77, inleft = 1839

'iconv' appears to have consumed another ~8000 bytes after the invalid byte sequence (10000 - 1839 = 8161 bytes in total) before returning EILSEQ (84), so *INBUF no longer points at the offending sequence.

The documentation here cannot possibly be correct if we want //IGNORE to actually do anything. So we have two options:

1. Claim that the semantics of EILSEQ change when the magic //IGNORE flag is specified, and require user code to work around it properly. This is what the '-c' flag in iconv_prog.c does, by magically "converting" these errors into E2BIG errors, and re-running iconv appropriately.

2. Claim that this API is wrong, and modify it such that an iconv operating on an //IGNORE character set *never* returns EILSEQ (what one might expect, since IGNORE is supposed to allow us to ignore sequences that are illegal in the target). This would make glibc's iconv implementation consistent with libiconv's.

I favor (2), since it makes client code considerably simpler and easier to implement correctly.
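
For comparison, the kind of workaround that option (1) asks of client code might look roughly like the sketch below. The helper name convert_ignoring is hypothetical (it is not part of glibc), and the sketch assumes the behaviour observed above: with //IGNORE, a call that fails with EILSEQ has still skipped the offending input and advanced *INBUF, so the caller can simply resume. It does not flush shift state, which stateful encodings would need.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper, not a glibc API: push all of `in` through a
   descriptor opened with an //IGNORE target.  EILSEQ is treated as
   "something was skipped, keep going" as long as the call consumed input.
   Returns the number of bytes written to `out`, or (size_t)-1 on error
   (E2BIG, EINVAL, or EILSEQ without any progress). */
size_t convert_ignoring(iconv_t cd, char *in, size_t inleft,
                        char *out, size_t outsize)
{
    size_t outleft = outsize;

    while (inleft > 0) {
        size_t before = inleft;
        size_t r = iconv(cd, &in, &inleft, &out, &outleft);

        if (r != (size_t)-1)
            break;                  /* success: all input was consumed */
        if (errno == EILSEQ && inleft < before)
            continue;               /* input was skipped; resume from *in */
        return (size_t)-1;          /* genuine failure, or no progress made */
    }
    return outsize - outleft;
}
```

The progress check (inleft < before) is a design choice in this sketch to avoid looping forever if a descriptor opened without //IGNORE is passed in by mistake. Under option (2) none of this would be needed; a single iconv call would do.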
Comment 1 Jackie Rosen 2014-02-16 18:25:40 UTC Comment hidden (spam)
Comment 2 Tigran 2017-02-28 17:07:21 UTC
Created attachment 9861
HTMLPurifier.standalone
Comment 3 Tigran 2017-02-28 17:17:29 UTC
Created attachment 9862
HTMLPurifier.standalone
Comment 4 Florian Weimer 2017-05-08 05:26:44 UTC
I agree that option (2) (never return EILSEQ with //IGNORE) makes the most sense.