Summary: | iconv(3) is not POSIX compliant, and does not conform to linux man-pages manual | ||
---|---|---|---|
Product: | glibc | Reporter: | Steffen Nurpmeso <steffen> |
Component: | libc | Assignee: | Not yet assigned to anyone <unassigned> |
Status: | RESOLVED INVALID | ||
Severity: | normal | CC: | bruno, drepper.fsp, rrt |
Priority: | P2 | ||
Version: | 2.36 | ||
Target Milestone: | --- | ||
Host: | Target: | ||
Build: | Last reconfirmed: |
Description
Steffen Nurpmeso
2022-12-16 23:03:28 UTC
I'm the maintainer of Recode (formerly GNU Recode), the widely-used character conversion utility. I came across this odd behaviour some years ago, but I only just realised that it is in fact a bug in glibc.

My analysis is the same as the reporter's: the POSIX standard says unambiguously that EILSEQ is only returned for invalid input, and when an exact match to the output character set is not possible, an implementation-dependent conversion is performed.

A very simple example using the iconv(1) program:

$ hd foo.data
00000000  c2 b4  |..|
00000002
# This is ACUTE ACCENT U+00B4
$ iconv -f UTF-8 -t ISO-8859-15 foo.data
iconv: illegal input sequence at position 0
# This is wrong! The input is valid UTF-8
$ iconv -f UTF-8 -t ISO-8859-15//TRANSLIT foo.data
'
# This is the output one might expect in the previous case
$ iconv -f UTF-8 -t ISO-8859-1 ~/Downloads/foo.data | hd
00000000  b4  |.|
00000001
# As we'd expect, as ACUTE ACCENT exists in ISO-8859-1

As far as I can see from looking at the code, the conversion code from Unicode to ISO-8859-15 is handled by iconvdata/8bit-gap.c. When it cannot find an ISO-8859-15 equivalent for the given UCS4 character, it calls STANDARD_TO_LOOP_ERR_HANDLER. This sets the error to __GCONV_ILLEGAL_INPUT, which is eventually converted to EILSEQ.

This is wrong! STANDARD_TO_LOOP_ERR_HANDLER should use some other error code. I cannot see a suitable one in the present set (enum of __GCONV_* in iconv/gconv.h).

Some thoughts about remedying the defect:

1. I guess that the current behaviour needs to be retained in some form, because clients will rely on it. In particular, it gives a way to detect when precise conversion is not possible, which iconv's spec does not.

2. However, the current behaviour is a problem for portable programs like Recode, that need to work with multiple iconv implementations. And, it's a bug!

3. The simplest "implementation-dependent conversion" would be to act as if either //IGNORE or //TRANSLIT behaviour had been requested.
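The iconv(1) transcript above can also be reproduced through iconv(3) directly. The following is only a minimal sketch: probe_acute_accent is a hypothetical helper name, not part of glibc, Recode, or any other library, and which errno value comes back depends on the iconv implementation in use and on which converters are installed.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (an assumption for illustration only): feeds the two
 * UTF-8 bytes of U+00B4 ACUTE ACCENT to a UTF-8 -> to_code converter.
 * Returns 0 if iconv() succeeded, otherwise the errno value reported by
 * iconv_open() or iconv(). */
int probe_acute_accent(const char *to_code)
{
    assert(to_code != NULL);

    iconv_t cd = iconv_open(to_code, "UTF-8");
    if (cd == (iconv_t)-1)
        return errno;               /* conversion pair not available */

    char in[] = "\xc2\xb4";         /* valid UTF-8 for U+00B4 */
    char out[8];
    char *inp = in;
    char *outp = out;
    size_t inleft = 2;
    size_t outleft = sizeof out;

    errno = 0;
    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    int err = (rc == (size_t)-1) ? errno : 0;

    iconv_close(cd);
    return err;
}
```

Going by the transcript above, probe_acute_accent("ISO-8859-15") would be expected to return EILSEQ with glibc, while an implementation that substitutes a replacement character would return 0; probe_acute_accent("ISO-8859-1") should return 0 wherever that converter exists.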
It shall simply put a ? (musl uses *), or maybe a configurable character. Some libraries then put a ? for each byte, others put one for the complete sequence that is skipped over. ("Normally" the converter "knows" about the character so much that the latter strikes me as a good thing. Like //TRANSLIT does.)

Yes. I guess the problem is that in "real life" the problem likely does not occur in that form. Or the people work around it somehow. For example, in "my" Linux distribution, they changed their pkg like

- bsdtar -c $COMPRESSION -f $TARGET * && bsdtar -t -v -f $TARGET
+ bsdtar --format=gnutar -c $COMPRESSION -f $TARGET * && bsdtar -t -v -f $TARGET

because some release balls seem to contain falsely encoded paths. (So that the -- correct! and _very_ complicated!! -- libarchive character conversion correctly bails.) But the above is easier to handle than doing upstream reports, and gives immediate success. (The bogus path on the disc .. i do not know. I did not use those packages once the problem was circumvented.)

> i have reported this in the past but the issue was closed.

This was in https://sourceware.org/bugzilla/show_bug.cgi?id=22908 . Please mark this bug as related to #22908.

> POSIX defined EILSEQ only for
>
> [EILSEQ] Input conversion stopped due to an input byte that does not belong to the input codeset.

This sentence only means that when /input conversion stopped due to an input byte that does not belong to the input codeset/, the function shall fail with error EILSEQ. It does *not* forbid the function to fail with error EILSEQ for other reasons. It also does *not* forbid the function to fail with other error values for other reasons.

This is not specific to iconv; it holds for all functions specified by POSIX. See https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/V2_chap01.html section 1.2.

> The Linux man-pages 6.01 manual (2022‐10‐09) says the same.

Nope, it does not say so.
According to your interpretation, where this man page says "The conversion can stop for four reasons" you would like to add a 5th case.

According to my interpretation of the man page (and I wrote that man page originally), "An invalid multibyte sequence is encountered in the input" may also - depending on the implementation - include the case of input that cannot be meaningfully converted, neither in a reversible nor in a nonreversible way.

In summary: Please close this ticket as INVALID.

(In reply to Reuben Thomas from comment #2)
> 1. I guess that the current behaviour needs to be retained in some form,
> because clients will rely on it.

Correct. And GNU libiconv (a different implementation of iconv, for systems that have a deficient iconv implementation) implements the same behaviour.

> 2. However, the current behaviour is a problem for portable programs like
> Recode, that need to work with multiple iconv implementations.

If you need code that works with multiple iconv implementations, take a look at gnulib/lib/unicodeio.c lines 137..154 or gnulib/lib/striconveh.c lines 950..962. You see that the problem is that replacing unknown or inconvertible inputs with '?' or '*' or NUL is
- just not yielding practically useful behaviour (especially because the caller then cannot transform a buffer all at once, a purpose for which the iconv function was initially designed),
- requiring platform dependent recognition heuristics.

Do not know how to relate (unless you did by noting). Linux man says

  The conversion can stop for four reasons

then the only thing that may match is

  An invalid multibyte sequence is encountered in the input

and that is not what is going on. It is not an invalid input.

And no, iconv users surely always have to be prepared for a loop i would say, just in case the input has a problem and needs to be replaced with a replacement character.

That gnulib snippet is terrible.
I have such a thing also in order to be able to perform an iconv test (we pass through what the lib does). For example, this snippet was in the program i took maintainership over before 2004:

  /*
   * Fault-tolerant iconv() function.
   */
  static size_t
  iconv_ft(iconv_t cd, char **inb, size_t *inbleft,
           char **outb, size_t *outbleft)
  {
      size_t sz = 0;

      while ((sz = iconv(cd, inb, inbleft, outb, outbleft)) == (size_t)-1 &&
             (errno == EILSEQ || errno == EINVAL)) {
          if (*inbleft > 0) {
              (*inb)++;
              (*inbleft)--;
          } else {
              **outb = '\0';
              break;
          }
          if (*outbleft > 0) {
              *(*outb)++ = '?';
              (*outbleft)--;
          } else {
              **outb = '\0';
              break;
          }
      }
      return sz;
  }

Instead GNU should have reused the EINVAL error for this case. Or IO, NODATA, NOENT, NOMSG, NOTSUP, NOSYS, NOTOBACCO.

Anyhow, that gnulib snippet was a shock. What a mess. The problem with the GNU approach is that portable software that glues to the POSIX standard and/or reads the Linux manual has to perform a lot of checks in order to find out whether the native iconv supports / wants //TRANSLIT to get the behaviour that the standard describes. At least in my opinion. And, as you say, all others but GNU follow this.

(In reply to Bruno Haible from comment #4)
> > [EILSEQ] Input conversion stopped due to an input byte that does not belong to the input codeset.
>
> This sentence only means that when /input conversion stopped due to an input
> byte that does not belong to the input codeset/, the function shall fail
> with error EILSEQ. It does *not* forbid the function to fail with error
> EILSEQ for other reasons. It also does *not* forbid the function to fail
> with other error values for other reasons.
>
> This is not specific to iconv; it holds for all functions specified by
> POSIX. See
> https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/V2_chap01.html
> section 1.2.
I have read this section through several times, in particular the sections on "ERRORS" and "RETURN VALUE" and I can't see anything relevant, sorry; please could you elaborate?

(In reply to Bruno Haible from comment #4)
> According to my interpretation of the man page (and I wrote that man page
> originally), "An invalid multibyte sequence is encountered in the input" may
> also - depending on the implementation - include the case of input that
> cannot be meaningfully converted, neither in a reversible nor in a
> nonreversible way.

Sorry, but this is an unwarranted interpretation. It's unreasonable without extra explanation to expect the reader to recognize that "invalid" refers to the wider context of the conversion. The fact that it says "invalid multibyte sequence" reinforces this impression: if your interpretation were correct, then iconv would not be expected to return EILSEQ when a single-byte sequence was not translatable, only when a multibyte sequence is untranslatable.

I'll file a separate bug about the documentation. The glibc manual also, as far as I can see, does not document the actual (useful!) behaviour.

(In reply to Bruno Haible from comment #5)
> If you need code that works with multiple iconv implementations, take a look
> at gnulib/lib/unicodeio.c lines 137..154 or gnulib/lib/striconveh.c lines
> 950..962. You see that the problem is that replacing unknown or
> inconvertible inputs with '?' or '*' or NUL is
> - just not yielding practically useful behaviour (especially because the
> caller then cannot transform a buffer all at once, a purpose for which the
> iconv function was initially designed),
> - requiring platform dependent recognition heuristics.

For those who need to work with multiple implementations, it looks like this code could usefully be exposed in its own gnulib API.
Since most of the problems I've had with Recode since taking it over have arisen from iconv, and coping with different implementations just makes it worse, I think I will retreat to using GNU libiconv (which Recode used to use) where at least I only have one implementation to deal with.

I mean the GNU approach definitely has merits. If only it were not automatic, but required //OUCNVERR or some other hypothetical explicit configuration. As it stands GNU stands out with its behaviour, and i as a programmer do not know how to differentiate between an input ILSEQ (dramatic!) and an output ILSEQ (the email use case might try a different character set). I can maybe a bit -- if i know for sure that the iconv i use is the GNU one, which might not be true in practice (though i know of no other dynamic library that can replace it, only of libc-built-in and GNU iconv lib choices). If only it were a dedicated errno value.

For me the need to go //TRANSLIT is a, well, hm, painful GNU-specific need and way, and it shall be noted it is "transliteration": something entirely different from "an implementation-defined conversion on this character" that in reality is either * or ?. It could do whatever, say replacing a hypothetical calligraphic "tiger protects the house" with a download link for a book of Dostojewski or something.

How can i test this?? How can i as a programmer write a test that tests my program works correctly regarding iconv if i have to use //TRANSLIT that may change behind the lines and "improve" the transliteration because someone spent time on some character set and found a better one? I currently use "U+1FA78/f0 9f a9 b9/;DROP OF BLOOD" which right now works everywhere, but //TRANSLIT may turn it into an embedded picture of Bela Lugosi? Nosferatu?

iconv could do much more for programmers anyway. For example email software has to know whether an actual character set is, in fact, US-ASCII, and the iconv implementation surely knows.
Yet it does not expose an API for this particular thing ("official name"). Like normalize_name(), and i have a dedicated is_ascii like

  /* In reversed MIME preference order */
  static char const * const names[] = {"csASCII", "cp367", "IBM367", "us",
        "ISO646-US", "ISO_646.irv:1991", "ANSI_X3.4-1986", "iso-ir-6",
        "ANSI_X3.4-1968", "ASCII", "US-ASCII"};

I am pretty sure GNU iconv will map all those names to the thing.

Actually i have forgotten about https://austingroupbugs.net/view.php?id=1007 because the behaviour bugs me. Sorry.

P.S.: glibc is wrong wrong wrong! It should NOT NOT NOT give an ILSEQ for output conversion! I know mbrtowc does, and that is surely where this comes from; but that sits upon a valid input character! Having invalid, broken, illegal input is a dramatic failure! Not being able to convert valid input to another character set is entirely different. (Sebor said de facto the same for the POSIX standard issue, in 2016.)

P.P.S.: sorry for the noise! But now, in order to deal with that (as the GNU approach has its merits, really), i downloaded GNU libiconv, and in wchar_to_loop_convert() i see

  size_t res = unicode_loop_convert(&wcd->parent, &inptr,&inleft, &bufptr,&bufleft);
  if (res == (size_t)(-1)) {
    if (errno == EILSEQ)
      /* Invalid input. */

And so i stop because i wholeheartedly agree. I hope it is ok to assume that matching __GNU_LIBRARY__ and _LIBICONV_VERSION (unfortunately this is all compile-time only) is all the way to go to get EILSEQ upon output conversion error? Thank you.
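The compile-time check asked about in the last paragraph could look roughly like the sketch below. It only encodes the claim made earlier in this thread, that glibc and GNU libiconv both report EILSEQ for unconvertible input; it does not verify that claim. ICONV_EILSEQ_MAY_MEAN_UNCONVERTIBLE and eilseq_may_mean_unconvertible are hypothetical names introduced for illustration. __GNU_LIBRARY__ comes from glibc's headers and _LIBICONV_VERSION from GNU libiconv's <iconv.h>.

```c
#include <iconv.h>  /* on glibc this pulls in <features.h>; GNU libiconv's
                       own header defines _LIBICONV_VERSION */

/* Assumption taken from the discussion above, not from any specification:
 * with glibc or GNU libiconv, EILSEQ may also mean "valid input that has
 * no representation in the target character set". */
#if defined(_LIBICONV_VERSION) || defined(__GNU_LIBRARY__)
# define ICONV_EILSEQ_MAY_MEAN_UNCONVERTIBLE 1
#else
# define ICONV_EILSEQ_MAY_MEAN_UNCONVERTIBLE 0
#endif

/* Hypothetical accessor so callers can branch at run time on the
 * compile-time guess. */
int eilseq_may_mean_unconvertible(void)
{
    return ICONV_EILSEQ_MAY_MEAN_UNCONVERTIBLE;
}
```

As noted in the comment itself, this is decided at compile time only, so it says nothing about an iconv that is swapped in at link or load time.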