Bug 29913

Summary: iconv(3) is not POSIX compliant, and does not conform to linux man-pages manual
Product: glibc Reporter: Steffen Nurpmeso <steffen>
Component: libc Assignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED INVALID    
Severity: normal CC: bruno, drepper.fsp, rrt
Priority: P2    
Version: 2.36   
Target Milestone: ---   
Host: Target:
Build: Last reconfirmed:

Description Steffen Nurpmeso 2022-12-16 23:03:28 UTC
Disclaimer: i have reported this in the past but the issue was closed.

The problem is that without //TRANSLIT GNU iconv(3) fails to perform the

  If iconv( ) encounters a character in the input buffer that is valid, but for which an identical character does not exist in the target codeset, iconv( ) shall perform an implementation-defined conversion on this character.

POSIX iconv(3) (Vol. 2: System Interfaces, Issue 7) requirement.
Instead GNU libc returns EILSEQ, which is wrong, as POSIX defines EILSEQ only for

  [EILSEQ] Input conversion stopped due to an input byte that does not belong to the input codeset.

The Linux man-pages 6.01 manual (2022‐10‐09) says the same.  But GNU libc _does_ fail with EILSEQ without //TRANSLIT even if the input is valid UTF-8,
as can be seen by running this (a shortened variant of a config test program).
I say "Bye!" already here, and hope it gets fixed!

#include <string.h>
#include <errno.h>
#include <stdio.h>
#include <iconv.h>
int main(void){
        char inb[16], oub[16], *inbp, *oubp;
        iconv_t id;
        size_t inl, oul;
        int rv;

        memcpy(inbp = inb, "\341\203\276", sizeof("\341\203\276"));
        inl = sizeof("\341\203\276") -1;
        oul = sizeof oub;
        oubp = oub;

        rv = 1;
        if((id = iconv_open("us-ascii"/*//TRANSLIT"*/, "utf-8")) == (iconv_t)-1)
                goto jleave;

        rv = 14;
        if(iconv(id, &inbp, &inl, &oubp, &oul) == (size_t)-1){
                fprintf(stderr, "error %s %d==%d\n",strerror(errno),errno,errno==EILSEQ);
                goto jleave;
        }

        fprintf(stderr, "bummer\n");
jleave:
        if(id != (iconv_t)-1)
                iconv_close(id);

        return rv;
}
Comment 1 Reuben Thomas 2023-02-18 20:48:11 UTC
I'm the maintainer of Recode (formerly GNU Recode), the widely-used character conversion utility.

I came across this odd behaviour some years ago, but I only just realised that it is in fact a bug in glibc. My analysis is the same as the reporter's: the POSIX standard says unambiguously that EILSEQ is only returned for invalid input, and when an exact match to the output character set is not possible, an implementation-dependent conversion is performed.

A very simple example using the iconv(1) program:

$ hd foo.data
00000000  c2 b4                                             |..|
00000002
# This is ACUTE ACCENT U+00B4
$ iconv -f UTF-8 -t ISO-8859-15 foo.data
iconv: illegal input sequence at position 0
# This is wrong! The input is valid UTF-8
$ iconv -f UTF-8 -t ISO-8859-15//TRANSLIT foo.data
' # This is the output one might expect in the previous case
$ iconv -f UTF-8 -t ISO-8859-1 ~/Downloads/foo.data | hd
00000000  b4                                                |.|
00000001
# As we'd expect, as ACUTE ACCENT exists in ISO-8859-1

As far as I can see from looking at the code, the conversion code from Unicode to ISO-8859-15 is handled by iconvdata/8bit-gap.c. When it cannot find an ISO-8859-15 equivalent for the given UCS4 character, it calls STANDARD_TO_LOOP_ERR_HANDLER. This sets the error to __GCONV_ILLEGAL_INPUT, which is eventually converted to EILSEQ.  This is wrong!

STANDARD_TO_LOOP_ERR_HANDLER should use some other error code. I cannot see a suitable one in the present set (enum of __GCONV_* in iconv/gconv.h).
Comment 2 Reuben Thomas 2023-02-18 21:20:58 UTC
Some thoughts about remedying the defect:

1. I guess that the current behaviour needs to be retained in some form, because clients will rely on it. In particular, it gives a way to detect when precise conversion is not possible, which iconv's spec does not.

2. However, the current behaviour is a problem for portable programs like Recode, that need to work with multiple iconv implementations. And, it's a bug!

3. The simplest "implementation-dependent conversion" would be to act as if either //IGNORE or //TRANSLIT behaviour had been requested.
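Point 3 can be approximated from the caller's side today. The following is a minimal sketch, not a glibc fix: `open_lenient` is a hypothetical helper name, and the sketch assumes that an implementation which does not understand the //TRANSLIT suffix either rejects it in iconv_open() or ignores it, so the plain target name is tried as a fallback.

```c
#include <iconv.h>
#include <stdio.h>

/* Hypothetical helper: open a converter, preferring //TRANSLIT when the
 * implementation accepts that suffix, otherwise falling back to the plain
 * target name.  Returns (iconv_t)-1 if both attempts fail. */
static iconv_t
open_lenient(const char *tocode, const char *fromcode)
{
        char buf[128];
        iconv_t cd;
        int n;

        /* Build "<tocode>//TRANSLIT"; skip this attempt if it does not fit. */
        n = snprintf(buf, sizeof buf, "%s//TRANSLIT", tocode);
        if (n > 0 && (size_t)n < sizeof buf) {
                cd = iconv_open(buf, fromcode);
                if (cd != (iconv_t)-1)
                        return cd;
        }

        /* Plain name: behaviour on unconvertible input is then whatever
         * the implementation does by default. */
        return iconv_open(tocode, fromcode);
}
```

A caller would use `open_lenient("US-ASCII", "UTF-8")` in place of iconv_open() and still has to handle (size_t)-1 from iconv(), since transliteration may have no mapping for a given character.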
Comment 3 Steffen Nurpmeso 2023-02-18 22:43:31 UTC
It shall simply put a ? (musl uses *), or maybe a configurable character.
Some libraries then put a ? for each byte, others one for the complete sequence that is skipped over.  ("Normally" the converter "knows" enough about the character that the latter strikes me as a good thing.  Like //TRANSLIT does.)

Yes.  I guess the problem is that in "real life" the problem likely does not occur in that form.
Or the people work around it somehow.
For example, in "my" Linux distribution, they changed their pkg like

-               bsdtar -c $COMPRESSION -f $TARGET *  &&  bsdtar -t -v -f $TARGET
+               bsdtar --format=gnutar -c $COMPRESSION -f $TARGET *  &&  bsdtar -t -v -f $TARGET

because some release tarballs seem to contain incorrectly encoded paths.
(So that the -- correct! and _very_ complicated!! -- libarchive character conversion correctly bails.  But the above is easier to handle than doing upstream reports, and gives immediate success.  The bogus path on the disc .. i do not know.  I did not use those packages once the problem was circumvented.)
Comment 4 Bruno Haible 2023-02-19 00:40:21 UTC
> i have reported this in the past but the issue was closed.

This was in https://sourceware.org/bugzilla/show_bug.cgi?id=22908 . Please mark this bug as related to #22908.

> POSIX defined EILSEQ only for
>
>  [EILSEQ] Input conversion stopped due to an input byte that does not belong to the input codeset.

This sentence only means that when /input conversion stopped due to an input byte that does not belong to the input codeset/, the function shall fail with error EILSEQ. It does *not* forbid the function to fail with error EILSEQ for other reasons. It also does *not* forbid the function to fail with other error values for other reasons.

This is not specific to iconv; it holds for all functions specified by POSIX. See https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/V2_chap01.html section 1.2.

> The Linux man-pages 6.01 manual (2022‐10‐09) says the same.

Nope, it does not say so. According to your interpretation, where this man page says "The conversion can stop for four reasons" you would like to add a 5th case.

According to my interpretation of the man page (and I wrote that man page originally), "An invalid multibyte sequence is encountered in the input" may also - depending on the implementation - include the case of input that cannot be meaningfully converted, neither in a reversible nor in a nonreversible way.

In summary: Please close this ticket as INVALID.
Comment 5 Bruno Haible 2023-02-19 00:51:16 UTC
(In reply to Reuben Thomas from comment #2)
> 1. I guess that the current behaviour needs to be retained in some form,
> because clients will rely on it.

Correct. And GNU libiconv (a different implementation of iconv, for systems that have a deficient iconv implementation) implements the same behaviour.

> 2. However, the current behaviour is a problem for portable programs like
> Recode, that need to work with multiple iconv implementations.

If you need code that works with multiple iconv implementations, take a look at gnulib/lib/unicodeio.c lines 137..154 or gnulib/lib/striconveh.c lines 950..962. You see that the problem is that replacing unknown or inconvertible inputs with '?' or '*' or NUL is
- just not yielding practically useful behaviour (especially because the caller then cannot transform a buffer all at once, a purpose for which the iconv function was initially designed),
- requiring platform dependent recognition heuristics.
Comment 6 Steffen Nurpmeso 2023-02-19 01:58:39 UTC
I do not know how to mark this bug as related (unless you already did so by mentioning it).

Linux man says

  The conversion can stop for four reasons

then the only thing that may match is

  An invalid multibyte sequence is encountered in the input

and that is not what is going on.
It is not an invalid input.

And no, iconv users surely always have to be prepared for a loop i would say, just in case the input has a problem and needs to be replaced with a replacement character.

That gnulib snippet is terrible.  I have such a thing also in order to be able to perform an iconv test (we pass through what the lib does).
For example, this snippet was in the program i took maintainership over before 2004:


/*
 * Fault-tolerant iconv() function.
 */
static size_t
iconv_ft(iconv_t cd, char **inb, size_t *inbleft, char **outb, size_t *outbleft)
{
        size_t sz = 0;

        while ((sz = iconv(cd, inb, inbleft, outb, outbleft)) == (size_t)-1
                        && (errno == EILSEQ || errno == EINVAL)) {
                if (*inbleft > 0) {
                        (*inb)++;
                        (*inbleft)--;
                } else {
                        **outb = '\0';
                        break;
                }
                if (*outbleft > 0) {
                        *(*outb)++ = '?';
                        (*outbleft)--;
                } else {
                        **outb = '\0';
                        break;
                }
        }
        return sz;
}

Instead GNU should have reused the EINVAL error for this case.  Or IO, NODATA, NOENT, NOMSG, NOTSUP, NOSYS, NOTOBACCO.

Anyhow, that gnulib snippet was a shock.  What a mess.

The problem with the GNU approach is that portable software that sticks to the POSIX standard and/or reads the Linux manual has to perform a lot of checks in order to find out whether the native iconv supports / wants //TRANSLIT to get the behaviour that the standard describes.

At least in my opinion.
And, as you say, all others but GNU follow this.
Comment 7 Reuben Thomas 2023-02-19 10:06:25 UTC
(In reply to Bruno Haible from comment #4)
> >
> >  [EILSEQ] Input conversion stopped due to an input byte that does not belong to the input codeset.
> 
> This sentence only means that when /input conversion stopped due to an input
> byte that does not belong to the input codeset/, the function shall fail
> with error EILSEQ. It does *not* forbid the function to fail with error
> EILSEQ for other reasons. It also does *not* forbid the function to fail
> with other error values for other reasons.
> 
> This is not specific to iconv; it holds for all functions specified by
> POSIX. See
> https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/
> V2_chap01.html section 1.2.

I have read this section through several times, in particular the sections on "ERRORS" and "RETURN VALUE" and I can't see anything relevant, sorry; please could you elaborate?
Comment 8 Reuben Thomas 2023-02-19 10:15:41 UTC
(In reply to Bruno Haible from comment #4)
> 
> According to my interpretation of the man page (and I wrote that man page
> originally), "An invalid multibyte sequence is encountered in the input" may
> also - depending on the implementation - include the case of input that
> cannot be meaningfully converted, neither in a reversible nor in a
> nonreversible way.

Sorry, but this is an unwarranted interpretation. It's unreasonable without extra explanation to expect the reader to recognize that "invalid" refers to the wider context of the conversion. The fact that it says "invalid multibyte sequence" reinforces this impression: if your interpretation were correct, then iconv would not be expected to return EILSEQ when a single-byte sequence was not translatable, only when a multibyte sequence is untranslatable.

I'll file a separate bug about the documentation. The glibc manual also, as far as I can see, does not document the actual (useful!) behaviour.
Comment 9 Reuben Thomas 2023-02-19 10:22:08 UTC
(In reply to Bruno Haible from comment #5)
> 
> If you need code that works with multiple iconv implementations, take a look
> at gnulib/lib/unicodeio.c lines 137..154 or gnulib/lib/striconveh.c lines
> 950..962. You see that the problem is that replacing unknown or
> inconvertible inputs with '?' or '*' or NUL is
> - just not yielding practically useful behaviour (especially because the
> caller then cannot transform a buffer all at once, a purpose for which the
> iconv function was initially designed),
> - requiring platform dependent recognition heuristics.

For those who need to work with multiple implementations, it looks like this code could usefully be exposed in its own gnulib API.

Since most of the problems I've had with Recode since taking it over have arisen from iconv, and coping with different implementations just makes it worse, I think I will retreat to using GNU libiconv (which Recode used to use) where at least I only have one implementation to deal with.
Comment 10 Steffen Nurpmeso 2023-02-19 22:57:00 UTC
I mean the GNU approach definitely has merits.
If only it were not automatic, but required //OUCNVERR
or some other hypothetical explicit configuration.

As it stands, GNU stands out with its behaviour, and i as
a programmer do not know how to differentiate between an input
ILSEQ (dramatic!) and an output ILSEQ (the email use case might try
a different character set).  I can, maybe, a bit -- if i know for sure
that the iconv i use is the GNU one, which might not be true in
practice (though i know of no other dynamic library that can
replace it, only of libc-built-in and GNU iconv lib choices).
If only it were a dedicated errno value.
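One way a caller could tell the two cases apart when the source encoding is UTF-8 is to validate the offending bytes independently after EILSEQ. This is a minimal sketch under two assumptions: the source codeset is UTF-8, and iconv() leaves *inbuf pointing at the start of the sequence it rejected (as the man page describes for invalid input). The names `utf8_seq_len`, `classify_ilseq` and `enum ilseq_kind` are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: return the length (1..4) of the well-formed UTF-8
 * sequence at the start of s[0..n), or 0 if the bytes there are not
 * well-formed UTF-8 (this includes truncated sequences). */
static size_t
utf8_seq_len(const unsigned char *s, size_t n)
{
        size_t len, i;
        unsigned long cp, min;

        if (n == 0)
                return 0;
        if (s[0] < 0x80)
                return 1;
        if ((s[0] & 0xE0) == 0xC0) {
                len = 2; cp = s[0] & 0x1F; min = 0x80;
        } else if ((s[0] & 0xF0) == 0xE0) {
                len = 3; cp = s[0] & 0x0F; min = 0x800;
        } else if ((s[0] & 0xF8) == 0xF0) {
                len = 4; cp = s[0] & 0x07; min = 0x10000;
        } else
                return 0;       /* stray continuation byte or 0xF8..0xFF */
        if (n < len)
                return 0;
        for (i = 1; i < len; i++) {
                if ((s[i] & 0xC0) != 0x80)
                        return 0;
                cp = (cp << 6) | (s[i] & 0x3F);
        }
        /* Reject overlong forms, surrogates and values past U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
                return 0;
        return len;
}

enum ilseq_kind { ILSEQ_INPUT, ILSEQ_OUTPUT };

/* Call after iconv() failed with EILSEQ, passing the updated *inbuf and
 * *inbytesleft.  Well-formed input means the target codeset lacks the
 * character; anything else means the input itself is broken. */
static enum ilseq_kind
classify_ilseq(const char *inp, size_t inleft)
{
        return utf8_seq_len((const unsigned char *)inp, inleft) != 0
            ? ILSEQ_OUTPUT : ILSEQ_INPUT;
}
```

For other source encodings the same idea would need a validator for that encoding, which is exactly the kind of knowledge the iconv implementation already has and the caller does not.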

For me the need to go //TRANSLIT is a well hm painful GNU-specific
need and way, and it shall be noted it is "transliteration":
something entirely different than "an implementation-defined
conversion on this character" that in reality is either * or ?.
It could do whatever, say replacing a hypothetical calligraphic "tiger
protects the house" with a download link for a book of Dostojewski
or something.

How can i test this??
How can i as a programmer write a test that checks that my program
works correctly regarding iconv if i have to use //TRANSLIT, which may
change behind the scenes and "improve" the transliteration because
someone spent time on some character set and found a better one?
I currently use "U+1FA78/f0 9f a9 b9/;DROP OF BLOOD" which right
now works everywhere, but //TRANSLIT may turn it to an embedded
picture of Bela Lugosi?  Nosferatu?
Comment 11 Steffen Nurpmeso 2023-02-19 23:02:47 UTC
iconv could do much more for programmers anyway.
For example email software has to know whether an actual character set is, in fact, US-ASCII, and the iconv implementation surely knows.
Yet it does not expose an API for this particular thing ("official name").
Like normalize_name(), and i have a dedicated is_ascii like

        /* In reversed MIME preference order */
        static char const * const names[] = {"csASCII", "cp367", "IBM367", "us",
                        "ISO646-US", "ISO_646.irv:1991", "ANSI_X3.4-1986", "iso-ir-6",
                        "ANSI_X3.4-1968", "ASCII", "US-ASCII"};

I am pretty sure GNU iconv will map all those names to the thing.
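A minimal sketch of such a dedicated check, built around the name list quoted above. The function name `is_ascii_name` is hypothetical, and the sketch assumes that character set names should be compared without regard to case.

```c
#include <stddef.h>
#include <strings.h>

/* Hypothetical helper: return 1 if cs is one of the known names for
 * US-ASCII, 0 otherwise.  The comparison ignores case. */
static int
is_ascii_name(const char *cs)
{
        /* In reversed MIME preference order */
        static char const * const names[] = {"csASCII", "cp367", "IBM367", "us",
                        "ISO646-US", "ISO_646.irv:1991", "ANSI_X3.4-1986", "iso-ir-6",
                        "ANSI_X3.4-1968", "ASCII", "US-ASCII"};
        size_t i;

        for (i = 0; i < sizeof names / sizeof *names; i++)
                if (strcasecmp(cs, names[i]) == 0)
                        return 1;
        return 0;
}
```

This only recognizes the aliases in the table; an iconv implementation may accept further spellings that the table does not list, which is the reason an "official name" API would be more reliable.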
Comment 12 Steffen Nurpmeso 2023-02-20 20:09:11 UTC
Actually i have forgotten about
https://austingroupbugs.net/view.php?id=1007
because the behaviour bugs me.
Sorry.
Comment 13 Steffen Nurpmeso 2023-02-20 20:54:52 UTC
P.S.:
glibc is wrong wrong wrong!
It should NOT NOT NOT give an ILSEQ for output conversion!

I know mbrtowc does, that is surely where this comes from; but that sits upon a valid input character!

Having invalid, broken, illegal input is a dramatic failure!
Not being able to convert valid input to another character set is entirely different.
(Sebor said de facto the same for the POSIX standard issue, in 2016.)
Comment 14 Steffen Nurpmeso 2023-02-20 21:52:18 UTC
P.P.S.: sorry for the noise!
But now, in order to deal with that (as the GNU approach has its merits, really), i downloaded GNU libiconv, and in wchar_to_loop_convert() i see

      size_t res = unicode_loop_convert(&wcd->parent,
                                        &inptr,&inleft,
                                        &bufptr,&bufleft);
      if (res == (size_t)(-1)) {
        if (errno == EILSEQ)
          /* Invalid input. */

And so i stop because i wholeheartedly agree.

I hope it is ok to assume that matching __GNU_LIBRARY__ and _LIBICONV_VERSION (unfortunately this is all compile-time only) is the way to go to get EILSEQ upon output conversion error?

Thank you.
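Since the macro check is compile-time only, a run-time probe is one possible complement. The following is a minimal sketch, not an answer to the question above: the function name `iconv_fails_on_unconvertible` is hypothetical, and it assumes that the implementation knows the names "US-ASCII" and "UTF-8" and that U+1FA78 (the test character mentioned in comment 10) has no US-ASCII equivalent.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical probe.  Returns
 *    1  if iconv() reports EILSEQ for valid but unconvertible input
 *       (the behaviour discussed in this report),
 *    0  if the conversion goes through (e.g. with a replacement),
 *   -1  if the probe could not be run or failed for another reason. */
static int
iconv_fails_on_unconvertible(void)
{
        char in[] = "\xf0\x9f\xa9\xb9";         /* U+1FA78 DROP OF BLOOD */
        char out[16];
        char *inp = in, *outp = out;
        size_t inl = sizeof in - 1, outl = sizeof out;
        iconv_t cd;
        int rv;

        cd = iconv_open("US-ASCII", "UTF-8");
        if (cd == (iconv_t)-1)
                return -1;

        if (iconv(cd, &inp, &inl, &outp, &outl) != (size_t)-1)
                rv = 0;
        else if (errno == EILSEQ)
                rv = 1;
        else
                rv = -1;

        iconv_close(cd);
        return rv;
}
```

A program could call this once at startup and combine the result with the compile-time macros; a return of 1 only says that EILSEQ was reported for this one character, not why.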