Bug 6050

Summary: iconv(1) buffers all of stdin in memory
Product: glibc
Reporter: Daniel Richard G. <skunk>
Component: locale
Assignee: Not yet assigned to anyone <unassigned>
Status: NEW ---
Severity: enhancement
CC: bugdal, glibc-bugs, neleai
Priority: P2
Flags: fweimer: security-
Version: unspecified
Target Milestone: ---
Host:
Target:
Build:
Last reconfirmed:

Description Daniel Richard G. 2008-04-08 18:47:12 UTC
When reading from stdin, glibc's iconv(1) frontend buffers the entire input in 
memory, which makes the program unsuitable for very large inputs.

(This problem does not arise when reading a file directly, presumably because 
the input is mmap()ed in that case.)

$ iconv -f utf8 -t ucs2 <lots-o-gigabytes-utf8.txt >/dev/null
Killed

Confirmed with 2.6.1.

Comment 1 Rich Felker 2013-08-10 16:01:22 UTC
Ping. This bug still exists. I traced it to a comment in the source:

/* we have a problem with reading from a desriptor since we must not
   provide the iconv() function an incomplete character or shift
   sequence at the end of the buffer.  Since we have to deal with
   arbitrary encodings we must read the whole text in a buffer and
   process it in one step.  */

See http://sourceware.org/git/?p=glibc.git;a=blob;f=iconv/iconv_prog.c;h=1a1d0d0cf45c0d747a8090bc234addd9e49f1ba7;hb=HEAD#l561

The claims made in the comment are simply erroneous. Per POSIX, the iconv function returns (size_t)-1 with errno set to EINVAL to indicate "Input conversion stopped due to an incomplete character or shift sequence at the end of the input buffer." This is a different condition from EILSEQ, and thus the caller can detect and recover from it simply by moving the remaining bytes of the input buffer to the beginning, re-filling the buffer, and calling iconv again.

If glibc's iconv function does not support this behavior correctly, that's a library-level bug which should be filed separately and fixed.
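
The recovery strategy described above can be sketched roughly as follows. This is only an illustration, not the iconv_prog.c code and not a proposed patch: convert_stream and write_all are hypothetical names, the 4096-byte buffers are an arbitrary choice, and it assumes iconv(3) reports EINVAL for a trailing incomplete sequence as POSIX specifies.

```c
#include <errno.h>
#include <iconv.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write all n bytes to fd, retrying short writes. */
static int write_all(int fd, const char *p, size_t n)
{
    while (n > 0) {
        ssize_t w = write(fd, p, n);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += w;
        n -= (size_t)w;
    }
    return 0;
}

/* Hypothetical helper: convert everything readable from in_fd and write
   the result to out_fd, using fixed-size buffers.  Returns 0 on success,
   -1 on error (errno is EINVAL if the input ends mid-sequence). */
int convert_stream(iconv_t cd, int in_fd, int out_fd)
{
    char inbuf[4096], outbuf[4096];
    size_t have = 0;            /* unconverted bytes at start of inbuf */

    for (;;) {
        ssize_t n = read(in_fd, inbuf + have, sizeof inbuf - have);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break;              /* end of input */
        have += (size_t)n;

        char *in = inbuf;
        size_t inleft = have;
        while (inleft > 0) {
            char *out = outbuf;
            size_t outleft = sizeof outbuf;
            size_t r = iconv(cd, &in, &inleft, &out, &outleft);
            int err = errno;    /* save before write() can change it */
            if (write_all(out_fd, outbuf, (size_t)(out - outbuf)) < 0)
                return -1;
            if (r != (size_t)-1)
                break;          /* whole chunk consumed */
            if (err == E2BIG)
                continue;       /* output buffer was full: go again */
            if (err == EINVAL)
                break;          /* incomplete sequence: need more input */
            errno = err;
            return -1;          /* EILSEQ or another hard error */
        }
        /* Move the unconsumed tail to the front and refill behind it. */
        memmove(inbuf, in, inleft);
        have = inleft;
    }
    if (have > 0) {
        errno = EINVAL;         /* input ended inside a sequence */
        return -1;
    }
    /* Emit any final shift sequence for stateful target encodings. */
    char *out = outbuf;
    size_t outleft = sizeof outbuf;
    if (iconv(cd, NULL, NULL, &out, &outleft) == (size_t)-1)
        return -1;
    return write_all(out_fd, outbuf, (size_t)(out - outbuf));
}
```

With a descriptor from iconv_open("UCS-2LE", "UTF-8"), calling convert_stream(cd, STDIN_FILENO, STDOUT_FILENO) would keep memory use constant regardless of input size.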

Comment 2 Ondrej Bilka 2013-10-07 13:58:09 UTC
On Sat, Aug 10, 2013 at 04:01:22PM +0000, bugdal at aerifal dot cx wrote:
> The claims made in the comment are simply erroneous. Per POSIX, the iconv
> function returns (size_t)-1 with errno set to EINVAL to indicate "Input
> conversion stopped due to an incomplete character or shift sequence at the end
> of the input buffer." This is a different condition from EILSEQ, and thus the
> caller can detect and recover from it simply by moving the remaining bytes of
> the input buffer to the beginning, re-filling the buffer, and calling iconv
> again.
>
This would work only for stateless encodings. You cannot do this with
ISO-2022-JP, as you would need an additional argument to save state.

Comment 3 Rich Felker 2013-10-07 14:06:49 UTC
On Mon, Oct 07, 2013 at 01:58:09PM +0000, neleai at seznam dot cz wrote:
> This would work only for stateless encodings. You cannot do this with
> ISO-2022-JP, as you would need an additional argument to save state.

No, the state is saved in the conversion descriptor, and iconv(3)
reports to the caller the exact point at which it stopped in the
input, so you simply resume from that point.
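
A small sketch of this point: the same ISO-2022-JP input is decoded once in a single call and once a byte at a time through the same kind of carry-over buffer, and both should produce identical output because the shift state lives in the conversion descriptor. The helper name convert_in_chunks and the 64-byte carry buffer are hypothetical, and the sketch assumes the platform's iconv provides an ISO-2022-JP converter (it returns -2 if not).

```c
#include <errno.h>
#include <iconv.h>
#include <string.h>

/* Hypothetical helper: convert len bytes of ISO-2022-JP at src to UTF-8
   in dst, handing iconv() at most `chunk` new input bytes per call.
   Returns the number of output bytes, -1 on error, or -2 if the
   converter is not available on this system. */
long convert_in_chunks(const char *src, size_t len, size_t chunk,
                       char *dst, size_t dstsize)
{
    if (chunk == 0)
        return -1;
    iconv_t cd = iconv_open("UTF-8", "ISO-2022-JP");
    if (cd == (iconv_t)-1)
        return -2;

    char pending[64];           /* unconsumed tail plus newly fed bytes */
    size_t have = 0, pos = 0;
    char *out = dst;
    size_t outleft = dstsize;
    long result = -1;

    while (pos < len) {
        size_t n = len - pos < chunk ? len - pos : chunk;
        if (n > sizeof pending - have)
            n = sizeof pending - have;
        if (n == 0)
            goto done;          /* carry buffer full: give up */
        memcpy(pending + have, src + pos, n);
        have += n;
        pos += n;

        char *in = pending;
        size_t inleft = have;
        /* EINVAL only means "sequence incomplete, feed more input";
           the escape state already seen stays inside cd. */
        if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1
            && errno != EINVAL)
            goto done;
        /* iconv() advanced `in` to the exact point where it stopped. */
        memmove(pending, in, inleft);
        have = inleft;
    }
    if (have == 0)
        result = (long)(out - dst);
done:
    iconv_close(cd);
    return result;
}
```

No extra state argument is needed: the caller only keeps the unconsumed bytes, and everything else is carried by cd between calls.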

Comment 4 Ondrej Bilka 2013-10-09 07:50:21 UTC
Ah, then it is OK. Could you write a patch?