Bug 6050 - iconv(1) buffers all of stdin in memory
Summary: iconv(1) buffers all of stdin in memory
Status: NEW
Alias: None
Product: glibc
Classification: Unclassified
Component: locale
Version: unspecified
Importance: P2 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-04-08 18:47 IST by Daniel Richard G.
Modified: 2015-08-29 20:27 IST
CC List: 3 users

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Description Daniel Richard G. 2008-04-08 18:47:12 IST
When reading from stdin, glibc's iconv(1) frontend buffers the entire input in 
memory, which makes the program unsuitable for very large inputs.

(This problem does not arise when reading a file directly, presumably because 
the input is mmap()ed in that case.)

$ iconv -f utf8 -t ucs2 <lots-o-gigabytes-utf8.txt >/dev/null
Killed

Confirmed with 2.6.1.
Comment 1 Rich Felker 2013-08-10 16:01:22 IST
Ping. This bug still exists. I traced it to a comment in the source:

/* we have a problem with reading from a desriptor since we must not
   provide the iconv() function an incomplete character or shift
   sequence at the end of the buffer.  Since we have to deal with
   arbitrary encodings we must read the whole text in a buffer and
   process it in one step.  */

See http://sourceware.org/git/?p=glibc.git;a=blob;f=iconv/iconv_prog.c;h=1a1d0d0cf45c0d747a8090bc234addd9e49f1ba7;hb=HEAD#l561

The claims made in the comment are simply erroneous. Per POSIX, the iconv function returns (size_t)-1 with errno set to EINVAL to indicate "Input conversion stopped due to an incomplete character or shift sequence at the end of the input buffer." This is a different condition from EILSEQ, and thus the caller can detect and recover from it simply by moving the remaining bytes of the input buffer to the beginning, re-filling the buffer, and calling iconv again.

If glibc's iconv function does not support this behavior correctly, that's a library-level bug which should be filed separately and fixed.
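
A minimal sketch of the recovery loop described above, assuming a fixed-size buffer and plain read(2)/write(2) on file descriptors. The names convert_stream and write_all are hypothetical helpers for illustration, not code from iconv_prog.c, and the 4096-byte buffer size is an arbitrary choice:

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write all n bytes to fd, retrying on EINTR. */
static int write_all(int fd, const char *p, size_t n)
{
    while (n > 0) {
        ssize_t w = write(fd, p, n);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += w;
        n -= (size_t)w;
    }
    return 0;
}

/* Hypothetical helper: convert in_fd to out_fd in bounded memory.
   Returns 0 on success, -1 on error with errno set. */
int convert_stream(iconv_t cd, int in_fd, int out_fd)
{
    char inbuf[4096], outbuf[4096];
    size_t have = 0;            /* unconverted bytes at the start of inbuf */
    char *out;
    size_t outleft;

    for (;;) {
        ssize_t n = read(in_fd, inbuf + have, sizeof inbuf - have);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break;              /* end of input */
        have += (size_t)n;
        assert(have <= sizeof inbuf);

        char *in = inbuf;
        size_t inleft = have;
        while (inleft > 0) {
            out = outbuf;
            outleft = sizeof outbuf;
            size_t r = iconv(cd, &in, &inleft, &out, &outleft);
            int err = errno;    /* save before write() can change it */
            if (write_all(out_fd, outbuf, sizeof outbuf - outleft) < 0)
                return -1;
            if (r != (size_t)-1)
                break;          /* everything in the buffer was converted */
            if (err == E2BIG)
                continue;       /* output buffer was full: drained, go again */
            if (err == EINVAL)
                break;          /* incomplete tail: refill and retry */
            errno = err;        /* EILSEQ or another hard error */
            return -1;
        }
        /* Move the unconverted tail to the front before refilling. */
        memmove(inbuf, in, inleft);
        have = inleft;
    }
    if (have > 0) {             /* input ended in the middle of a character */
        errno = EINVAL;
        return -1;
    }
    /* Emit any bytes needed to return to the initial shift state. */
    out = outbuf;
    outleft = sizeof outbuf;
    if (iconv(cd, NULL, NULL, &out, &outleft) == (size_t)-1)
        return -1;
    return write_all(out_fd, outbuf, sizeof outbuf - outleft);
}
```

With a descriptor from iconv_open("UTF-16LE", "UTF-8"), calling convert_stream(cd, STDIN_FILENO, STDOUT_FILENO) would use a constant amount of memory regardless of input size. EINVAL is treated as "need more input" and only becomes an error if it persists at end of input.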
Comment 2 Ondrej Bilka 2013-10-07 13:58:09 IST
On Sat, Aug 10, 2013 at 04:01:22PM +0000, bugdal at aerifal dot cx wrote:
> The claims made in the comment are simply erroneous. Per POSIX, the iconv
> function returns (size_t)-1 with errno set to EINVAL to indicate "Input
> conversion stopped due to an incomplete character or shift sequence at the end
> of the input buffer." This is a different condition from EILSEQ, and thus the
> caller can detect and recover from it simply by moving the remaining bytes of
> the input buffer to the beginning, re-filling the buffer, and calling iconv
> again.
>
This would work only for stateless encodings. You cannot do this with
ISO-2022-JP, as you would need an additional argument to save the state.
Comment 3 Rich Felker 2013-10-07 14:06:49 IST
On Mon, Oct 07, 2013 at 01:58:09PM +0000, neleai at seznam dot cz wrote:
> This would work only for stateless encodings. You cannot do this with
> ISO-2022-JP as you would need additional argument to save state.

No, the state is saved in the conversion descriptor, and iconv(3)
reports to the caller the exact point at which it stopped in the
input, so you simply resume from that point.
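
A small sketch of this point, assuming the iconv implementation provides an ISO-2022-JP converter. The helper convert_chunked is hypothetical: it feeds the input to iconv(3) a few bytes at a time and carries over only the unconverted tail, with no separate state argument:

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <string.h>

/* Hypothetical helper: convert src to dst, handing iconv() at most
   `chunk` new bytes per call. The shift state lives in cd; only the
   unconverted tail bytes are carried between calls.
   Returns the number of bytes written to dst, or (size_t)-1 on error. */
size_t convert_chunked(iconv_t cd, const char *src, size_t srclen,
                       size_t chunk, char *dst, size_t dstcap)
{
    char buf[64];
    size_t have = 0;            /* carried-over bytes at the start of buf */
    char *out = dst;
    size_t outleft = dstcap;

    assert(chunk > 0);
    while (srclen > 0) {
        size_t n = srclen < chunk ? srclen : chunk;
        if (have + n > sizeof buf) {
            errno = EINVAL;     /* sketch only: carry buffer too small */
            return (size_t)-1;
        }
        memcpy(buf + have, src, n);
        src += n;
        srclen -= n;
        have += n;

        char *in = buf;
        size_t inleft = have;
        if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1
            && errno != EINVAL)
            return (size_t)-1;  /* EILSEQ or E2BIG */

        /* iconv() left `in` at the exact point where it stopped. */
        memmove(buf, in, inleft);
        have = inleft;
    }
    if (have > 0) {             /* input ended inside a sequence */
        errno = EINVAL;
        return (size_t)-1;
    }
    /* Flush: return the descriptor to its initial shift state. */
    if (iconv(cd, NULL, NULL, &out, &outleft) == (size_t)-1)
        return (size_t)-1;
    return dstcap - outleft;
}
```

For example, the 8-byte ISO-2022-JP input ESC $ B 0x24 0x22 ESC ( B fed with chunk set to 4 splits the two-byte character across calls. The first call consumes the escape sequence and stops at the lone 0x24; the second call completes the character using the shift state remembered in the descriptor.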
Comment 4 Ondrej Bilka 2013-10-09 07:50:21 IST
Ah, then it is OK. Could you write a patch?