Bug 19404 - Treatment of combining characters by iconv not documented well
Status: NEW
Alias: None
Product: glibc
Classification: Unclassified
Component: manual
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-12-27 11:47 UTC by Gavin Smith
Modified: 2015-12-27 11:47 UTC
CC List: 2 users

See Also:
Host:
Target:
Build:
Last reconfirmed:


Description Gavin Smith 2015-12-27 11:47:09 UTC
In some character encodings, such as cp-1255, there are combining characters that attach to the preceding character, for example to represent accents or vowel points.

When iconv processes input from such an encoding, it may hold back a character from the input until it sees whether the next character is a combining character that has to be combined with it. At the end of the input, it is therefore necessary to call iconv with null input arguments to flush the last character.
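
To make the extra call concrete, here is a minimal sketch of a conversion followed by the flush call. The helper name convert_and_flush is hypothetical and not part of glibc. It assumes the whole input is available at once and that the caller supplies a large enough output buffer.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper, not a glibc function: convert INLEN bytes at IN
   using CD, writing at most OUTCAP bytes to OUT, then flush whatever
   iconv is still holding back.  Returns the number of bytes written,
   or (size_t) -1 on error (errno is set by iconv).  */
size_t
convert_and_flush (iconv_t cd, const char *in, size_t inlen,
                   char *out, size_t outcap)
{
  char *inp = (char *) in;      /* iconv takes char **, hence the cast */
  char *outp = out;
  size_t outleft = outcap;

  /* Ordinary conversion call.  Success here does not guarantee that
     the last input character has reached OUT yet.  */
  if (iconv (cd, &inp, &inlen, &outp, &outleft) == (size_t) -1)
    return (size_t) -1;

  /* Flush call: null input arguments, but a valid output buffer.
     This writes any character still kept in the conversion state
     (and any reset sequence the target encoding needs).  */
  if (iconv (cd, NULL, NULL, &outp, &outleft) == (size_t) -1)
    return (size_t) -1;

  return (size_t) (outp - out);
}
```

A caller would obtain cd from iconv_open and release it with iconv_close. For input that arrives in pieces, the flush call belongs only after the final piece.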

The manual (in manual/charset.texi, node Generic Conversion Interface) doesn't document this well. It says the following:

"If INBUF is a null pointer, the `iconv' function performs the
     necessary action to put the state of the conversion into the
     initial state.

...

"Therefore an `iconv' call to reset the state should always
     be performed if some protocol requires this for the output text"


This does not obviously apply to combining characters. Here, every non-combining graphical character is simultaneously a shift character and not a shift character: it acts as a shift character when a combining character comes after it, and as an ordinary character when no combining character follows or when it occurs at the end of the input. This is not what people have in mind when they read about "shift sequences". The manual explains that the shift state is reset for the output, but not that graphical characters may still be waiting to be output.

Moreover, the following in the manual is misleading:

"If all input from the input buffer is successfully converted and
     stored in the output buffer, the function returns the number of
     non-reversible conversions performed."

This is not true in general: iconv can return a non-error value (zero or positive) while a character from the input is still held in the iconv state and has not been stored in the output buffer.
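
One way to observe this is to compare the output position before and after the flush call. The sketch below uses a hypothetical helper, bytes_held_back, with an arbitrary 64-byte buffer. It converts to UTF-8, which needs no reset sequence, so any bytes that appear only after the flush are characters that were held in the state.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper for illustration: convert IN (INLEN bytes in
   encoding FROM) to UTF-8 and return how many output bytes appear
   only after the flush call.  Returns (size_t) -1 if any step fails.
   The 64-byte buffer is an arbitrary limit chosen for this sketch.  */
size_t
bytes_held_back (const char *from, const char *in, size_t inlen)
{
  char out[64];
  char *inp = (char *) in;
  char *outp = out;
  size_t outleft = sizeof out;
  size_t result = (size_t) -1;

  iconv_t cd = iconv_open ("UTF-8", from);
  if (cd == (iconv_t) -1)
    return result;

  /* This call may report success even though the last character
     has not been written to OUT.  */
  if (iconv (cd, &inp, &inlen, &outp, &outleft) != (size_t) -1)
    {
      char *before_flush = outp;

      /* Flush with null input arguments and measure what it adds.  */
      if (iconv (cd, NULL, NULL, &outp, &outleft) != (size_t) -1)
        result = (size_t) (outp - before_flush);
    }

  iconv_close (cd);
  return result;
}
```

For an encoding without combining characters the result should be 0. Given the behaviour described in this report, a cp-1255 input ending in a base character would be expected to give a nonzero result. That case has not been verified here, and it depends on the converter being installed.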

The extra call to iconv was missing for wget (see http://lists.gnu.org/archive/html/bug-wget/2015-12/msg00110.html) and info (see https://lists.gnu.org/archive/html/bug-texinfo/2015-12/msg00010.html).