In glibc 2.19 and glibc 2.22, I observe that combining //translit with //ignore with -c with an unconvertible sequence will hang the iconv program. The order of //translit and //ignore is significant. The target charset is of no significance; it could also be -t utf-8//translit//ignore.
echo -en '\x80' | iconv -f us-ascii -t us-ascii//translit//ignore -c
Thanks for reporting this.
I tested a glibc based on 2.12 upstream, without any of the gconv fixes from the last years, and the problem is still there. This does not look like a regression.
echo -en "\x0e\x0e" | /usr/bin/iconv -c -f IBM1364
We received an independent report of this issue, and I think we should treat this as a minor security bug.
Created attachment 9875 [details]
attempt to fix CVE-2016-10228
This is my attempt to fix that bug by advancing the buffer whenever an illegal character is detected by one byte. While this does fix the specific examples in this bug report I am not sure whether it is a good idea since I am not particularly deep in that code.
Concerns that come to my mind and should be commented upon by someone that is more familiar with that code:
1. Is the place I fixed here the only place that needs fixing or did I probably miss other places.
2. Is advancing by one byte generally a good idea? Some character sets operate on multiple byte characters. Do we need to consider this here and probably advance with the character size instead? Is there information available in some data structure about how many bytes we would need to advance?
Created attachment 9876 [details]
Add missing brackets.
Sorry, the indention style tricked me into missing the brackets before.
>2. Is advancing by one byte generally a good idea?
In case where each codepoint is known to be encoded as the same number of bytes, e.g. 4 octets in UTF-32, I would expect it to advance by that stride.
echo -en '\xff\xff\xff\xff' | iconv -f utf-32 -t utf-32//translit//ignore -c
(In reply to Jan Engelhardt from comment #5)
> >2. Is advancing by one byte generally a good idea?
> In case where each codepoint is known to be encoded as the same number of
> bytes, e.g. 4 octets in UTF-32, I would expect it to advance by that stride.
> echo -en '\xff\xff\xff\xff' | iconv -f utf-32 -t utf-32//translit//ignore -c
Exactly such a case I had in mind with my concern. Thus next task is to find a way to determine within that context the size of each code point in bytes. This might probably be a bit difficult given that this knowledge seems to be totally encapsulated within the iconv function call.
Ok, I experimented quite a bit to figure out the minimal number of bytes used to encode a code point but it seems I didn't really fully understand the underlying data structure and so far didn't find a working solution for all cases I tested.
Therefore for my own purposes I left the solution advancing one byte, leaving the risk that for some encodings the output can contain even more garbage than what was found in the input, following the concept "garbage in, garbage out". At least it still avoids the tool to hang in those cases.
I leave it to the experts of this code area to improve this fix or provide me a hint, where in the data structures I can find the information about what is the smallest number of bytes for one code point in the input data encoding.
I thought I'd chime in since I believe the independent report mentioned by Florian was from me.
Although it's not a generic fix for the underlying issue for a class of character sets that use shift in/shift out sequences (e.g. IBM930 and IBM933) some of them work as expected and some don't (I can't remember which one did ignore a duplicate shift in or out). For IBM930 the BODY macro looks in part like:
#define BODY \
uint32_t ch = *inptr; \
uint32_t res; \
if (__builtin_expect (ch, 0) == SO) \
/* Shift OUT, change to DBCS converter. */ \
if (curcs == db) \
result = __GCONV_ILLEGAL_INPUT; \
curcs = db; \
if this line:
if (curcs == db) \
was changed to:
if (curcs == db && (! ignore_errors_p ())) \
Then a duplicate shift out sequence in IBM930 would not cause the iconv command to loop forever with the -c option (the shift in code needs the same change).
But in general the issue could be resolved up by the code that does the conversions not returning an error when being told ignore errors. In the above example while it appears to be illegal to have duplicate shift in/out sequences when ignoring errors the duplicate should probably just be consumed and ignored.
Should there be a general rule that the character set conversion macros must do something with their input when faced with something invalid and they've been told to ignore errors? Then iconv doesn't need to know anything when set to ignore errors and it shouldn't loop.
For non-variable size multi-byte character sets the converter at least it knows based on the character set the number of bytes it should consume so it's easier for it to skip past any error (e.g. IBM930 knows if it should skip one or two bytes based on the current shift state).
I thought of these general rules when errors are ignored:
1. Character set converters using a fixed or know number of bytes will skip forward one character when encountering an invalid character (includes character sets that are always the one fixed size and character sets using shift states between different size fixed characters)
2. Invalid structural data will be consumed (e.g. duplicate shift in/out states in character sets like IBM930)
3. Variable size character sets will consume everything in from the start of a character up to and including the byte that made the character invalid (e.g. in SJIS if the sequence of bytes 0x81 0x20 was seen both bytes would be consumed not just the 0x81 but that could instead easily be advance one byte and start again).
Those rules are entirely open to discussion though - I've probably missed some conditions that need to be covered. The downside would be that there are a lot of macros to review to make sure that they would all follow a defined set of rules and may it change their behaviour.
Anything following a conversion failure should really be considered undefined anyway (as someone else said garbage in garbage out) so how bad data gets treated can follow set rules as long as they are followed consistently by everything.
Does it make more sense that the character set converters should decide what to drop, skip or quitely consume when being told to ignore errors vs doing it somewhere that doesn't really know much about any particular character set?
Sorry I should add that the IBM930 example was just that (an example). That's an illustration of a downstream issue not upstream since that code has been changed for that conversion to ignore duplicate shift in shift out characters. The same general class of issues exists in other things though. Should the conversion routines be allowed to return errors when told to ignore them (or if they do they can't return with the input pointer unchanged so forward progress can always be guaranteed).
Perhaps we should do the following:
(1) -c should add //IGNORE to the conversions if not already there.
(2) The conversion functions should be adjusted to make progress in //IGNORE. See bug 13541.
(3) If the iconv function does not make progress, skip over individual input bytes. With (2), this is only used as a last resort, to avoid a hang.
We can probably implement (1) and (3) for this bug. (2) is mostly a charset-specific quality issue after that. Even then, the scope of this bug is a bit larger than what I expected.
> echo -en '\x80' | iconv -f us-ascii -t us-ascii//translit//ignore -c
> echo -en "\x0e\x0e" | /usr/bin/iconv -c -f IBM1364
I have not investigated the second reproducer, but I am quite sure that it is completely unrelated to the first one.
The first one is caused by incorrect parsing of the "//" marked suffixes (//translit and //ignore). The problem is that the second suffix is being dropped when parsing. This is why the order matters. The case reported, i.e.:
$ echo -en '\x80' | iconv -f us-ascii -t us-ascii//translit//ignore -c
is one where the "//translit" suffix is actually irrelevant because the input does not have characters that can/should be transliterated, but its presence as the first suffix leads to the actually relevant "//ignore" suffix (0x80 is not valid ASCII) being *ignored* by iconv, leading to a hang.
This is why, these four similar cases:
$ echo -en '\x80' | iconv -f us-ascii -t us-ascii//ignore//translit -c
$ echo -en '\x80' | iconv -f us-ascii -t us-ascii//ignore -c
$ echo -en '\x80' | iconv -f us-ascii -t us-ascii -c
$ echo -en '\x80' | iconv -f us-ascii -t us-ascii//translit -c
do not hang. The '//ignore' suffix being first (or implied via -c) gets picked up by the program and it correctly outputs nothing, and the "//translit", if present, is dropped (incorrectly) - but this has no effect for this particular input on the program's running because like I mentioned earlier: there is nothing to be transliterated in the input. Note that '-c' actually appends the "//ignore" to the target encoding suffix if no suffixes are present, and ",ignore" if one or more suffixes are already present. If two suffixes are passed during input, then the ',ignore' appended because of '-c' will get dropped along with the second suffix, which may or may not lead to hang depending on whether the first suffix is //ignore.
The hang in case of a missing "//ignore" suffix (well, not really missing here but ignored by the program) afterwards during execution is a separate problem (that I haven't investigated).
I'm going to write a patch to fix the problem with the parsing at the right place(s), and then tackle the hangs separately.
I quickly want to point out that passing an input character that requires transliteration and then passing //translit as the second suffix also leads to erroneous program behaviour (it reports an error instead of performing transliteration). Some test cases:
Exits successfully with 'A' written to stdout (because //translit is the only or first suffix):
$ echo -en '\xc3\x81' | iconv -f utf8 -t "us-ascii//translit"
$ echo -en '\xc3\x81' | iconv -f utf8 -t "us-ascii//translit//ignore"
No transliteration and corresponding "illegal input sequence" error thrown, even though //translit is passed, but it is incorrectly ignored by the program because it is not the first suffix:
$ echo -en '\xc3\x81' | iconv -f utf8 -t "us-ascii//ignore//translit"
Exits without error messages but (incorrectly) writes nothing to stdout. This is because 'translit' got ignored (2nd suffix), but '-c' leads to silently dropping the input character that wasn't transliterated:
$ echo -en '\xc3\x81' | iconv -f utf8 -t "us-ascii//ignore//translit" -c
Created attachment 11786 [details]
Test to perform every possible 2-byte conversion from all charsets to UTF-8
I wrote a bash script that tests every possible 2-byte conversion from every supported charset to UTF-8. It tests with all possible combinations of "-c" "//TRANSLIT" and "//IGNORE" and takes tens of hours to run.
Created attachment 11787 [details]
Result of test run on Fedora 29 (release/2.28/master commit: 4a5e5882)
I ran the test on Fedora 29 (release/2.28/master commit: 4a5e5882) and found hangs in more than half the supported character sets when converting to UTF-8. Character sets listed as "OK" exhibited no hangs across the entire input domain. Character sets that hung weren't tested past the first hang.
I expect that a lot (but not all) of these hangs are just duplicates; i.e. they will vanish together with the same fix.