Bug 12701 - scanf accepts non-matching input
Summary: scanf accepts non-matching input
Status: REOPENED
Alias: None
Product: glibc
Classification: Unclassified
Component: stdio
Version: unspecified
Importance: P2 critical
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Duplicates: 1765 12437 22672 23274 33730
Depends on:
Blocks:
 
Reported: 2011-04-25 15:13 UTC by Rich Felker
Modified: 2025-12-22 17:59 UTC

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security+


Attachments
scanf test cases (2.05 KB, text/plain)
2012-04-18 05:21 UTC, Rich Felker

Description Rich Felker 2011-04-25 15:13:20 UTC
glibc's scanf function incorrectly handles cases where it reads a sequence of characters that is an initial subsequence of a matching sequence, but not actually a matching sequence, for the conversion specifier. Examples include:

sscanf("abc", "%4c", buf) returns 1 instead of 0 or EOF (not sure which is correct) and leaves no way for the caller to know buf[3] is unfilled.

sscanf("0xz", "%x%c", &x, &c) returns 2 instead of 0.

sscanf("1.0e+!", "%f%c", &x, &c) returns 2 instead of 0.

etc.
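
The three calls above can be reproduced with a small probe along the following lines. This is only a sketch: the names `scan_probe` and `run_scan_probe` are made up for illustration and are not part of glibc or of any attached test case. The values it records depend on the C library in use, so no expected output is shown.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative container for the three return values (hypothetical name). */
struct scan_probe {
    int ret_c4;    /* sscanf("abc", "%4c", ...) */
    int ret_hex;   /* sscanf("0xz", "%x%c", ...) */
    int ret_float; /* sscanf("1.0e+!", "%f%c", ...) */
};

/* Run the three calls from the report and record what sscanf returns. */
struct scan_probe run_scan_probe(void)
{
    struct scan_probe p;
    char buf[4];
    unsigned int x = 0;
    float f = 0.0f;
    char c = 0;

    memset(buf, 0, sizeof buf);   /* keep unfilled bytes well defined */
    p.ret_c4 = sscanf("abc", "%4c", buf);
    p.ret_hex = sscanf("0xz", "%x%c", &x, &c);
    p.ret_float = sscanf("1.0e+!", "%f%c", &f, &c);
    return p;
}
```

Printing the three fields of the returned struct shows how a given library classifies each input; according to this report, the second and third should be 0.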
Comment 1 Ulrich Drepper 2011-05-02 01:40:09 UTC
All of these cases are correctly handled.

scanf is badly designed, just don't use it if you cannot live with these results.
Comment 2 Rich Felker 2011-05-02 02:35:40 UTC
They are not correctly handled. Please refer to C99, 7.19.6.2, paragraph 9, which defines an input item as:

"the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence"

Paragraph 10 then reads:

"If the input item is not a matching sequence, the execution of the directive fails: this condition is a matching failure."

Clearly in the case of sscanf("0xz", "%x%c", &x, &c), the first "input item" is "0x", and it is not a matching sequence for the %x conversion (see the specification of strtoul, in terms of which scanf %x is specified), so the result must be a matching failure.
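
The definition quoted above can be made concrete with a small scanner. The following is a sketch under stated assumptions, not glibc code: `hex_input_item` is a hypothetical helper that models only the %x syntax (optional sign, optional 0x/0X prefix, hexadecimal digits) plus an optional field width, and it skips no leading white space.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: return the length of the %x input item at the
   start of s.  width == 0 means "no field width".  *matches is set to
   true only if the item is a complete matching sequence. */
size_t hex_input_item(const char *s, size_t width, bool *matches)
{
    size_t limit = width ? width : (size_t)-1;
    size_t i = 0;
    size_t digits = 0;

    if (i < limit && (s[i] == '+' || s[i] == '-'))
        i++;                      /* optional sign */
    if (i < limit && s[i] == '0') {
        i++;
        digits = 1;               /* a lone "0" is already a number */
        if (i < limit && (s[i] == 'x' || s[i] == 'X')) {
            i++;
            digits = 0;           /* "0x" alone is only a prefix */
        }
    }
    while (i < limit && isxdigit((unsigned char)s[i])) {
        i++;
        digits++;
    }
    *matches = digits > 0;
    return i;
}
```

For the input "0xz" this yields an item of length 2 ("0x") with `*matches` false, which is the matching failure described above; for "0x5z" it yields length 3 with `*matches` true.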

If you're going to wrongly mark this bug as "RESOLVED", at least mark it "WONTFIX" rather than "INVALID" and acknowledge that it's a bug that you're unwilling to fix, and that glibc is intentionally non-conformant in this matter.
Comment 3 Ulrich Drepper 2011-05-03 00:30:31 UTC
They are handled correctly.  You don't understand the limit of push backs.
Comment 4 Rich Felker 2011-05-03 00:40:26 UTC
Yes I understand pushbacks.

Scanning "0xz" for %x results in an input item of "0x" with "z" pushed back into the unread buffer. The bug has nothing to do with pushbacks, because the right data is pushed back. The bug is that a non-matching input item is treated as a match rather than a matching error.

Perhaps you thought I was saying the input item should be "0", successfully converted, with "x" as the next unread character in the buffer. Of course this is wrong and I do not believe such a thing.
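
The pushback point can be checked on a real stream. The sketch below assumes a POSIX system, since it uses `fmemopen` to read from a string; `next_char_after_hex` is a hypothetical helper name. It reports both what `fscanf` returned and which character is left unread afterwards.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for fmemopen (POSIX.1-2008) */
#endif
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: scan one %x from an in-memory stream, store the
   fscanf return value in *ret, and return the next unread character
   (EOF if the stream could not be opened or is exhausted). */
int next_char_after_hex(const char *input, int *ret)
{
    unsigned int x = 0;
    int c;
    FILE *f = fmemopen((void *)input, strlen(input), "r");

    *ret = EOF;
    if (f == NULL)
        return EOF;
    *ret = fscanf(f, "%x", &x);
    c = getc(f);                  /* whatever the scan left behind */
    fclose(f);
    return c;
}
```

With the input "0xz" the next unread character is 'z' whether the library reports a match or a matching failure, so only the value stored in `*ret` distinguishes the two behaviors.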

Perhaps you should try reading the actual language standard rather than assuming you're right.
Comment 5 Ulrich Drepper 2011-05-03 01:13:29 UTC
(In reply to comment #4)
> Yes I understand pushbacks.

You apparently don't.   This is no place to get a free education.

Don't reopen the bug, there will be no change.
Comment 6 Rich Felker 2011-05-03 02:22:50 UTC
OK if you insist that I don't reopen it, I'm fixing the resolution to "WONTFIX".
Comment 7 Rich Felker 2011-09-25 04:42:31 UTC
Reopening since I found a statement from an official source (Fred J. Tydeman, Vice-chair of PL22.11) that the glibc behavior is incorrect:

http://newsgroups.derkeiler.com/Archive/Comp/comp.std.c/2009-09/msg00045.html

Sorry I don't have a better newsgroup archive link.
Comment 8 Ulrich Drepper 2011-10-29 17:14:29 UTC
What on earth are you talking about.  Fred said exactly the same: 0xz causes the z to be rejected for the %x and therefore used for the %c.  Stop wasting my time.
Comment 9 Rich Felker 2011-10-29 21:24:08 UTC
Apparently you only read the first quoted paragraph and not the second:

> > - the input item "0x" is not a matching sequence, so the execution of
> > the whole directive fails;
> 
> Correct

What part of "the execution of the whole directive fails" are you not understanding? When a directive fails, scanf stops and returns the number of directives successfully converted and stored. This number is zero, not two. The %c is never processed. glibc is wrong. Please fix it.
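
The stopping rule itself can be seen with an input on which implementations are not in dispute. In this sketch, `count_before_failure` is a hypothetical helper: the second %d fails on "abc", so the %c after it is never processed.

```c
#include <stdio.h>

/* Hypothetical helper: return sscanf's result for "%d %d %c" on input,
   and report through *c_touched whether the %c directive stored anything. */
int count_before_failure(const char *input, int *c_touched)
{
    int a = 0, b = 0;
    char c = '\0';
    int ret = sscanf(input, "%d %d %c", &a, &b, &c);

    *c_touched = (c != '\0');     /* c keeps its initial value if %c never ran */
    return ret;
}
```

For "12 abc" the function returns 1 and leaves `*c_touched` at 0; for "12 34 z" it returns 3.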

If you insist on keeping compatibility with hypothetical existing binaries that depend on the wrong behavior, that's what glibc has symbol versioning for...
Comment 10 Ulrich Drepper 2011-10-29 21:36:38 UTC
The behavior is correct and wanted.  Now stop wasting people's time.
Comment 11 Rich Felker 2011-10-30 05:42:51 UTC
Fred Tydeman (vice chair of PL22.11/J11) has stated as clearly and directly that the current glibc behavior is NOT correct. Whether it's wanted is a more subjective question, but I have not seen anyone but yourself who wants scanf to behave incorrectly in this manner. Please fix this bug.
Comment 12 Rich Felker 2012-03-17 20:39:24 UTC
Ping. Would somebody other than Mr. Drepper be willing to review this bug report?
Comment 13 Joseph Myers 2012-03-18 14:28:19 UTC
This bug report appears to be correct, and the erroneous behavior described still present with current glibc (tested x86_64).
Comment 14 Rich Felker 2012-04-18 05:21:24 UTC
Created attachment 6345 [details]
scanf test cases

I recently wrote a set of test cases for verifying my scanf implementation, and running it against glibc reproduces A LOT of instances of this bug... See attached test program.
Comment 15 paxdiablo 2012-11-26 08:26:04 UTC
I think this bug report is correct, at least in relation to the '%x/0xz' sample.

There's a big difference between an input item, which *may* be an initial subset of a properly scanned directive, and the *properly scanned directive* itself.

Pushback controls how far you can back up the "input stream pointer" and is the reason why scanf is usually not used by professionals, who prefer a fgets/sscanf combo so they can back up to the start of the line themselves. However, the pushback is only relevant here in that context. The failure of '0x' when scanning '%x' will not be able to push back all the way to the '0' because of this limitation.

The function call sscanf ("a0xz", "%c%x%c") should return 1, not 3.

The controlling part of the standard is the bit dealing with the 'x' directive itself:

=====
Matches an optionally signed hexadecimal integer, whose format is the same as expected for the subject sequence of the strtoul function with the value 16 for the base argument.
=====

The strtoul stuff states:

=====
If the value of base is zero, the expected form of the subject sequence is that of an integer constant as described in 6.4.4.1, optionally preceded by a plus or minus sign, but not including an integer suffix. If the value of base is between 2 and 36 (inclusive), the expected form of the subject sequence is a sequence of letters and digits representing an integer with the radix specified by base, optionally preceded by a plus or minus sign, but not including an integer suffix. The letters from a (or A) through z (or Z) are ascribed the values 10 through 35; only letters and digits whose ascribed values are less than that of base are permitted. If the value of base is 16, the characters 0x or 0X may optionally precede the sequence of letters and digits, following the sign if present.
=====

The controlling part there would be "a sequence of letters and digits representing an integer" - you may argue that such a sequence may consist of zero characters but I don't think anyone in their right mind would suggest that definition represented an integer. In any case, the '0x' string fails on strtoul:

    char *x;
    unsigned long rc = 42;
    rc = strtoul ("0x", &x, 16);
    printf ("%lu [%s]\n", rc, x);
produces:

    0 [0x]
So even though rc is set to 0, the fact that the pointer points to the first bad character means that the '0x' itself is not a valid hex number.

Putting in '0x5' as the string gives you:

    5 []
so that the first bad character is the end of the string (ie, there WERE no bad characters).
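
The same end-pointer check can be wrapped up for callers who want to reject a bare "0x" themselves. This is a minimal sketch, and `parse_hex_strict` is a hypothetical name; note that strtoul still skips leading white space and accepts a sign.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: accept s only if strtoul consumes all of it as a
   base-16 number; a bare "0x" or any trailing junk is rejected. */
bool parse_hex_strict(const char *s, unsigned long *out)
{
    char *end;
    unsigned long v;

    errno = 0;
    v = strtoul(s, &end, 16);
    if (end == s || *end != '\0' || errno == ERANGE)
        return false;             /* nothing parsed, leftovers, or overflow */
    *out = v;
    return true;
}
```

Here "0x5" and "ff" are accepted, while "0x", "0xz" and the empty string are rejected, regardless of where exactly the library leaves the end pointer for "0x".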
Comment 16 Florian Weimer 2014-06-13 14:54:27 UTC
(In reply to Rich Felker from comment #0)

> sscanf("abc", "%4c", buf) returns 1 instead of 0 or EOF (not sure which is
> correct) and leaves no way for the caller to know buf[3] is unfilled.

So this is an information leak.
Comment 17 Florian Weimer 2014-06-27 14:00:25 UTC
*** Bug 12437 has been marked as a duplicate of this bug. ***
Comment 18 Florian Weimer 2018-01-04 12:17:34 UTC
*** Bug 22672 has been marked as a duplicate of this bug. ***
Comment 19 Joseph Myers 2018-06-12 11:15:17 UTC
*** Bug 23274 has been marked as a duplicate of this bug. ***
Comment 20 Joseph Myers 2018-06-12 11:16:01 UTC
*** Bug 1765 has been marked as a duplicate of this bug. ***
Comment 24 Vincent Lefèvre 2023-07-18 12:01:35 UTC
Note that scanf also accepts "nan(" while it shouldn't (because "nan()" is valid), but for a different reason. See bug 30647 for issues related to scanf with nan.
Comment 25 Vincent Lefèvre 2024-07-19 14:11:40 UTC
(In reply to Rich Felker from comment #7)
> Reopening since I found a statement from an official source (Fred J.
> Tydeman, Vice-chair of PL22.11) that the glibc behavior is incorrect:
> 
> http://newsgroups.derkeiler.com/Archive/Comp/comp.std.c/2009-09/msg00045.html

It no longer exists, but here's another link:

https://comp.std.c.narkive.com/7JdevQ08/fscanf-strtol-and-the-parsing-of-numbers
Comment 26 Avinal Kumar 2024-10-29 07:55:06 UTC
I was able to fix scanf behavior on NaN in bug 30647 and have been looking into this issue for a few weeks now. The opinion seems divided, but I would like to get a final nod before I start working on it.

To put everything in simple words:

- The input should always match the given specifier, that is:
 - For "%[width]specifier", the input must be wide enough, or it is a failure.
 - For "%x" or another specifier whose input may carry a prefix such as 0x, the input must match fully: 0x0 is valid, 0x is not.
- To support the width requirement, should the input "0x123" fail on the "%2x" specifier, or is the format wrong?

Other questions:
- Would fixing this bug break existing applications that might be depending on this buggy behavior?
Comment 27 Florian Weimer 2024-10-29 08:19:39 UTC
(In reply to Avinal Kumar from comment #26)
> Other questions:
> - Would fixing this bug break existing applications that might be depending
> on this buggy behavior?

We already have scanf variants per C standard. One possibility is to make the change for recently added variants only, so that they impact recently built applications only (which are presumably more easily fixed than historic applications).
Comment 28 Andreas Schwab 2024-10-29 09:01:18 UTC
The field width is a maximum, not a minimum.  If the input stream ends before width characters have been read, those characters read so far become the input item.  If the width is 2 then the input item can only consist of one or two characters.  If the input stream contains 0x123 and the directive is %2x then the input item is 0x and this becomes a matching failure.
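
That the width is a maximum can be seen with plain decimal input, where no prefix is involved. In this sketch, `split_with_width` is a hypothetical helper around the format "%2d%d".

```c
#include <stdio.h>

/* Hypothetical helper: split input with "%2d%d" and return sscanf's result.
   The first conversion takes at most two characters; the second takes
   whatever digits follow. */
int split_with_width(const char *input, int *first, int *rest)
{
    return sscanf(input, "%2d%d", first, rest);
}
```

For "12345" this returns 2 with `*first` set to 12 and `*rest` set to 345. For "7" the input ends after one character, which is still a valid item for %2d, so the call returns 1 with `*first` set to 7.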
Comment 29 Joseph Myers 2024-10-29 16:50:21 UTC
I think this is simply a bug that should be fixed as such (for all standard versions), not something applications are at all likely to be relying on.

(We'll need to add more strtol/scanf versions for C2Y because of 0o / 0O octal input, but I'd expect that, and any other incompatible changes in C2Y scanf in future, to be the only difference in how those new versions behave.)
Comment 30 Maciej W. Rozycki 2024-12-03 12:26:34 UTC
(In reply to Rich Felker from comment #0)
> sscanf("abc", "%4c", buf) returns 1 instead of 0 or EOF (not sure which is
> correct) and leaves no way for the caller to know buf[3] is unfilled.
FYI the way to determine that for "%c" is via "%n", e.g.:

    int count;
    sscanf("abc", "%4c%n", buf, &count);

and you'll get 3 in "count" as expected, so you know buf[3] has been left
untouched.  Though I also read the standard as requiring this case to be
a matching failure.
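
A caller that wants to be safe under either behavior could combine the "%n" check with the return value, roughly as follows. This is a sketch only; `read_fixed4` is a hypothetical helper, and clearing the buffer first is an added precaution against the stale-data concern raised in comment 16.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read up to four characters with "%4c" and return how
   many bytes of buf (which must hold at least 4 bytes) were actually filled. */
int read_fixed4(const char *input, char *buf)
{
    int count = -1;               /* stays -1 if %n is never reached */
    int ret;

    memset(buf, 0, 4);            /* avoid exposing stale bytes on a short read */
    ret = sscanf(input, "%4c%n", buf, &count);
    if (ret != 1 || count < 0)
        return 0;                 /* matching or input failure: trust nothing */
    return count;
}
```

With "abcd" the helper returns 4. With "abc" it returns fewer than 4 whether the library accepts the short input or reports a failure, so the caller never treats buf[3] as filled.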
Comment 31 Sourceware Commits 2025-03-25 10:21:26 UTC
The master branch has been updated by Maciej W. Rozycki <macro@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d1a621b735247ba0f7bf288e35a1b172cb6803f6

commit d1a621b735247ba0f7bf288e35a1b172cb6803f6
Author: Maciej W. Rozycki <macro@redhat.com>
Date:   Tue Mar 25 09:40:20 2025 +0000

    stdio-common: Add tests for formatted scanf input specifiers
    
    Add a collection of tests for formatted scanf input specifiers covering
    the b, d, i, o, u, x, and X integer conversions, the a, A, e, E, f, F,
    g, and G floating-point conversions, and the [, c, and s character
    conversions.  Also the hh, h, l, and ll length modifiers are covered
    with the integer conversions as are the l and L length modifier with the
    floating-point conversions.  The tests cover assignment suppressing and
    the field width as well, verifying the number of assignments made, the
    number of characters consumed and the value assigned.
    
    Add the common test code here as well as test cases for scanf, and then
    base Makefile infrastructure plus target-agnostic input data, for the
    character conversions and the `char', `short', and `long long' integer
    ones, signed and unsigned, with remaining input data and other functions
    from the scanf family deferred to subsequent additions.
    
    Keep input data disabled and referring to BZ #12701 for entries that are
    currently incorrectly accepted as valid data, such as '0b' or '0x' with
    the relevant integer conversions or sequences of an insufficient number
    of characters with the c conversion.
    
    Reviewed-by: Joseph Myers <josmyers@redhat.com>
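
The three quantities the commit message says are verified (assignments made, characters consumed, value assigned) can be observed from user code roughly as below. This is not the glibc test harness; `check_int_scan` is a hypothetical helper limited to "%d", with "%n" used to count consumed characters.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: run "%d%n" on input and compare the number of
   assignments, the characters consumed and the value with expectations. */
bool check_int_scan(const char *input, int want_ret,
                    int want_consumed, int want_value)
{
    int value = 0;
    int consumed = -1;
    int ret = sscanf(input, "%d%n", &value, &consumed);

    if (ret != want_ret)
        return false;
    if (ret != 1)
        return true;              /* no assignment made, nothing more to compare */
    return consumed == want_consumed && value == want_value;
}
```

For example, `check_int_scan("42 rest", 1, 2, 42)` holds, and `check_int_scan("x", 0, 0, 0)` holds because the matching failure leaves nothing else to compare.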
Comment 32 Sourceware Commits 2025-03-25 10:21:31 UTC
The master branch has been updated by Maciej W. Rozycki <macro@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d7584e4d367ccb281ecf68980995e9b5ca0aff46

commit d7584e4d367ccb281ecf68980995e9b5ca0aff46
Author: Maciej W. Rozycki <macro@redhat.com>
Date:   Tue Mar 25 09:40:20 2025 +0000

    stdio-common: Add scanf integer data for ILP32 targets
    
    Add Makefile infrastructure and `int' and `long' integer input data,
    signed and unsigned, for ILP32 targets.
    
    While the size of `int' data is the same between ILP32 and LP64 targets,
    resulting scanf output is different between them for out of range input
    data and while ISO C and POSIX both say that the behavior is undefined
    if the result of the conversion cannot be represented we want to keep
    track of our output to prevent inadvertent changes.  Hence the use of
    distinct `int' integer input data between ILP32 and LP64 targets.
    
    Keep input data disabled and referring to BZ #12701 for entries that are
    currently incorrectly accepted as valid data, such as '0b' or '0x'.
    
    Reviewed-by: Joseph Myers <josmyers@redhat.com>
Comment 33 Sourceware Commits 2025-03-25 10:21:36 UTC
The master branch has been updated by Maciej W. Rozycki <macro@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=a2bacea239c1780b20a1b23a9c3c836ef61c6172

commit a2bacea239c1780b20a1b23a9c3c836ef61c6172
Author: Maciej W. Rozycki <macro@redhat.com>
Date:   Tue Mar 25 09:40:20 2025 +0000

    stdio-common: Add scanf integer data for LP64 targets
    
    Add Makefile infrastructure and `int' and `long' integer input data,
    signed and unsigned, for LP64 targets.
    
    While the size of `int' data is the same between ILP32 and LP64 targets,
    resulting scanf output is different between them for out of range input
    data and while ISO C and POSIX both say that the behavior is undefined
    if the result of the conversion cannot be represented we want to keep
    track of our output to prevent inadvertent changes.  Hence the use of
    distinct `int' integer input data between ILP32 and LP64 targets.
    
    Keep input data disabled and referring to BZ #12701 for entries that are
    currently incorrectly accepted as valid data, such as '0b' or '0x'.
    
    Reviewed-by: Joseph Myers <josmyers@redhat.com>
Comment 34 Sourceware Commits 2025-03-25 10:21:42 UTC
The master branch has been updated by Maciej W. Rozycki <macro@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=26df22636d5876352cbd53b8662173e461e1e220

commit 26df22636d5876352cbd53b8662173e461e1e220
Author: Maciej W. Rozycki <macro@redhat.com>
Date:   Tue Mar 25 09:40:20 2025 +0000

    stdio-common: Add scanf float data for IEEE 754 binary32 format
    
    Add Makefile infrastructure and `float' real input data for targets
    using the IEEE 754 binary32 format.
    
    Keep input data disabled and referring to BZ #12701 for entries that are
    currently incorrectly accepted as valid data, such as '0e', '0e+',
    '0x', '0x8p', '0x0p-', etc.
    
    Reviewed-by: Joseph Myers <josmyers@redhat.com>
Comment 35 Sourceware Commits 2025-03-25 10:21:47 UTC
The master branch has been updated by Maciej W. Rozycki <macro@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0b311614395586608b5433dc8151e098d1906446

commit 0b311614395586608b5433dc8151e098d1906446
Author: Maciej W. Rozycki <macro@redhat.com>
Date:   Tue Mar 25 09:40:20 2025 +0000

    stdio-common: Add scanf double data for IEEE 754 binary64 format
    
    Add Makefile infrastructure and `double' real input data for targets
    using the IEEE 754 binary64 format.
    
    Keep input data disabled and referring to BZ #12701 for entries that are
    currently incorrectly accepted as valid data, such as '0e', '0e+',
    '0x', '0x8p', '0x0p-', etc.
    
    Reviewed-by: Joseph Myers <josmyers@redhat.com>
Comment 36 Sourceware Commits 2025-03-25 10:21:53 UTC
The master branch has been updated by Maciej W. Rozycki <macro@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=1890e63c86ceb04a49a914dc2cafa9862e938ef6

commit 1890e63c86ceb04a49a914dc2cafa9862e938ef6
Author: Maciej W. Rozycki <macro@redhat.com>
Date:   Tue Mar 25 09:40:20 2025 +0000

    stdio-common: Add scanf long double data for IEEE 754 binary128 format
    
    Add Makefile infrastructure and `long double' real input data for
    targets using the IEEE 754 binary128 format.
    
    Keep input data disabled and referring to BZ #12701 for entries that are
    currently incorrectly accepted as valid data, such as '0e', '0e+',
    '0x', '0x8p', '0x0p-', etc.
    
    Reviewed-by: Joseph Myers <josmyers@redhat.com>
Comment 37 Sourceware Commits 2025-03-25 10:22:00 UTC
The master branch has been updated by Maciej W. Rozycki <macro@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=771cda3c9cbbfc33a1a337d964e7749b245dec38

commit 771cda3c9cbbfc33a1a337d964e7749b245dec38
Author: Maciej W. Rozycki <macro@redhat.com>
Date:   Tue Mar 25 09:40:20 2025 +0000

    stdio-common: Add scanf long double data for IEEE 754 binary64 format
    
    Add Makefile infrastructure and 64-bit `long double' real input data for
    targets switching between the IEEE 754 binary64 and IEEE 754 binary128
    formats with `-mlong-double-64' and `-mlong-double-128'.  Use modified
    output file names for the IEEE 754 binary64 format so as not to clash
    with the names used for IEEE 754 binary128 format tests made with common
    rules for the 'long double' data type.
    
    Keep input data disabled and referring to BZ #12701 for entries that are
    currently incorrectly accepted as valid data, such as '0e', '0e+',
    '0x', '0x8p', '0x0p-', etc.
    
    Reviewed-by: Joseph Myers <josmyers@redhat.com>
Comment 38 Sourceware Commits 2025-03-25 10:22:05 UTC
The master branch has been updated by Maciej W. Rozycki <macro@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=4bea073069e9e457258d082786297a867593d05b

commit 4bea073069e9e457258d082786297a867593d05b
Author: Maciej W. Rozycki <macro@redhat.com>
Date:   Tue Mar 25 09:40:20 2025 +0000

    stdio-common: Add scanf long double data for IBM 128-bit format
    
    Add Makefile infrastructure and IBM 128-bit 'long double' real input for
    targets switching between the IEEE 754 binary128 and IBM 128-bit formats
    with '-mabi=ieeelongdouble' and '-mabi=ibmlongdouble'.  Reuse IEEE 754
    binary128 input data but with modified output file names so as not to
    clash with the names used for IBM 128-bit format tests made with common
    rules for the 'long double' data type.
    
    Keep input data disabled and referring to BZ #12701 for entries that are
    currently incorrectly accepted as valid data, such as '0e', '0e+',
    '0x', '0x8p', '0x0p-', etc.
    
    Reviewed-by: Joseph Myers <josmyers@redhat.com>
Comment 39 Sourceware Commits 2025-03-28 12:45:55 UTC
The master branch has been updated by Maciej W. Rozycki <macro@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=d527f34cb1d487a4788fe88278a9ad832c53c3ee

commit d527f34cb1d487a4788fe88278a9ad832c53c3ee
Author: Maciej W. Rozycki <macro@redhat.com>
Date:   Fri Mar 28 12:35:52 2025 +0000

    stdio-common: Add scanf long double data for Intel/Motorola 80-bit format
    
    Add Makefile infrastructure, a format-specific test skeleton providing a
    data comparison implementation that ignores bits of data representation
    in memory that do not participate in holding floating-point data, and
    `long double' real input data for targets using the Intel/Motorola
    80-bit format.
    
    Keep input data disabled and referring to BZ #12701 for entries that are
    currently incorrectly accepted as valid data, such as '0e', '0e+',
    '0x', '0x8p', '0x0p-', etc.
    
    Reviewed-by: Joseph Myers <josmyers@redhat.com>
Comment 40 Adhemerval Zanella 2025-12-22 17:59:20 UTC
*** Bug 33730 has been marked as a duplicate of this bug. ***