RFC: *scanf vs. overflow

Rich Felker dalias@libc.org
Sat May 23 01:16:16 GMT 2020


On Fri, May 22, 2020 at 03:59:14PM -0500, Eric Blake via Libc-alpha wrote:
> It has long been known that the C specification of *scanf() leaves
> behavior undefined for things like
> int i;
> sscanf("9999999999999999", "%i", &i);
> 
> C11 7.21.6.2 P12
> "Matches an optionally signed integer, whose format is the same as
> expected for the subject sequence of the strtol function with the
> value 0 for the base argument."
> C11 7.21.6.2 P10
> "If this object does not have an appropriate type, or if the result
> of the conversion cannot be represented in the object, the behavior
> is undefined."
> 
> as there is an overflow when consuming the input which matches the
> strtol subject sequence but does not fit in the width of an int.  On
> my Linux system, 'man sscanf' mentions that ERANGE might be set in
> such a case, but neither C nor POSIX actually requires this
> behavior; other likely behaviors are storing the value mod 2^32 into
> i, or storing INT_MAX into i, or ...
> 
> This is annoying - the only safe way to parse integers from
> untrustworthy sources, where overflow MUST be detected, is to
> manually open-code strtol() calls, which can get quite lengthy in
> comparison to the concise representations possible with *scanf.
> 
> Would glibc be willing to consider a GNU extension to add an
> optional flag character between '%' and the various numeric
> conversion specifiers (both integral based on strto*l, and floating
> point based on strtod), where we could force *scanf to treat numeric
> overflow as a matching failure, rather than undefined behavior?  Or
> even a second flag to request that *scanf stop consuming characters
> if the next character in input would cause overflow in the current
> specifier, leaving that character to instead be matched to the
> remainder of the format string?

Since conversion specifier forms outside the standard *also* have
undefined behavior, I see no advantage to defining that particular
undefined case vs just defining the result of the overflowing
conversion, unless you're worried the standard might later adopt a
conflicting definition. Neither way is amenable to configure detection
(without breaking cross compiling) without also adopting something
like my proposal on libc-coord:
https://www.openwall.com/lists/libc-coord/2020/04/22/1

BTW there is a portable only-somewhat-hideous way to do this with
sscanf: using assignment suppression combined with %n, then strtol,
etc. with the offsets produced by %n.
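
A rough sketch of that approach, for illustration only. The helper
name parse_int_checked and its return convention are made up for this
example. It uses %d and base 10 rather than %i so that the sequence
matched by sscanf and the one parsed by strtol agree, and it assumes
that a suppressed conversion (%*d) stores nothing and so does not hit
the "cannot be represented in the object" case:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse one decimal int at the start of s,
 * skipping leading whitespace. Returns 0 on success and stores the
 * value in *out and the number of bytes consumed in *consumed.
 * Returns -1 on a matching failure or if the value does not fit in int.
 */
static int parse_int_checked(const char *s, int *out, int *consumed)
{
    int start = -1, end = -1;
    char *stop;
    long v;

    assert(s && out && consumed);

    /* " " skips whitespace, the first %n records where the number
     * starts, %*d matches an integer without storing it, and the
     * second %n records where it ends. %n is not counted in the
     * return value, so the offsets are checked instead. */
    sscanf(s, " %n%*d%n", &start, &end);
    if (end < 0)
        return -1;          /* %*d did not match; second %n never ran */

    /* Redo the conversion with strtol, which reports overflow. */
    errno = 0;
    v = strtol(s + start, &stop, 10);
    if (stop != s + end)
        return -1;          /* sscanf and strtol disagree on the span */
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return -1;          /* out of range for long or for int */

    *out = (int)v;
    *consumed = end;
    return 0;
}
```

With this, parse_int_checked("  42 rest", &v, &n) should succeed with
v == 42 and n == 4, while "9999999999999999" is rejected instead of
storing some arbitrary value. A format with several fields would need
one pair of %n offsets per numeric field.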

> Let's suppose for argument's sake that we add '^' as a request to force
> overflow to be a matching error.  Then sscanf("9999999999999999",
> "%^i", &i) would be well-specified to return 0, rather than
> returning 1 with an unknown value assigned into i or any other
> behavior that other libc do with the undefined behavior when the ^
> is not present.
> 
> And if glibc likes the idea of such an extension, and we see an
> uptick in applications actually using it, I'd also be happy to
> champion the addition of such an extension in POSIX (but the POSIX
> folks will definitely want to see existing practice first - both an
> implementation and applications that use that implementation).  The
> libguestfs suite of programs is willing to be an early adopter, if
> glibc is willing to pursue adding such a safety valve.

I think it would be more useful to look for existing practice where
the UB blows up in horrible ways, and if there is none (if all
implementations behave somewhat reasonably) define the intersection of
their behaviors as standard and get rid of the UB here. A new feature
will not reliably be usable for decades in portable software, but new
documentation of existing universal practice would be immediately
usable.

Rich

