Bug 6530 - *printf() and incomplete multibyte sequences returns -1 bogusly
Summary: *printf() and incomplete multibyte sequences returns -1 bogusly
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: stdio (show other bugs)
Version: unspecified
: P2 normal
Target Milestone: 2.17
Assignee: Not yet assigned to anyone
URL:
Keywords:
: 16298 (view as bug list)
Depends on:
Blocks:
 
Reported: 2008-05-18 10:23 UTC by Pierre Habouzit
Modified: 2014-06-13 11:29 UTC (History)
8 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments
patch to fix the bug (726 bytes, patch)
2012-06-23 00:54 UTC, Rich Felker
Details | Diff
test conversion from UTF-8 and handling of incomplete characters (fails) (1.03 KB, patch)
2012-06-23 20:34 UTC, Jonathan Nieder
Details | Diff
Test swprintf's handling of incomplete multibyte characters (1.01 KB, patch)
2012-06-23 21:00 UTC, Jonathan Nieder
Details | Diff
Test sprintf's handling of incomplete multibyte characters (782 bytes, patch)
2012-06-25 01:00 UTC, Jonathan Nieder
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Pierre Habouzit 2008-05-18 10:23:18 UTC
With the following testcase, it happens while it shouldn't, according to
the manual:
-----8<-------
#include <stdio.h>
#include <locale.h>

#define STR "²éľÂíɱ²¡¶¾£¬ÖܺèµtÄúµÄ360²»×¨Òµ£¡"

int main(void) {
        char buf[200];
        setlocale(LC_ALL, "");
        printf("%d\n", snprintf(buf, 150, "%.50s", STR));
        return 0;
}
----->8-------

The manual page has this to say:
 About precision:
  An optional precision, in the form of a period (&#8216;.&#8217;)  followed by an
  optional decimal digit string.(...) This gives (...) the maximum
  number of characters to be printed from a string for s and S
  conversions.

 About s:
  If no l modifier is present: The const char * argument is expected to
  be a pointer to an array of character type(...)

  If an l modifier is present: The const wchar_t * argument is expected
  to be a pointer to an array of wide characters.  Wide characters from
  the  array  are converted  to  multibyte  characters (...)

There is no "l" modifier, but still, the string goes through the
multibyte conversion code, and fails because the string is invalid
multibyte.

Note, it only works with non UTF-8 locale set in LC_CTYPE or LC_ALL.

This is debian bug http://bugs.debian.org/208308
Comment 1 Pierre Habouzit 2008-05-18 10:38:54 UTC
Err the title is bogus, the thing is that sprintf returns -1 bogusly, if you 
run the previous testcase using  LC_ALL=ja_JP.EUC-JP e.g.

%s should not care about multibyte at all
Comment 2 Ævar Arnfjörð Bjarmason 2010-08-30 13:47:37 UTC
Here's a simpler testcase for this bug courtesy of Jonathan Nieder:
    
    #include <stdio.h>
    #include <locale.h>
    
    int main(void)
    {
        int n;
    
        setlocale(LC_CTYPE, "");
        n = printf("%.11s\n", "Author: \277");
        perror("printf");
        fprintf(stderr, "return value: %d\n", n);
        return 0;
    }

Under a C locale that'll do the right thing:
    
    $ LANG=C ./test
    Author: &#65533;
    printf: Success
    return value: 10

But not under a UTF-8 locale, since \277 isn't a valid UTF-8 sequence:
    
    $ LANG=en_US.utf8 ./test
    printf: Invalid or incomplete multibyte or wide character
    return value: -1
Comment 3 Rich Felker 2012-03-11 16:31:51 UTC
This may be a duplicate of another longstanding bug report that Mr. Drepper has marked as WONTFIX despite it being obviously non-conformant. He seems to believe, for whatever reason, that avoiding writing out incomplete multibyte sequences in incorrect programs that use %.NNs formats haphazardly with multibyte strings is more important than not breaking correct programs that are using %.NNs with strings not intended to be interpreted as multibyte characters in the current locale. Neither C nor POSIX makes any requirement that strings used with the %s format specifier be valid multibyte character sequences in the current locale; glibc is simply wrong on this issue.
Comment 4 Carlos O'Donell 2012-03-11 16:55:49 UTC
I'm moving this bug to unassigned. We need someone to review this issue and confirm the problem.
Comment 5 Carlos O'Donell 2012-03-11 16:59:07 UTC
What would really make this bug stand out is if someone did the work to review what the *standards* say about this particular behaviour. Reviewing what should be accepted is 90% of the work in issues like this. Help on this from the bug submitter would be very useful.
Comment 6 Rich Felker 2012-03-11 17:03:57 UTC
My pleasure. Per N1256 (C99+TC1/2/3):

s
If no l length modifier is present, the argument shall be a pointer to the initial element of an array of character type.246) Characters from the array are written up to (but not including) the terminating null character. If the precision is specified, no more than that many bytes are written. If the precision is not specified or is greater than the size of the array, the array shall contain a null character. If an l length modifier is present, the argument shall be a pointer to the initial element of an array of wchar_t type. Wide characters from the array are converted to multibyte characters (each as if by a call to the wcrtomb function, with the conversion state described by an mbstate_t object initialized to zero before the first wide character is converted) up to and including a terminating null wide character. The resulting multibyte characters are written up to (but not including) the terminating null character (byte). If no precision is specified, the array shall contain a null wide character. If a precision is specified, no more than that many bytes are written (including shift sequences, if any), and the array shall contain a null wide character if, to equal the multibyte character sequence length given by the precision, the function would need to access a wide character one past the end of the array. In no case is a partial multibyte character written.247)

And the referenced footnotes:

246) No special provisions are made for multibyte characters.

247) Redundant shift sequences may result if multibyte characters have a state-dependent encoding.

I don't think it could be clearer than what footnote 246 says about %s without the l modifier.
Comment 7 Carlos O'Donell 2012-03-11 17:08:08 UTC
Many thanks Rich, that really helps. I'll put this into the queue to see if we can fix it for 2.16.
Comment 8 Rich Felker 2012-05-14 13:21:52 UTC
Any progress on this? I'm going to try to find and reopen the bug it's a duplicate of; perhaps then they can be marked as duplicate.
Comment 9 Carlos O'Donell 2012-06-22 21:02:47 UTC
I'm moving this to 2.17. We didn't get a chance to look into this issue. We are in the code freeze for 2.16 and I think it will take considerable effort for the community to develop and test a fix for this.
Comment 10 Jonathan Nieder 2012-06-22 21:23:20 UTC
Based on experience from the sinking feeling of watching target milestones slowly march forward on the Mozilla and Chromium bugtrackers: could you target it at no particular milestone instead? That way, anyone stumbling on this will realize that it is their own responsibility to get it fixed quickly if they care for that to happen.

Thanks,
Jonathan
Comment 11 Carlos O'Donell 2012-06-22 22:08:14 UTC
(In reply to comment #10)
> Based on experience from the sinking feeling of watching target milestones
> slowly march forward on the Mozilla and Chromium bugtrackers: could you target
> it at no particular milestone instead? That way, anyone stumbling on this will
> realize that it is their own responsibility to get it fixed quickly if they
> care for that to happen.
> 
> Thanks,
> Jonathan

Jonathan, That's a discussion that should be had on libc-alpha@sourceware.org. Please start a thread there :-)
Comment 12 Rich Felker 2012-06-22 22:20:23 UTC
Fixing this bug should not be complex or invasive. Basically, it's a case of special-case code having been incorrectly written for a case that does not need to be, and should not be, special-cased. The fix should essentially just remove this incorrect code. I haven't read the source yet, but I'd be willing to look into it and possibly submit a patch.
Comment 13 Rich Felker 2012-06-23 00:54:03 UTC
Created attachment 6472 [details]
patch to fix the bug

This patch fixes the bug. As I suspected, it's all -'s, no +'s. If you want to fix the indention on the one line that's left, you can and it'll be a single +...
Comment 14 Jonathan Nieder 2012-06-23 02:31:05 UTC
Hi,

Rich Felker wrote:

> --- vfprintf.c.orig
> +++ vfprintf.c
> @@ -1168,42 +1168,7 @@
>  	else if (!is_long && spec != L_('S'))				      \
>  	  {								      \
>  	    if (prec != -1)						      \
> -	      {								      \
> -		/* Search for the end of the string, but don't search past    \
> -		   the length (in bytes) specified by the precision.  Also    \
> -		   don't use incomplete characters.  */			      \
> -		if (_NL_CURRENT_WORD (LC_CTYPE, _NL_CTYPE_MB_CUR_MAX) == 1)   \
>  		  len = __strnlen (string, prec);			      \
> -		else							      \
> -		  {							      \
> -		    /* In case we have a multibyte character set the	      \
> -		       situation is more complicated.  We must not copy	      \
> -		       bytes at the end which form an incomplete character. */\
[...]
> -		      len = str2 - string - (ps.__count & 7);		      \
> -		  }							      \
> -	      }								      \
>  	    else							      \
>  	      len = strlen (string);					      \
>  	  }								      \
> This patch fixes the bug. As I suspected, it's all -'s, no +'s.

Thanks.  wprintf(3) tells me:

	s
		If no l modifier is present: The const char * argument
		is expected to be a pointer to an array of character
		type (pointer to a string) containing a multibyte character
		sequence beginning in the initial shift state.
		Characters from the array are converted to wide
		characters (each by a call to the mbrtowc(3) function
		with a conversion state starting in the initial state
		before the first byte).  The resulting wide characters
		are written up to (but not including) the terminating
		null wide character.
[etc]

C99 says that in this case "the argument shall be a pointer to the
initial element of a character array containing a multibyte character
sequence beginning in the initial shift state".  POSIX uses similar
wording.  C99 describes an error condition:

	The fwprintf function returns the number of wide characters
	transmitted, or a negative value if an output or encoding
	error occurred.

POSIX does not document what value errno should have in this case, but
presumably EILSEQ is typical, for consistency with mbrtowc.

Luckily these considerations only matter in the defined(COMPILE_WPRINTF)
case.  Your patch only modifies the !defined(COMPILE_WPRINTF) code.
So for what it's worth:

Reviewed-by: Jonathan Nieder <jrnieder@gmail.com>
Comment 15 Rich Felker 2012-06-23 03:04:11 UTC
The bug has nothing to do with wprintf. It applies to the non-wide printf functions. The text you cited is about wprintf and is therefore irrelevant.
Comment 16 Jonathan Nieder 2012-06-23 03:19:33 UTC
(In reply to comment #15)
> The bug has nothing to do with wprintf. It applies to the non-wide printf
> functions. The text you cited is about wprintf and is therefore irrelevant.

Did you read my message to the end?  Is your reply meant to motivate people
to help you?

I would be thrilled in printf and wprintf were defined in separate files
with shared code (if any) in a third file, but that's a topic for another
day.
Comment 17 Rich Felker 2012-06-23 03:30:16 UTC
Apologies for jumping on your reply. I did read the whole thing but didn't get what you intended the message to be, and the whole history of this bug has been folks misquoting text that has nothing to do with the non-long %s format in the non-wide printf functions to justify the buggy behavior... :-(

With that said, thanks for the review.
Comment 18 Jonathan Nieder 2012-06-23 04:22:10 UTC
(In reply to comment #17)
> Apologies for jumping on your reply. I did read the whole thing but didn't get
> what you intended the message to be, and the whole history of this bug has been
> folks misquoting text that has nothing to do with the non-long %s format in the
> non-wide printf functions to justify the buggy behavior... :-(

I had been assuming that the bug was introduced by copying the wprintf()
implementation for reuse in printf() --- i.e., no misunderstanding of the
standard but just simple operator error.

But actually the patch introducing the regression (40c014b3, "Correct
handling of multibyte character strings in %s format with precision",
2000-07-22) only affected vsnprintf() and co.  The changelog comment
makes no sense.  There is no explanation on libc-alpha@ or libc-hacker@
from around that time.  It looks like a simple, unilateral, bad decision.

There's a patch identical to yours complete with changelog and copyright
assignment at <http://sourceware.org/ml/libc-hacker/2010-08/msg00009.html>.
It just reverts 40c014b3.  It was ignored.  A story from a worse time.

Thanks for fixing it.
Comment 19 jsm-csl@polyomino.org.uk 2012-06-23 11:00:23 UTC
A testcase for the fix should be added to the glibc testsuite (this 
applies to any bug fix that can reasonably be covered in the testsuite and 
isn't fixing an existing test failure).  (It's probably a good idea to add 
a test for wprintf as well to verify that wprintf continues to behave as 
expected, if this isn't already covered by existing wprintf tests.)
Comment 20 Rich Felker 2012-06-23 11:37:57 UTC
I don't think there's any concern about breaking wprintf. wprintf inherently operates by converting the argument to %s to wide characters since its output it wide characters; as such, there's no way it could avoid failing when the input is an invalid multibyte sequence unless it completely ignored the failure return.
Comment 21 Jonathan Nieder 2012-06-23 20:34:55 UTC
Created attachment 6474 [details]
test conversion from UTF-8 and handling of incomplete characters (fails)

I assume Joseph means to spend some time to avoid future authors breaking
wprintf now that our attention's on it, not to avoid breaking wprintf when
applying your patch.  Your patch doesn't even touch wprintf code.

I tried writing a wprintf testcase, and it's not gone so well.  For
example, I tried

  wchar_t buf[1000];
  int n;

  if (!setlocale(LC_ALL, "ja_JP.UTF-8)) {
    perror("setlocale");
    return 1;
  }
  n = swprintf(buf, sizeof(buf)/sizeof(wchar_t), L"%3s",
        "\xef\xbd\xa1g\xef\xbd\xa2h\xef\xbd\xa3i\xef\xbd\xa4j");

I expected swprintf to return 3 and buf to equal

  { 0xff61, 'g', 0xff62, 0 }

Instead, swprintf returns 8 and converts the entire string.
Comment 22 Rich Felker 2012-06-23 20:53:14 UTC
That's because you made the classic printf mistake of confusing width and precision.

Try L"%.3s".
Comment 23 Jonathan Nieder 2012-06-23 21:00:49 UTC
Created attachment 6475 [details]
Test swprintf's handling of incomplete multibyte characters

(In reply to comment #22)
> That's because you made the classic printf mistake of confusing width and
> precision.
> 
> Try L"%.3s".

Good catch, thanks.  Here's a corrected test, which works fine.
Comment 24 jsm-csl@polyomino.org.uk 2012-06-24 15:03:10 UTC
On Fri, 22 Jun 2012, carlos_odonell at mentor dot com wrote:

> > Based on experience from the sinking feeling of watching target milestones
> > slowly march forward on the Mozilla and Chromium bugtrackers: could you target
> > it at no particular milestone instead? That way, anyone stumbling on this will
> > realize that it is their own responsibility to get it fixed quickly if they
> > care for that to happen.
> > 
> > Thanks,
> > Jonathan
> 
> Jonathan, That's a discussion that should be had on libc-alpha@sourceware.org.
> Please start a thread there :-)

Yes, we do need to discuss how best to use milestones based on our 
experiences of the 2.16 release cycle.  Experience with bug trackers of 
other problems would certainly be a useful contribution to that discussion 
(and the related one of how to get bugs fixed faster than they are filed).
Comment 25 Rich Felker 2012-06-24 21:44:37 UTC
When a bug has been known for over 4 years and there's such a simple fix, why is there discussion of putting off fixing it until later? Or am I misunderstanding Jonathan's reply again?
Comment 26 Carlos O'Donell 2012-06-24 22:03:11 UTC
(In reply to comment #25)
> When a bug has been known for over 4 years and there's such a simple fix, why
> is there discussion of putting off fixing it until later? Or am I
> misunderstanding Jonathan's reply again?

This bug will likely be fixed in 2.17 development, with a backport to 2.16.1 if requested. Does that match your expectations?
Comment 27 Jonathan Nieder 2012-06-25 01:00:26 UTC
Created attachment 6477 [details]
Test sprintf's handling of incomplete multibyte characters

And here's a test to make sure the bug doesn't rear its head again in the future.

I personally think Rich's fix should be safe to apply for 2.16.
Comment 28 Jonathan Nieder 2012-10-01 23:42:08 UTC
commit 715a900c
Author: Jeff Law <law@redhat.com>
Date:   Fri Sep 28 12:48:42 2012 -0600

    2012-09-28  Andreas Schwab  <schwab@linux-m68k.org>
    
            [BZ #6530]
            * stdio-common/vfprintf.c (process_string_arg): Revert
            2000-07-22 change.
Comment 29 Carlos O'Donell 2012-10-02 21:20:23 UTC
Jonathan,

Could you please open one new issue per backport request and use keywords to mark which branch to backport the patch to.

Also we expect the person or groups that needs the backport to do the work of making and testing the patch. The release branch manager simply does the approval and analysis of risk for the branch.

Thanks.
Comment 30 Jonathan Nieder 2012-10-03 00:51:02 UTC
Sure, no problem.

Do I understand correctly that this means testing the patch against
2.16, 2.15, 2.14, 2.12, and 2.11 and then filing one bug for each
branch describing the result?
Comment 31 Carlos O'Donell 2012-10-03 00:54:00 UTC
(In reply to comment #30)
> Sure, no problem.
> 
> Do I understand correctly that this means testing the patch against
> 2.16, 2.15, 2.14, 2.12, and 2.11 and then filing one bug for each
> branch describing the result?

Yes, that way we can track the backport to each branch individually and the various release branch maintainers and the release branch shareholders can comment on the change. The keyword allows a release branch manager to quickly identify issues open against their branch.

If you have any feedback about this process I'd be more than happy to hear it. We don't have much good formal process for bugzilla, but we're trying :-)
Comment 32 jsm-csl@polyomino.org.uk 2012-10-03 21:19:42 UTC
On Wed, 3 Oct 2012, jrnieder at gmail dot com wrote:

> Do I understand correctly that this means testing the patch against
> 2.16, 2.15, 2.14, 2.12, and 2.11 and then filing one bug for each
> branch describing the result?

There is no real evidence of 2.14 or 2.12 being alive - no response from 
their maintainer when I asked on libc-alpha whether they are alive or dead 
- and 2.11 is considered dead 
<http://sourceware.org/ml/libc-alpha/2012-08/msg00557.html>.  I don't 
think it's worth doing any backports to those three branches.
Comment 33 Jonathan Nieder 2012-10-04 01:47:58 UTC
Backport requests for known living branches filed as bug 14667 and bug 14667.
Comment 34 Jonathan Nieder 2012-10-04 01:50:06 UTC
(In reply to comment #33)
> Backport requests for known living branches filed as bug 14667 and bug 14667.

... and bug 14668.  Sorry for the noise.
Comment 35 Andreas Schwab 2013-12-07 19:41:32 UTC
*** Bug 16298 has been marked as a duplicate of this bug. ***
Comment 36 Sourceware Commits 2014-01-30 18:22:35 UTC
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, release/2.15/master has been updated
       via  7def6a63a0298d2e5275820db22eb40ac3dcbe4b (commit)
      from  1ba48eb07a72690406c0ffda642a963c88639752 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7def6a63a0298d2e5275820db22eb40ac3dcbe4b

commit 7def6a63a0298d2e5275820db22eb40ac3dcbe4b
Author: Carlos O'Donell <carlos@redhat.com>
Date:   Thu Jan 30 13:19:58 2014 -0500

    Don't parse %s format argument as multibyte string
    
    2012-09-28  Andreas Schwab  <schwab@linux-m68k.org>
    
                [BZ #6530]
                * stdio-common/vfprintf.c (process_string_arg): Revert
                2000-07-22 change.
    
    2011-09-28  Jonathan Nieder  <jrnieder@gmail.com>
    
                * stdio-common/Makefile (tst-sprintf-ENV): Set environment
                for testcase.
                * stdio-common/tst-sprintf.c: Include <locale.h>
                (main): Test sprintf's handling of incomplete multibyte
                characters.
    
    (cherry picked from commit 715a900c9085907fa749589bf738b192b1a2bda5)

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                  |   14 ++++++++++++++
 NEWS                       |    8 ++++----
 stdio-common/Makefile      |    1 +
 stdio-common/tst-sprintf.c |   13 +++++++++++++
 stdio-common/vfprintf.c    |   38 +++-----------------------------------
 5 files changed, 35 insertions(+), 39 deletions(-)
Comment 37 Jackie Rosen 2014-02-16 19:42:25 UTC Comment hidden (spam)