With the following testcase, it happens while it shouldn't, according to the manual: -----8<------- #include <stdio.h> #include <locale.h> #define STR "²éľÂíɱ²¡¶¾£¬ÖܺèµtÄúµÄ360²»×¨Òµ£¡" int main(void) { char buf[200]; setlocale(LC_ALL, ""); printf("%d\n", snprintf(buf, 150, "%.50s", STR)); return 0; } ----->8------- The manual page has this to say: About precision: An optional precision, in the form of a period (‘.’) followed by an optional decimal digit string.(...) This gives (...) the maximum number of characters to be printed from a string for s and S conversions. About s: If no l modifier is present: The const char * argument is expected to be a pointer to an array of character type(...) If an l modifier is present: The const wchar_t * argument is expected to be a pointer to an array of wide characters. Wide characters from the array are converted to multibyte characters (...) There is no "l" modifier, but still, the string goes through the multibyte conversion code, and fails because the string is invalid multibyte. Note, it only works with non UTF-8 locale set in LC_CTYPE or LC_ALL. This is debian bug http://bugs.debian.org/208308
Err the title is bogus, the thing is that sprintf returns -1 bogusly, if you run the previous testcase using LC_ALL=ja_JP.EUC-JP e.g. %s should not care about multibyte at all
Here's a simpler testcase for this bug courtesy of Jonathan Nieder: #include <stdio.h> #include <locale.h> int main(void) { int n; setlocale(LC_CTYPE, ""); n = printf("%.11s\n", "Author: \277"); perror("printf"); fprintf(stderr, "return value: %d\n", n); return 0; } Under a C locale that'll do the right thing: $ LANG=C ./test Author: � printf: Success return value: 10 But not under a UTF-8 locale, since \277 isn't a valid UTF-8 sequence: $ LANG=en_US.utf8 ./test printf: Invalid or incomplete multibyte or wide character return value: -1
This may be a duplicate of another longstanding bug report that Mr. Drepper has marked as WONTFIX despite it being obviously non-conformant. He seems to believe, for whatever reason, that avoiding writing out incomplete multibyte sequences in incorrect programs that use %.NNs formats haphazardly with multibyte strings is more important than not breaking correct programs that are using %.NNs with strings not intended to be interpreted as multibyte characters in the current locale. Neither C nor POSIX makes any requirement that strings used with the %s format specifier be valid multibyte character sequences in the current locale; glibc is simply wrong on this issue.
I'm moving this bug to unassigned. We need someone to review this issue and confirm the problem.
What would really make this bug stand out is if someone did the work to review what the *standards* say about this particular behaviour. Reviewing what should be accepted is 90% of the work in issues like this. Help on this from the bug submitter would be very useful.
My pleasure. Per N1256 (C99+TC1/2/3): s If no l length modifier is present, the argument shall be a pointer to the initial element of an array of character type.246) Characters from the array are written up to (but not including) the terminating null character. If the precision is specified, no more than that many bytes are written. If the precision is not specified or is greater than the size of the array, the array shall contain a null character. If an l length modifier is present, the argument shall be a pointer to the initial element of an array of wchar_t type. Wide characters from the array are converted to multibyte characters (each as if by a call to the wcrtomb function, with the conversion state described by an mbstate_t object initialized to zero before the first wide character is converted) up to and including a terminating null wide character. The resulting multibyte characters are written up to (but not including) the terminating null character (byte). If no precision is specified, the array shall contain a null wide character. If a precision is specified, no more than that many bytes are written (including shift sequences, if any), and the array shall contain a null wide character if, to equal the multibyte character sequence length given by the precision, the function would need to access a wide character one past the end of the array. In no case is a partial multibyte character written.247) And the referenced footnotes: 246) No special provisions are made for multibyte characters. 247) Redundant shift sequences may result if multibyte characters have a state-dependent encoding. I don't think it could be clearer than what footnote 246 says about %s without the l modifier.
Many thanks Rich, that really helps. I'll put this into the queue to see if we can fix it for 2.16.
Any progress on this? I'm going to try to find and reopen the bug it's a duplicate of; perhaps then they can be marked as duplicate.
I'm moving this to 2.17. We didn't get a chance to look into this issue. We are in the code freeze for 2.16 and I think it will take considerable effort for the community to develop and test a fix for this.
Based on experience from the sinking feeling of watching target milestones slowly march forward on the Mozilla and Chromium bugtrackers: could you target it at no particular milestone instead? That way, anyone stumbling on this will realize that it is their own responsibility to get it fixed quickly if they care for that to happen. Thanks, Jonathan
(In reply to comment #10) > Based on experience from the sinking feeling of watching target milestones > slowly march forward on the Mozilla and Chromium bugtrackers: could you target > it at no particular milestone instead? That way, anyone stumbling on this will > realize that it is their own responsibility to get it fixed quickly if they > care for that to happen. > > Thanks, > Jonathan Jonathan, That's a discussion that should be had on libc-alpha@sourceware.org. Please start a thread there :-)
Fixing this bug should not be complex or invasive. Basically, it's a case of special-case code having been incorrectly written for a case that does not need to be, and should not be, special-cased. The fix should essentially just remove this incorrect code. I haven't read the source yet, but I'd be willing to look into it and possibly submit a patch.
Created attachment 6472 [details] patch to fix the bug This patch fixes the bug. As I suspected, it's all -'s, no +'s. If you want to fix the indention on the one line that's left, you can and it'll be a single +...
Hi, Rich Felker wrote: > --- vfprintf.c.orig > +++ vfprintf.c > @@ -1168,42 +1168,7 @@ > else if (!is_long && spec != L_('S')) \ > { \ > if (prec != -1) \ > - { \ > - /* Search for the end of the string, but don't search past \ > - the length (in bytes) specified by the precision. Also \ > - don't use incomplete characters. */ \ > - if (_NL_CURRENT_WORD (LC_CTYPE, _NL_CTYPE_MB_CUR_MAX) == 1) \ > len = __strnlen (string, prec); \ > - else \ > - { \ > - /* In case we have a multibyte character set the \ > - situation is more complicated. We must not copy \ > - bytes at the end which form an incomplete character. */\ [...] > - len = str2 - string - (ps.__count & 7); \ > - } \ > - } \ > else \ > len = strlen (string); \ > } \ > This patch fixes the bug. As I suspected, it's all -'s, no +'s. Thanks. wprintf(3) tells me: s If no l modifier is present: The const char * argument is expected to be a pointer to an array of character type (pointer to a string) containing a multibyte character sequence beginning in the initial shift state. Characters from the array are converted to wide characters (each by a call to the mbrtowc(3) function with a conversion state starting in the initial state before the first byte). The resulting wide characters are written up to (but not including) the terminating null wide character. [etc] C99 says that in this case "the argument shall be a pointer to the initial element of a character array containing a multibyte character sequence beginning in the initial shift state". POSIX uses similar wording. C99 describes an error condition: The fwprintf function returns the number of wide characters transmitted, or a negative value if an output or encoding error occurred. POSIX does not document what value errno should have in this case, but presumably EILSEQ is typical, for consistency with mbrtowc. Luckily these considerations only matter in the defined(COMPILE_WPRINTF) case. Your patch only modifies the !defined(COMPILE_WPRINTF) code. So for what it's worth: Reviewed-by: Jonathan Nieder <jrnieder@gmail.com>
The bug has nothing to do with wprintf. It applies to the non-wide printf functions. The text you cited is about wprintf and is therefore irrelevant.
(In reply to comment #15) > The bug has nothing to do with wprintf. It applies to the non-wide printf > functions. The text you cited is about wprintf and is therefore irrelevant. Did you read my message to the end? Is your reply meant to motivate people to help you? I would be thrilled in printf and wprintf were defined in separate files with shared code (if any) in a third file, but that's a topic for another day.
Apologies for jumping on your reply. I did read the whole thing but didn't get what you intended the message to be, and the whole history of this bug has been folks misquoting text that has nothing to do with the non-long %s format in the non-wide printf functions to justify the buggy behavior... :-( With that said, thanks for the review.
(In reply to comment #17) > Apologies for jumping on your reply. I did read the whole thing but didn't get > what you intended the message to be, and the whole history of this bug has been > folks misquoting text that has nothing to do with the non-long %s format in the > non-wide printf functions to justify the buggy behavior... :-( I had been assuming that the bug was introduced by copying the wprintf() implementation for reuse in printf() --- i.e., no misunderstanding of the standard but just simple operator error. But actually the patch introducing the regression (40c014b3, "Correct handling of multibyte character strings in %s format with precision", 2000-07-22) only affected vsnprintf() and co. The changelog comment makes no sense. There is no explanation on libc-alpha@ or libc-hacker@ from around that time. It looks like a simple, unilateral, bad decision. There's a patch identical to yours complete with changelog and copyright assignment at <http://sourceware.org/ml/libc-hacker/2010-08/msg00009.html>. It just reverts 40c014b3. It was ignored. A story from a worse time. Thanks for fixing it.
A testcase for the fix should be added to the glibc testsuite (this applies to any bug fix that can reasonably be covered in the testsuite and isn't fixing an existing test failure). (It's probably a good idea to add a test for wprintf as well to verify that wprintf continues to behave as expected, if this isn't already covered by existing wprintf tests.)
I don't think there's any concern about breaking wprintf. wprintf inherently operates by converting the argument to %s to wide characters since its output it wide characters; as such, there's no way it could avoid failing when the input is an invalid multibyte sequence unless it completely ignored the failure return.
Created attachment 6474 [details] test conversion from UTF-8 and handling of incomplete characters (fails) I assume Joseph means to spend some time to avoid future authors breaking wprintf now that our attention's on it, not to avoid breaking wprintf when applying your patch. Your patch doesn't even touch wprintf code. I tried writing a wprintf testcase, and it's not gone so well. For example, I tried wchar_t buf[1000]; int n; if (!setlocale(LC_ALL, "ja_JP.UTF-8)) { perror("setlocale"); return 1; } n = swprintf(buf, sizeof(buf)/sizeof(wchar_t), L"%3s", "\xef\xbd\xa1g\xef\xbd\xa2h\xef\xbd\xa3i\xef\xbd\xa4j"); I expected swprintf to return 3 and buf to equal { 0xff61, 'g', 0xff62, 0 } Instead, swprintf returns 8 and converts the entire string.
That's because you made the classic printf mistake of confusing width and precision. Try L"%.3s".
Created attachment 6475 [details] Test swprintf's handling of incomplete multibyte characters (In reply to comment #22) > That's because you made the classic printf mistake of confusing width and > precision. > > Try L"%.3s". Good catch, thanks. Here's a corrected test, which works fine.
On Fri, 22 Jun 2012, carlos_odonell at mentor dot com wrote: > > Based on experience from the sinking feeling of watching target milestones > > slowly march forward on the Mozilla and Chromium bugtrackers: could you target > > it at no particular milestone instead? That way, anyone stumbling on this will > > realize that it is their own responsibility to get it fixed quickly if they > > care for that to happen. > > > > Thanks, > > Jonathan > > Jonathan, That's a discussion that should be had on libc-alpha@sourceware.org. > Please start a thread there :-) Yes, we do need to discuss how best to use milestones based on our experiences of the 2.16 release cycle. Experience with bug trackers of other problems would certainly be a useful contribution to that discussion (and the related one of how to get bugs fixed faster than they are filed).
When a bug has been known for over 4 years and there's such a simple fix, why is there discussion of putting off fixing it until later? Or am I misunderstanding Jonathan's reply again?
(In reply to comment #25) > When a bug has been known for over 4 years and there's such a simple fix, why > is there discussion of putting off fixing it until later? Or am I > misunderstanding Jonathan's reply again? This bug will likely be fixed in 2.17 development, with a backport to 2.16.1 if requested. Does that match your expectations?
Created attachment 6477 [details] Test sprintf's handling of incomplete multibyte characters And here's a test to make sure the bug doesn't rear its head again in the future. I personally think Rich's fix should be safe to apply for 2.16.
commit 715a900c Author: Jeff Law <law@redhat.com> Date: Fri Sep 28 12:48:42 2012 -0600 2012-09-28 Andreas Schwab <schwab@linux-m68k.org> [BZ #6530] * stdio-common/vfprintf.c (process_string_arg): Revert 2000-07-22 change.
Jonathan, Could you please open one new issue per backport request and use keywords to mark which branch to backport the patch to. Also we expect the person or groups that needs the backport to do the work of making and testing the patch. The release branch manager simply does the approval and analysis of risk for the branch. Thanks.
Sure, no problem. Do I understand correctly that this means testing the patch against 2.16, 2.15, 2.14, 2.12, and 2.11 and then filing one bug for each branch describing the result?
(In reply to comment #30) > Sure, no problem. > > Do I understand correctly that this means testing the patch against > 2.16, 2.15, 2.14, 2.12, and 2.11 and then filing one bug for each > branch describing the result? Yes, that way we can track the backport to each branch individually and the various release branch maintainers and the release branch shareholders can comment on the change. The keyword allows a release branch manager to quickly identify issues open against their branch. If you have any feedback about this process I'd be more than happy to hear it. We don't have much good formal process for bugzilla, but we're trying :-)
On Wed, 3 Oct 2012, jrnieder at gmail dot com wrote: > Do I understand correctly that this means testing the patch against > 2.16, 2.15, 2.14, 2.12, and 2.11 and then filing one bug for each > branch describing the result? There is no real evidence of 2.14 or 2.12 being alive - no response from their maintainer when I asked on libc-alpha whether they are alive or dead - and 2.11 is considered dead <http://sourceware.org/ml/libc-alpha/2012-08/msg00557.html>. I don't think it's worth doing any backports to those three branches.
Backport requests for known living branches filed as bug 14667 and bug 14667.
(In reply to comment #33) > Backport requests for known living branches filed as bug 14667 and bug 14667. ... and bug 14668. Sorry for the noise.
*** Bug 16298 has been marked as a duplicate of this bug. ***
This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "GNU C Library master sources". The branch, release/2.15/master has been updated via 7def6a63a0298d2e5275820db22eb40ac3dcbe4b (commit) from 1ba48eb07a72690406c0ffda642a963c88639752 (commit) Those revisions listed above that are new to this repository have not appeared on any other notification email; so we list those revisions in full, below. - Log ----------------------------------------------------------------- https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=7def6a63a0298d2e5275820db22eb40ac3dcbe4b commit 7def6a63a0298d2e5275820db22eb40ac3dcbe4b Author: Carlos O'Donell <carlos@redhat.com> Date: Thu Jan 30 13:19:58 2014 -0500 Don't parse %s format argument as multibyte string 2012-09-28 Andreas Schwab <schwab@linux-m68k.org> [BZ #6530] * stdio-common/vfprintf.c (process_string_arg): Revert 2000-07-22 change. 2011-09-28 Jonathan Nieder <jrnieder@gmail.com> * stdio-common/Makefile (tst-sprintf-ENV): Set environment for testcase. * stdio-common/tst-sprintf.c: Include <locale.h> (main): Test sprintf's handling of incomplete multibyte characters. (cherry picked from commit 715a900c9085907fa749589bf738b192b1a2bda5) ----------------------------------------------------------------------- Summary of changes: ChangeLog | 14 ++++++++++++++ NEWS | 8 ++++---- stdio-common/Makefile | 1 + stdio-common/tst-sprintf.c | 13 +++++++++++++ stdio-common/vfprintf.c | 38 +++----------------------------------- 5 files changed, 35 insertions(+), 39 deletions(-)
*** Bug 260998 has been marked as a duplicate of this bug. *** Seen from the domain http://volichat.com Page where seen: http://volichat.com/adult-chat-rooms Marked for reference. Resolved as fixed @bugzilla.