This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: Cygwin 1.7.1 sprintf() with format string having 8th bit set


Andy Koppe wrote:
2010/1/4 Joseph Quinsey
In Cygwin 1.7.1, sprintf() with the format string having an 8th bit set
appears to be broken. Sample code (where I've indicated the backslashes in
the comments, in case they are stripped out by the mailer):

#include <stdio.h>

int main (void)
{
   unsigned char foo[30] = "";
   unsigned char bar[30] = "";
   unsigned char xxx[30] = "";
   sprintf ((char *) foo, "\100%s", "ABCD"); /* this is backslash one zero zero   */
   sprintf ((char *) bar, "\300%s", "ABCD"); /* this is backslash three zero zero */
   sprintf ((char *) xxx, "\300ABCD");       /* this is backslash three zero zero */
   printf ("%d %d %d %d %d\n", foo[0],foo[1],foo[2],foo[3],foo[4]);
   printf ("%d %d %d %d %d\n", bar[0],bar[1],bar[2],bar[3],bar[4]);
   printf ("%d %d %d %d %d\n", xxx[0],xxx[1],xxx[2],xxx[3],xxx[4]);
   return 0;
}

gives:

64 65 66 67 68
0 0 0 0 0
192 65 66 67 68

The second line of the output should be the same as the third.

The issue here is that the character set of the "C" locale in Cygwin 1.7 is UTF-8 and that the \300 on its own is an invalid UTF-8 byte.
My assumption has been that *printf should be byte-transparent except where it takes explicit wide-character arguments.
After all, legacy applications that do not care about locales at all may legitimately assume this, since a C char [] is a byte sequence.
That is not changed by the casual legacy use of the word "character" for a char value, which does not automatically imply "wide character".
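
To make the "invalid UTF-8 byte" claim concrete, here is a minimal sketch of a well-formedness check. The helper name is_valid_utf8 is hypothetical and is not part of Cygwin or newlib. The check is deliberately simplified: it validates lead bytes and continuation bytes only, and does not reject overlong three- and four-byte forms, surrogates, or code points above U+10FFFF.

```c
#include <stddef.h>

/* Hypothetical helper (not a Cygwin/newlib function): returns 1 if the
   n bytes at s pass a simplified UTF-8 well-formedness check, else 0. */
int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char c = s[i];
        size_t extra;   /* number of continuation bytes expected */

        if (c < 0x80)
            extra = 0;                      /* plain ASCII */
        else if (c >= 0xC2 && c <= 0xDF)
            extra = 1;
        else if (c >= 0xE0 && c <= 0xEF)
            extra = 2;
        else if (c >= 0xF0 && c <= 0xF4)
            extra = 3;
        else
            return 0;   /* 0x80..0xC1 and 0xF5..0xFF never start a sequence */

        if (extra > n - i - 1)
            return 0;                       /* sequence is cut short */
        for (size_t k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;                   /* not a continuation byte */
        i += extra + 1;
    }
    return 1;
}
```

With this check, the three bytes of "\100%s" pass, while "\300%s" is rejected at its first byte: \300 is 0xC0, which cannot begin any UTF-8 sequence. Whether sprintf should care about that is exactly the point in dispute.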


Reading http://www.opengroup.org/onlinepubs/9699919799/functions/fprintf.html:

[EILSEQ]
   A wide-character code that does not correspond to a valid character
   has been detected.

This explicitly refers to "wide characters", which are mentioned elsewhere in that document only as argument values for the %lc and %ls conversion specifications.
I don't think it needs to, or even should, be interpreted as referring to the format string.
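
As an illustration of where the EILSEQ clause plainly does apply, here is a small sketch that pushes a single wide character through %lc and reports what happened. The helper name format_wide_char is hypothetical, and whether a given wide character converts depends on the C library and the current LC_CTYPE, so no particular outcome is claimed for non-ASCII input.

```c
#include <errno.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: formats one wide character with %lc.
   Returns snprintf's result and stores the resulting errno in *err. */
int format_wide_char(wchar_t wc, char *out, size_t outsize, int *err)
{
    int r;

    errno = 0;
    r = snprintf(out, outsize, "%lc", (wint_t) wc);
    *err = errno;
    return r;
}
```

If the wide character has no representation in the current locale, the call is expected to return a negative value with errno set to EILSEQ. Here the conversion error comes from an argument, not from bytes in the format string.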


To get well-defined behaviour, you need to invoke setlocale(LC_CTYPE,
...) with the appropriate locale.
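
A minimal sketch of that suggestion follows. The helper name format_with_locale is hypothetical, and the locale name is left to the caller because the names that exist vary by system: something like "en_US.ISO-8859-1" is only an assumption and may not be installed.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: selects the given LC_CTYPE locale, then formats
   "\300%s" into out. Returns -1 if the locale cannot be set, otherwise
   whatever snprintf returns (negative if the library rejects the format). */
int format_with_locale(const char *locale_name, char *out, size_t outsize,
                       const char *arg)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;                  /* locale not available here */
    return snprintf(out, outsize, "\300%s", arg);
}
```

With a single-byte character set selected, \300 is an ordinary character and the call would be expected to return 5 for the argument "ABCD". Checking the return value covers the case where the library still refuses the format.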

See the thread at http://cygwin.com/ml/cygwin/2009-12/msg00980.html
for more on this.
In that thread, someone had originally confused char * with wchar_t []; the issue resolves cleanly if the two are properly distinguished.

Comments on the EILSEQ clause from that thread:
> It's talking about "characters" rather than "bytes" there, which I
> think does leave the behaviour for invalid bytes undefined,
No, it's talking about "wide character codes" and "valid characters", to be picky.

> It's actually well-defined - non-characters in the format string MUST make
> printf fail.
I strongly disagree here: I claim it is not well-defined at all.

> The issue wasn't with wide characters, but invalid multibyte chars.
> But anyway, we're agreed that printf is right to bail out.
I don't think there is such a thing as an invalid multibyte character in a char [] unless it is being interpreted by a multibyte function; that is what e.g. the mb* functions are for.
In a legacy application, especially in an sprintf whose result may not even be intended for printing, there is no intent to apply a multibyte interpretation. This imposes semantics on a basic C type that it does not have.
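
The distinction can be sketched as follows: strlen treats a char [] purely as bytes, while a multibyte interpretation only happens when a function such as mbrtowc is explicitly asked for it. The helper name count_multibyte_chars is hypothetical, and its result for bytes above 0x7F depends on the C library and the current LC_CTYPE.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: counts the multibyte characters in s under the
   current LC_CTYPE locale. Returns -1 if mbrtowc reports an invalid or
   incomplete sequence. */
long count_multibyte_chars(const char *s)
{
    mbstate_t st;
    size_t left = strlen(s);    /* byte length, no interpretation */
    long count = 0;

    memset(&st, 0, sizeof st);  /* start in the initial shift state */
    while (left > 0) {
        size_t n = mbrtowc(NULL, s, left, &st);

        if (n == (size_t) -1 || n == (size_t) -2)
            return -1;          /* invalid or truncated sequence */
        if (n == 0)
            break;              /* NUL reached; not expected within left */
        s += n;
        left -= n;
        count++;
    }
    return count;
}
```

For "\300ABCD", strlen always reports 5 bytes. count_multibyte_chars returns -1 in a UTF-8 locale, because the question "is this a valid character?" only arises once the bytes are decoded.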


So I do not agree that printf is right here; and if it were, the third line in the example would have had to fail as well.

Thomas


