This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.
Re: multiple (> 10) glibc 2.3.2 bugs related to l10n/i18n (with tests)

From: mjn3 at codepoet dot org (Manuel Novoa III)
To: Ulrich Drepper <drepper at redhat dot com>
Cc: libc-alpha at sources dot redhat dot com
Date: Wed, 10 Sep 2003 05:10:47 -0600
Subject: Re: multiple (> 10) glibc 2.3.2 bugs related to l10n/i18n (with tests)
References: <20030905174448.GA14921@codepoet.org> <3F5EB8A8.3070509@redhat.com>
Hello,

On Tue, Sep 09, 2003 at 10:37:44PM -0700, Ulrich Drepper wrote:
> Manuel Novoa III wrote:
> > As it has been a couple of days since I posted the glibc bug list and

   http://sources.redhat.com/ml/bug-glibc/2003-09/msg00032.html

> > I've seen no comment, I wanted to bring the following to your attention.
> 
> Don't send to the FSF mailing list.  The only safe way is to use
> libc-alpha@sources.redhat.com.  This is clearly documented in each and
> every release note.

Apologies.  I don't follow glibc releases that closely.  But I do test
against it, and took the time to write up and post the test apps as a
courtesy.

> > In that post, I included the following tests (with somewhat descriptive
> > names) illustrating a number of bugs in glibc 2.3.2.
> 
> I look at the tests.  Some were indeed bugs and I fixed those (mostly).
>  The others are no bugs.  Either the other implementation you are using
> is wrong or you complain about unspecified behavior.

I don't recall "complaining".  In fact, I can only think of one test that
you might be referring to, unget_putc_segfault.c, and I specifically stated
that the behavior was undefined by the standards.  I mentioned it because
I discovered it while testing unusual cases, and it went against stated
glibc behavior in the manual.

Regarding the others that you do not consider bugs, discussion follows.

**********************************************************************

> >   collation-undefined-bug.c
> 
> Wrong assumption.  Look at the properties of U000A in the locale
> specification.  It only has a value in the four column which is
> right,position.  This is why undefined sorts first.

Well, position is not specified for any of th_TH, cs_CZ, ja_JP, or ko_KR.
As for en_US (iso14651_t), I've given my interpretation below as well.

iso14651_t  specifies

   order_start <SPECIAL>;forward;backward;forward;forward,position
   <U000A> IGNORE;IGNORE;IGNORE;<U000A>
   order_end

   Since UNDEFINED is not specified, all undefined characters are placed
   at the end of the character collation order.  Since there was an order_end,
   "my interpretation" of DTR 14652 (specifically n897-14652w25.pdf) would be
   that the equivalent of "order_start forward" would be in effect.  Since
   <U000A> is IGNOREd at the first level, you would effectively be comparing
   "\xFFFD" to "" and hence "\xFFFD\x000A" should sort after "\x000A".

th_TH collation specifies

   order_start   forward;forward;forward;forward
   UNDEFINED      IGNORE;IGNORE;IGNORE;IGNORE

   Since both <UFFFD> and <U000A> are undefined implicitly, and
   position is not specified at any weight level, both <U000A>
   and <UFFFD> should be completely ignored at all levels, and
   the strings should be treated as equal.

cs_CZ collation specifies

   order_start forward;forward;forward;forward
   <U000A> IGNORE;IGNORE;IGNORE;<U000A>
   UNDEFINED       IGNORE;IGNORE;IGNORE;IGNORE

   Since <UFFFD> is undefined implicitly, position is not specified
   at any weight level, and <U000A> has a weight at the 4th level,
   <U000A> should sort after <UFFFD>.

ja_JP and ko_KR both specify collation rules as

   order_start forward
   <U000A>
   UNDEFINED

   There is only a single weight level, and position is not specified.
   <UFFFD> is undefined implicitly and should sort after <U000A>.

**********************************************************************

> >   locale-initialization-bug-1.c
> 
> I've no idea what you think is a problem.  The current glibc behavior
> seems 100% correct.

Perhaps it has been fixed since.  But I tested with glibc from both Debian
testing and Red Hat 9.  The problem I was seeing (as illustrated in the diff
at the top of the example) was that in the fa_IR.UTF-8 locale, wprintf is
generating incorrect output.

Maybe it will be clearer looking at a hexdump.  Both fprintf(stderr,...) and
wprintf(...) are using "%ls" to output the same wchar_t buf[].  Yet wprintf()
is outputting '?' (0x3f) whenever it encounters a wide character without an
ASCII equivalent.  The output for both fprintf() and wprintf() should have been
identical.

$ ./locale-initialization-bug-1 2>&1 | tail -n 8 | hexdump -C
00000000  20 20 6e 3d 32 36 20 20  62 75 66 3d 22 db b1 d9  |  n=26  buf="...|
00000010  ac db b2 db b3 db b4 d9  ac db b5 db b6 db b7 d9  |................|
00000020  ac db b8 db b9 db b0 22  0a 20 20 6e 3d 32 36 20  |.......".  n=26 |
00000030  20 62 75 66 3d 22 3f 3f  3f 3f 3f 3f 3f 3f 3f 3f  | buf="??????????|
00000040  3f 3f 3f 22 0a 20 20 6e  3d 31 33 20 20 62 75 66  |???".  n=13  buf|
00000050  3d 22 db b1 d9 ac db b2  db b3 db b4 d9 ac db b5  |="..............|
00000060  db b6 db b7 d9 ac db b8  db b9 db b0 22 0a 20 20  |............".  |
00000070  6e 3d 31 33 20 20 62 75  66 3d 22 3f 3f 3f 3f 3f  |n=13  buf="?????|
00000080  3f 3f 3f 3f 3f 3f 3f 3f  22 0a 20 20 6e 3d 32 34  |????????".  n=24|
00000090  20 20 62 75 66 3d 22 31  d9 ac 32 33 34 d9 ac 35  |  buf="1..234..5|
000000a0  36 37 d9 ac 38 39 30 d9  ab 30 30 30 30 30 30 22  |67..890..000000"|
000000b0  0a 20 20 6e 3d 32 34 20  20 62 75 66 3d 22 31 3f  |.  n=24  buf="1?|
000000c0  32 33 34 3f 35 36 37 3f  38 39 30 3f 30 30 30 30  |234?567?890?0000|
000000d0  30 30 22 0a 20 20 6e 3d  32 30 20 20 62 75 66 3d  |00".  n=20  buf=|
000000e0  22 31 d9 ac 32 33 34 d9  ac 35 36 37 d9 ac 38 39  |"1..234..567..89|
000000f0  30 d9 ab 30 30 30 30 30  30 22 0a 20 20 6e 3d 32  |0..000000".  n=2|
00000100  30 20 20 62 75 66 3d 22  31 3f 32 33 34 3f 35 36  |0  buf="1?234?56|
00000110  37 3f 38 39 30 3f 30 30  30 30 30 30 22 0a        |7?890?000000".|
0000011e

**********************************************************************

> >   printf-illegal-mb-precision-bug.c
> 
> Wrong assumption.  Once you use invalid byte sequences you're on your
> own.  There is no requirement for the implementation to be gracious
> about this.  It is impossible to do get into this state in a legal way.

What invalid byte sequences are you referring to?  For printf, %s doesn't
care a whit what the bytes are... other than the (possible) nul terminator.
And precision for %s in printf counts _bytes_.  Why would you check that
they form a valid multibyte sequence?  (This is in contrast to the format
string itself, which the standard says _is_ a multibyte sequence.)

You seem to have missed the footnote in the ANSI/ISO C99 spec.  Let me repeat

  If no l length modifier is present, the argument shall be a pointer to the
  initial element of an array of character type. {237: No special provisions
  are made for multibyte characters.}  Characters from the array are written
  up to (but not including) the terminating null character. If the precision
  is specified, no more than that many bytes are written.

While one might quibble about whether a "character" written from the array is
a 'char' or an mb char, it goes on to specify that the precision limit is in bytes.

Besides, the Single Unix Specification Version 3 is not ambiguous at all.
  
  The argument shall be a pointer to an array of char. _Bytes_ {my emphasis} from
  the array shall be written up to (but not including) any terminating null byte.
  If the precision is specified, no more than that many bytes shall be written.
  If the precision is not specified or is greater than the size of the array, the
  application shall ensure that the array contains a null byte.

**********************************************************************

> >   scanf-c-bug.c
> 
> Wrong assumption.  %7c means "up to 7 characters".  This is how every
> implementation I'm familiar with interprets it.  And I think even the
> POSIX test suite require this.  If you don't agree file a bug for the
> ISO C committee to look at.

It isn't my "assumption".  It is what is specified in the standard.
The ANSI/ISO C99 spec states

   c Matches a sequence of characters of _exactly_ {my emphasis} the number
     specified by the field width (1 if no field width is present in the
     directive).

and the Single Unix Specification Version 3 is in agreement, stating

    c

    Matches a sequence of bytes of the _number_specified_ {my emphasis} by the
    field width (1 if no field width is present in the conversion specification).
    The application shall ensure that the corresponding argument is a pointer to
    the initial byte of an array of char, signed char, or unsigned char large
    enough to accept the sequence. No null byte is added. The normal skip over
    white-space characters shall be suppressed in this case.

An example of a scanf implementation where %7c means "exactly 7 characters" is
the code published in P.J. Plauger's "The Standard C Library".

Now, if you think the standard should be changed, perhaps _you_ should file a
bug report.  If the standard gets changed, I'll happily update my code.  Until
then, you might want to note the difference in the glibc CONFORMANCE document
if you are unwilling to comply.

**********************************************************************

> >   scanf-int-grouping-bug.c
> 
> Your expectations are very wrong.  Dangerously so.

No... It is just that you don't have a clue as to what my "expectations" are.
You obviously didn't pay any attention to the comments and diff in the test
app.  Please pay attention this time around.

> The C standard
> requires only one character pushback (ungetc).  I.e., every single

Undisputed.

> function implementation, including scanf() etc, must not behave as if it
> could push back more.

As discussed below, the "unget" machinery of scanf and ungetc are _not_
coupled as you believe... at least not according to ANSI/ISO C99.  But even
if you want to dispute that, the glibc scanf behavior is _still_ broken.

> Since it it not possible to determine ahead of
> time when to stop in these cases only the first completely invalid
> character or the final NUL can be pushed back.

Take the first example...  " 12,34x" in the en_US locale, with groupings of
three digits.  For the conversion "%n%'i%n%s", glibc's scanf was returning
2, with the integer set to 12 and the string set to "x".  I'm _not_ saying
that the ",34" should be pushed back.  I'm saying that the conversion
_should_have_failed_.  If you had bothered to look at the included diff, you
might have noticed that.

Now, I'm well aware of what the C standard says.  While it is true that
you are only portably guaranteed one character of pushback via ungetc(),
this differs from the scanf mechanism.  Below are some quotes from the
ANSI/ISO C99 spec.  Note in the description of scanf how it specifically
states that the non-matching input remains "unread"... not pushed back,
but unread.

   7.19.7.11 The ungetc function
   int ungetc(int c, FILE *stream);
   ...
   3 One character of pushback is guaranteed. If the ungetc function is called
     too many times on the same stream without an intervening read or file
     positioning operation on that stream, the operation may fail.


   7.19.6.2 The fscanf function
   int fscanf(FILE * restrict stream, const char * restrict format, ...);

   5 A directive composed of whitespace character(s) is executed by reading
     input up to the first nonwhitespace character (which remains unread), or
     until no more characters can be read.
   6 A directive that is an ordinary multibyte character is executed by reading
     the next characters of the stream. If any of those characters differ from the
     ones composing the directive, the directive fails and the differing and
     subsequent characters remain unread.  Similarly, if end-of-file, an encoding
     error, or a read error prevents a character from being read, the directive
     fails.
   7 A directive that is a conversion specification defines a set of matching
     input sequences, as described below for each specifier. A conversion
     specification is executed in the following steps:
   8 Input whitespace characters (as specified by the isspace function) are
     skipped, unless the specification includes a [, c, or n specifier. 241)
   9 An input item is read from the stream, unless the specification includes an
     n specifier. An input item is defined as the longest sequence of input
     characters which does not exceed any specified field width and which is, or
     is a prefix of, a matching input sequence. 242) The first character, if any,
     after the input item remains unread. If the length of the input item is zero,
     the execution of the directive fails; this condition is a matching failure
     unless end-of-file, an encoding error, or a read error prevented input from
     the stream, in which case it is an input failure.

Also, in his discussion of the implementation of scanf in "The Standard C Library",
P.J. Plauger writes:

   When either fscanf or scanf obtains such an unexpected character, it pushes it
   back to the input stream.  (It also pushes back the first character beyond a
   valid field when it has to peek ahead to determine the end of the field.)  How it
   does so is similar to calling the function "ungetc".  There is a very important
   difference, however.  You cannot portably push back two characters to a stream
   with successive calls to "ungetc" (and no other intervening operations on the
   stream).  You _can_ portably follow an arbitrary call to a scan function with a
   call to "ungetc" for the same stream.

   What this means effectively is that the one-character pushback limit imposed
   on "ungetc" is not compromised by calls to the scan functions.  Either the
   implementation guarantees two or more characters of pushback to a stream, or
   it provides separate machinery for the scan functions.

   The scan functions push back at most one character. ....

In my stdio implementation, I maintain 2 characters of pushback, but only one
is available for the user via ungetc().  Care needs to be taken to distinguish
between the cases, as there are implications for the file positioning functions.
Also, the user is free to call ungetc() with a character different from the last
read.

**********************************************************************

> >   scanf-missing-digits-bug.c
> 
> Similarly.  The longest valid prefix is interpreated and the additional
> characters read but not used are ignored.

Sigh.... Look at "0xp!\n". For the format, "%f%n%s" glibc's scanf is returning 
a value of 2 with the float set to 0.000000 and the string set to "!".  Again,
this conversion _should_have_failed_.  The sequence does not match a valid
hexadecimal floating point string and by the time the matching failure is
detected, the situation is unrecoverable.

**********************************************************************

> If you don't agree with any of this bring this up with to the
> appropriate standards committee.

I realize you are busy.  Well, so am I.  But I took the time to code up
and submit 11 tests illustrating bugs I had found in glibc, in spite of
the fact that they don't affect me at all.  As I said, I only came across
them while running comparison tests.

But it is very apparent that you only glanced at the tests you are
discounting.  Furthermore, the tone of your reply could generally
be described as demeaning.  How disappointing... and unnecessary.

>  My position is fixed.

Well, unfortunately your C library doesn't seem to be...

Manuel
Follow-Ups:
- Re: multiple (> 10) glibc 2.3.2 bugs related to l10n/i18n (with tests)
  - From: Petter Reinholdtsen