[PATCH/RFA] Distinguish between EOF and character with value 0xff

Eric Blake ebb9@byu.net
Wed Apr 22 00:30:00 GMT 2009


Howland Craig D (Craig <howland <at> LGSInnovations.com>) writes:

> 
> Summary:  Yes, we should do this, but there is still a possible problem
> if programmers are not careful--and the library can do nothing to avoid
> that potential problem; this is really only trading one problem for
> another.

It is correcting a real problem for programs that obey the spec, and throwing 
the undefined behavior back on the shoulders of programs that disobey the 
spec.  C99 and POSIX are clear that the isFOO macros/functions operate on EOF 
or on data treated as 'unsigned char'.  The fact that we are nice to the 
programmer in 127 out of the 128 cases where they use 'signed char' (or 'char' 
on platforms where it is signed) is a QoI issue; in reality, the user is 
already treading on thin ice by trying to pass a signed value into an isFOO 
call.
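
To make the one bad case concrete (a minimal sketch of my own, assuming the 
usual behavior where converting the byte 0xFF to a signed char yields -1):

#include <stdio.h>

int main (void)
{
  signed char ch = (signed char) 0xff;  /* the byte 0xFF in a signed char */
  int as_int = ch;                      /* promotes to -1 when passed to a function */

  /* EOF is -1, so by the time the value reaches an isFOO call the callee
     cannot tell whether it was handed EOF or the character 0xFF.  */
  printf ("%d %d\n", as_int, as_int == EOF);   /* prints "-1 1" */
  return 0;
}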

> 
> Details:
> It is impossible for the character functions to differentiate between
> EOF==-1 and 0xFF if the character comes from a signed char, because the
> 0xFF is really -1 for signed char and gets promoted to int -1 for
> passing.

No, you are disobeying the standards if you do:

signed char ch;
isalpha (ch);

The standards are explicit that you must do one of these instead:

isalpha ((unsigned char) ch);

int c = (unsigned char) ch;
isalpha (c);

> The problem also applies to the macros as they currently are defined.
> For example:
> 
> #define isalpha(c)      ((__ctype_ptr__)[(unsigned)((c)+1)]&(_U|_L))

Not a library bug.  According to the standards, since a compliant program is 
already passing in a value in [-1,255], the index lands in the range [0,256].  
For buggy programs, which pass in a signed char (and thus a range of 
[-128,127], with -1 duplicated), all bets are already off.
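
Spelling out the index that (unsigned)((c)+1) computes (my illustration, not 
part of the original macro):

  EOF (-1)                    ->  (unsigned) 0            the EOF slot
  0x00 .. 0xFF                ->  (unsigned) 1 .. 256     the compliant range
  (signed char) 0xFF (== -1)  ->  (unsigned) 0            collides with EOF's slot
  (signed char) -100          ->  (unsigned)(-99)         a huge out-of-bounds index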

> An example showing how the problem could surface:
> 
> #include <stdio.h>
> ...
> int i, len;
> signed char buf[80];	// or plain "char" when char defaults to signed
> FILE *fp=fopen(...);
> [change to new locale with 8-bit characters in which 0xFF is legal and
>  considered a control character]
> len = fread(buf, sizeof(buf[0]), sizeof(buf), fp);
> for(i=0; i<len; i++)  if(iscntrl(buf[i])) something(i);

Fix the bug in your program; it's not the library's fault.
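
For completeness (my sketch, not part of the original message), the fix on the 
user's side is the cast the standards require:

for (i = 0; i < len; i++)
  if (iscntrl ((unsigned char) buf[i]))
    something (i);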

> Should any attempt be made to make it cleaner for those who don't?

One way might be to write it in such a way that, for decent compilers, we 
trigger warnings.  Gcc is already smart enough to warn the user when a plain 
'char' is used as an array subscript:

$ cat foo.c
char ch;
int array[1];
int main ()
{
  int i = array[ch];
  return i;
}
$ gcc -Wall -c -o foo.o foo.c
foo.c: In function 'main':
foo.c:6: warning: array subscript has type 'char'

[Hmm.  gcc doesn't warn if ch is 'signed char' - time to file a gcc bug report.]

Maybe the trick is to rewrite the macros such that the original input is used 
directly as an array subscript; it is too late after the addition (at which 
point the value has been promoted to int if it wasn't already).  Untested:

#define isalpha(c) ((*((&__ctype_ptr__[c])+1))&(_U|_L))
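
A self-contained sketch of that idea, using stand-in names (my_ctype_table, 
MY_U, MY_L) rather than newlib's real internals; whether gcc's 
-Wchar-subscripts warning fires for this exact form is an assumption, not 
something I have verified:

/* Stand-ins for illustration only; not newlib's real table or flag bits.  */
static const char my_ctype_table[1 + 256];
#define MY_U 0x01
#define MY_L 0x02

/* Same shape as the rewrite above: 'c' itself is the array subscript, so
   the compiler still sees its type before any promotion hides it.  */
#define my_isalpha(c) ((*((&my_ctype_table[(c)]) + 1)) & (MY_U | MY_L))

int check (char ch)
{
  return my_isalpha (ch);   /* expect: "array subscript has type 'char'" */
}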

> Which at best can only be a partial fix because the functions cannot be
> fixed.  But then if we count on programmers to use unsigned char only,
> then there is no real reason to allow for the -128 to -2 values.  (And
> allowing them could even be considered a disservice, as it makes the
> problem less likely to happen and therefore harder to find.  No, I'm
> not suggesting that we back that part out, but simply pointing out that
> it is of questionable overall value.)

With Corinna's patch, newlib now behaves like glibc, providing the same 
behavior for [-128,-2] as a QoI point for broken programs, but it is unable to 
help those buggy programs when they reach (signed char) 0xff.

-- 
Eric Blake




