[PATCH/RFA] Internationalize ctype functionality

Corinna Vinschen vinschen@redhat.com
Fri Mar 27 13:22:00 GMT 2009


On Mar 26 21:55, Howland Craig D (Craig) wrote:
> 1)  Wouldn't it be cleaner, especially in files in which it happens more
> than once, to replace things like:
>  
> #ifdef __CYGWIN__
> char __declspec(dllexport) *__ctype_ptr__ = _ctype_b + 127;
> [...]
> char DLLEXPORT *__ctype_ptr__ = _ctype_b + 127;
> 
> (given that the only differences on the lines is the dll attribute)?
> This would not only make ctype_.c more readable, but more maintainable.

I didn't change that.  It's just as it was in the original code.
It's Jeff's call.

> 2)  I don't entirely understand the following, possibly due to my lack
> of knowledge on the topic:
> >- The toupper and tolower functions are now charset independent.  If
> >  the character is > 0x7f, it will be converted to wide char and then
> >  towupper/towlower is called on it.
> >  This is only a temporary solution.  It works, but it's a bit sedated
> >  for native characters.  In the long run we should rather add
> >  upper/lower-case transformation tables, similar to the new ctype
> >  character class tables.
> toupper and tolower operate on regular characters, which have a defined
> range of unsigned-char-allowed-values and EOF.  How can it work to
> change it to a wide character except in the degenerate case when wide
> characters are the same width as regular characters?

What the code does is this:

  if (mbtowc (&wc, s, 1) >= 0

- If the character is convertible to a wide char

      && wctomb (s, (wchar_t) towupper ((wint_t) wc)) == 1)

- And the towupper (or towlower) of the result can be converted back
  to a single byte char

    c = s[0];

- Use it.  The wide char conversion is lossless.  If the conversion
  works and the result is a single-byte char, it's used, otherwise c is
  returned unchanged.  I don't see a problem with this approach.
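
As a rough illustration of the round trip described above, here is a
standalone sketch.  The name upper_via_wide is hypothetical and not part
of the patch.  Unlike the patch, it sends ASCII through the wide path
too and returns the result via an unsigned char cast so it stays in the
0..UCHAR_MAX range; both are choices of this sketch.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>   /* EOF */
#include <stdlib.h>  /* mbtowc, wctomb */
#include <wchar.h>
#include <wctype.h>  /* towupper */

/* Hypothetical helper, not part of the patch: uppercase a single-byte
   character by going through wchar_t and back.  */
int
upper_via_wide (int c)
{
  char s[MB_LEN_MAX + 1] = { 0 };
  wchar_t wc;

  /* Same domain as toupper: EOF or a value representable as
     unsigned char.  */
  assert (c == EOF || (c >= 0 && c <= UCHAR_MAX));
  if (c == EOF)
    return c;

  s[0] = (char) c;
  /* Step 1: the byte must be convertible to a wide char.
     Step 2: the uppercased wide char must fit in exactly one byte.  */
  if (mbtowc (&wc, s, 1) >= 0
      && wctomb (s, (wchar_t) towupper ((wint_t) wc)) == 1)
    return (unsigned char) s[0];

  /* Either conversion failed: hand the input back unchanged.  */
  return c;
}
```

What happens to bytes above 0x7f depends on the current locale's
charset; if either conversion fails, the input comes back unchanged.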

>   That is, should
> it be gated by a check that MB_CUR_MAX == 1?

I'm not quite sure.  While POSIX states that the incoming int must be
representable as an unsigned char, it doesn't explicitly state that
this unsigned char must be from a single-byte charset.

OTOH, all the other isalpha/isprint/etc functions only work for
single-byte chars anyway.  And if we start using transformation tables
at some point...
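
For comparison, a table-driven variant could look roughly like the
sketch below.  It is only an assumption about what such a table might
be, shown for ISO-8859-1 and an ASCII host; the names latin1_upper,
init_latin1_upper and table_toupper are hypothetical and exist neither
in newlib nor in the patch.

```c
#include <stdio.h>   /* EOF */

/* Hypothetical 256-entry uppercase table for ISO-8859-1.  */
static unsigned char latin1_upper[256];
static int latin1_upper_ready;

static void
init_latin1_upper (void)
{
  int i;

  /* Start with the identity mapping.  */
  for (i = 0; i < 256; i++)
    latin1_upper[i] = (unsigned char) i;
  /* ASCII letters (assumes an ASCII-based execution charset).  */
  for (i = 'a'; i <= 'z'; i++)
    latin1_upper[i] = (unsigned char) (i - 'a' + 'A');
  /* 0xE0..0xFE are lowercase letters, except 0xF7 (division sign).
     0xDF and 0xFF have no uppercase form within ISO-8859-1.  */
  for (i = 0xe0; i <= 0xfe; i++)
    if (i != 0xf7)
      latin1_upper[i] = (unsigned char) (i - 0x20);
  latin1_upper_ready = 1;
}

/* Table lookup; EOF and out-of-range values pass through unchanged.  */
int
table_toupper (int c)
{
  if (c == EOF || c < 0 || c > 255)
    return c;
  if (!latin1_upper_ready)
    init_latin1_upper ();
  return latin1_upper[c];
}
```

With this table, table_toupper (0xe9) yields 0xc9, while 0xf7 and 0xff
map to themselves.  A real implementation would presumably ship the
tables as constant data per charset instead of building them at run
time.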

> 3)  (both toupper.c and tolower.c do this)
> [...]
> +  if ((unsigned char) c <= 0x7f) 
> +    return isupper (c) ? c - 'A' + 'a' : c;
> +  char s[8] = { c, '\0' };
> +  wchar_t wc;
> +  if (mbtowc (&wc, s, 1) >= 0
> +      && wctomb (s, (wchar_t) towlower ((wint_t) wc)) == 1)
> +    c = s[0];
>  
> The char s[8] and wchar_t lines will not work, coming in the middle
> of a block, unless the compiler is C99 compliant.  Does Newlib assume
> (require) C99 compilers?  (I hope so, but don't think so.)

That was an oversight.  I originally created that code for Cygwin, and it
uses what gcc provides.  I'll fix that together with the constant 8 in
s[8], which should actually be MB_LEN_MAX, and add a check for
EOF (which is in the domain of tolower/toupper per POSIX).  The new
patch for tolower/toupper looks like this now:

Index: libc/ctype/tolower.c
===================================================================
RCS file: /cvs/src/src/newlib/libc/ctype/tolower.c,v
retrieving revision 1.2
diff -u -p -r1.2 tolower.c
--- libc/ctype/tolower.c	28 Oct 2005 21:33:22 -0000	1.2
+++ libc/ctype/tolower.c	27 Mar 2009 09:55:42 -0000
@@ -46,10 +46,31 @@ No supporting OS subroutines are require
 
 #include <_ansi.h>
 #include <ctype.h>
+#ifdef _MB_CAPABLE
+#include <limits.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <wctype.h>
+#include <wchar.h>
+#endif
 
 #undef tolower
 int
 _DEFUN(tolower,(c),int c)
 {
-	return isupper(c) ? (c) - 'A' + 'a' : c;
+#ifdef _MB_CAPABLE
+  if ((unsigned char) c <= 0x7f) 
+    return isupper (c) ? c - 'A' + 'a' : c;
+  else if (c != EOF && MB_CUR_MAX == 1)
+    {
+      char s[MB_LEN_MAX] = { c, '\0' };
+      wchar_t wc;
+      if (mbtowc (&wc, s, 1) >= 0
+	  && wctomb (s, (wchar_t) towlower ((wint_t) wc)) == 1)
+	c = s[0];
+    }
+  return c;
+#else
+  return isupper(c) ? (c) - 'A' + 'a' : c;
+#endif
 }
Index: libc/ctype/toupper.c
===================================================================
RCS file: /cvs/src/src/newlib/libc/ctype/toupper.c,v
retrieving revision 1.2
diff -u -p -r1.2 toupper.c
--- libc/ctype/toupper.c	28 Oct 2005 21:33:22 -0000	1.2
+++ libc/ctype/toupper.c	27 Mar 2009 09:55:42 -0000
@@ -45,10 +45,31 @@ No supporting OS subroutines are require
 
 #include <_ansi.h>
 #include <ctype.h>
+#ifdef _MB_CAPABLE
+#include <limits.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <wctype.h>
+#include <wchar.h>
+#endif
 
 #undef toupper
 int
 _DEFUN(toupper,(c),int c)
 {
-  return islower(c) ? c - 'a' + 'A' : c;
+#ifdef _MB_CAPABLE
+  if ((unsigned char) c <= 0x7f)
+    return islower (c) ? c - 'a' + 'A' : c;
+  else if (c != EOF && MB_CUR_MAX == 1)
+    {
+      char s[MB_LEN_MAX] = { c, '\0' };
+      wchar_t wc;
+      if (mbtowc (&wc, s, 1) >= 0
+	  && wctomb (s, (wchar_t) towupper ((wint_t) wc)) == 1)
+	c = s[0];
+    }
+  return c;
+#else
+  return islower (c) ? c - 'a' + 'A' : c;
+#endif
 }
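
A quick way to sanity-check the domain handling mentioned above (ASCII
range plus EOF) is a small loop like the following.  This is only a
sketch; count_tolower_mismatches is a hypothetical name, and it assumes
the default "C" locale and an ASCII host.

```c
#include <ctype.h>
#include <stdio.h>   /* EOF */

/* Hypothetical check, not part of the patch: count the inputs for
   which tolower deviates from plain ASCII behaviour.  */
int
count_tolower_mismatches (void)
{
  int c, bad = 0;

  /* For 0..0x7f the result must match the ASCII arithmetic.  */
  for (c = 0; c <= 0x7f; c++)
    {
      int expected = (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
      if (tolower (c) != expected)
        bad++;
    }
  /* EOF is in the domain and must be returned unchanged.  */
  if (tolower (EOF) != EOF)
    bad++;
  return bad;
}
```

A return value of 0 means the ASCII range and EOF behave as expected;
bytes above 0x7f are left out because their mapping depends on the
charset in use.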


Corinna

-- 
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat


