This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.


Re: [PATCH/RFA] Internationalize ctype functionality

From: Corinna Vinschen <vinschen at redhat dot com>
To: newlib at sourceware dot org
Date: Sat, 28 Mar 2009 11:51:14 +0100
Subject: Re: [PATCH/RFA] Internationalize ctype functionality
References: <20090326210123.GS12738@calimero.vinschen.de> <3862C5643B15B6468269546753EB2A9202539293@BLTSXVS01.govsolutions.com> <20090327100016.GA857@calimero.vinschen.de> <3862C5643B15B6468269546753EB2A920256E760@BLTSXVS01.govsolutions.com>
Reply-to: newlib at sourceware dot org

On Mar 27 21:13, Howland Craig D (Craig) wrote:
> OTOH, does it make sense to only do tolower and toupper but not the
> rest of the others at the same time?  (Should these tolower and toupper
> changes be tabled until later?)

What rest?  isupper/islower etc?  That's already done with my patch.

> That is, will either of them change the 1-byte value into a different
> 1-byte value?

Yes.

> Couldn't then the value just be given straight to towlower() and the
> return therefrom used directly?  (It would be much more efficient.)
> For example,
> 
>    else if (c != EOF && MB_CUR_MAX == 1)
> -    {
> -      char s[MB_LEN_MAX] = { c, '\0' };
> -      wchar_t wc;
> -      if (mbtowc (&wc, s, 1) >= 0
> -	  && wctomb (s, (wchar_t) towupper ((wint_t) wc)) == 1)
> -       c = s[0];
> +      c = (unsigned char) towupper ((wint_t) (unsigned char) c);
> -    }

No, that doesn't work.  It only works for the ISO-8859-1 charset,
because ISO-8859-1 is identical to the first 256 Unicode code points;
its upper half, 0xa0 to 0xff, is the Unicode Latin-1 Supplement block.
So only for ISO-8859-1 is the single-byte character value equal to the
wide char value.

Assume you're not using ISO-8859-1 but ISO-8859-3 (Latin 3) instead.
This codepage contains characters which are not in the Latin-1 range.
For example

  tolower(0xaf) == 0xbf

The character 0xaf in ISO-8859-3 is a Latin capital Z with a dot above.
The Unicode representation of this character is 0x017b.  The lower case
equivalent is Unicode 0x017c.  Transform it back to ISO-8859-3 and you
get 0xbf, the Latin small z with dot above in ISO-8859-3 representation.

Do as you suggest and 0xaf is taken to be Unicode 0xaf, which is the
macron, a symbol rather than a letter, which obviously has no lower
case equivalent.  Result: 0xaf, unchanged.
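
To make the difference concrete, here is a minimal, locale-independent
sketch of both approaches.  Everything in it is hypothetical and only
for illustration: latin3_to_wc(), wc_to_latin3() and sketch_towlower()
are tiny stand-ins covering ASCII plus the two code points discussed
above.  They are not the patch's code and not newlib API; real code
would call mbtowc(), towlower() and wctomb() under the current locale.

```c
#include <stddef.h>   /* wchar_t */

/* Hypothetical, deliberately partial ISO-8859-3 -> Unicode mapping.
   Only ASCII and the two bytes from the example are covered. */
static int latin3_to_wc(unsigned char c, wchar_t *wc)
{
    if (c < 0x80)  { *wc = (wchar_t) c; return 1; }
    if (c == 0xaf) { *wc = 0x017b; return 1; }   /* Z with dot above */
    if (c == 0xbf) { *wc = 0x017c; return 1; }   /* z with dot above */
    return 0;                                    /* not in this sketch */
}

/* Inverse of the partial mapping above. */
static int wc_to_latin3(wchar_t wc, unsigned char *c)
{
    if ((unsigned long) wc < 0x80) { *c = (unsigned char) wc; return 1; }
    if (wc == 0x017b) { *c = 0xaf; return 1; }
    if (wc == 0x017c) { *c = 0xbf; return 1; }
    return 0;
}

/* Stand-in for towlower(): knows ASCII and U+017B only. */
static wchar_t sketch_towlower(wchar_t wc)
{
    if (wc >= L'A' && wc <= L'Z')
        return (wchar_t) (wc + (L'a' - L'A'));
    if (wc == 0x017b)
        return 0x017c;
    return wc;
}

/* The suggested shortcut: treat the byte value as the wide char value. */
int tolower_direct(int c)
{
    return (unsigned char) sketch_towlower((wchar_t) (unsigned char) c);
}

/* The round trip: charset -> Unicode -> lower case -> charset.
   If either conversion fails, the input is returned unchanged. */
int tolower_roundtrip(int c)
{
    wchar_t wc;
    unsigned char out;

    if (latin3_to_wc((unsigned char) c, &wc)
        && wc_to_latin3(sketch_towlower(wc), &out))
        return out;
    return c;
}
```

With these stand-ins, tolower_direct(0xaf) leaves the byte untouched,
while tolower_roundtrip(0xaf) yields 0xbf.  For plain ASCII input both
give the same answer, which is why the shortcut looks right at first.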

Corinna

-- 
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat

