This is the mail archive of the cygwin mailing list for the Cygwin project.
Re: The C locale
On Sep 24 12:00, Corinna Vinschen wrote:
> On Sep 24 11:57, Corinna Vinschen wrote:
> > On Sep 24 18:37, IWAMURO Motonori wrote:
> > > - CP932 (Shift_JIS) has 1byte character and 2bytes character.
> > >
> > > - The range of 1byte character is 0x00-0x7F and 0xA0-0xDF.
> > >
> > > - The range of first byte of 2byte character is 0x80-0x9F and 0xE0-0xFC.
> > >
> > > - The range of second byte of 2byte character is 0x40-0x7E and 0x80-0xFC.
> > > This includes "[", "\", "]", "^", "`", "{", "|", "}".
> >
> > Ok, thanks for your examples, they show neatly where the problem is.
> >
> > As you might know, the codepage 20932 (EUC-JP) is also not the same
> > as the UNIX EUC_JP implementation. The JIS-X-0212 three byte codes
> > are folded into two-byte sequences as described in a comment in
> > strfuncs.cc:
> >
> > /* Unfortunately, the Windows eucJP codepage 20932 is not really 100%
> > compatible to eucJP. It's a cute approximation which makes it a
> > doublebyte codepage.
> > The JIS-X-0212 three byte codes (0x8f,0xa1-0xfe,0xa1-0xfe) are folded
> > into two byte codes as follows: The 0x8f is stripped, the next byte is
> > taken as is, the third byte is mapped into the lower 7-bit area by
> > masking it with 0x7f. So, for instance, the eucJP code 0x8f,0xdd,0xf8
> > becomes 0xdd,0x78 in CP 20932.
> >
> > To be really eucJP compatible, we have to map the JIS-X-0212 characters
> > between CP 20932 and eucJP ourselves. */
> >
> > My question is this: Is the S-JIS implementation on UNIX systems
> > also using a different implementation to avoid using characters
> > from the ASCII range? If so, can't we change the __sjis_wctomb
> > and __sjis_mbtowc functions in the same manner as the __eucjp_wctomb
> > and __eucjp_mbtowc functions to get a safer implementation?
>
> Hmm, as far as I can see from wikipedia, S-JIS is simply defined
> that way. Bah.
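
For reference, the folding described in the strfuncs.cc comment quoted
above can be sketched roughly like this. This is only an illustration,
not the actual Cygwin code: the function names are made up, and the
reverse direction assumes that a CP 20932 trail byte in the 7-bit range
always marks a folded JIS-X-0212 character.

```c
#include <stddef.h>

/* Hypothetical helper: fold an eucJP JIS-X-0212 sequence (0x8f, b1, b2)
   into the two-byte form used by CP 20932.  Returns 2 on success, 0 if
   the input is not a JIS-X-0212 sequence. */
static size_t
eucjp_to_cp20932 (const unsigned char in[3], unsigned char out[2])
{
  if (in[0] != 0x8f
      || in[1] < 0xa1 || in[1] > 0xfe
      || in[2] < 0xa1 || in[2] > 0xfe)
    return 0;
  out[0] = in[1];          /* 0x8f is stripped, second byte taken as is */
  out[1] = in[2] & 0x7f;   /* third byte mapped into the 7-bit area */
  return 2;
}

/* Hypothetical reverse mapping.  Assumption: a trail byte below 0x80
   identifies a folded JIS-X-0212 character.  Returns 3 on success,
   0 otherwise. */
static size_t
cp20932_to_eucjp (const unsigned char in[2], unsigned char out[3])
{
  if (in[0] < 0xa1 || in[0] > 0xfe || in[1] < 0x21 || in[1] > 0x7e)
    return 0;
  out[0] = 0x8f;
  out[1] = in[0];
  out[2] = in[1] | 0x80;
  return 3;
}
```

With the example from the comment, 0x8f,0xdd,0xf8 folds to 0xdd,0x78
and maps back to the original three bytes.
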

This leads me to another question to you and other users working with
Japanese systems.

As far as I understood this, the default ANSI and OEM codepage on
Japanese Windows systems is 932/SJIS, right? And your examples show
nicely how bad codepage 932/SJIS is from a usability perspective.
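
To illustrate the kind of breakage I mean, here is a minimal sketch. It
assumes the byte ranges exactly as given in the quoted mail, and the
names sjis_is_lead and sjis_strchr are made up for this example, they
are not existing functions. A bytewise search for "\" stops at the
second byte of a double-byte character, while a search that skips over
double-byte characters does not.

```c
#include <stddef.h>

/* Lead byte ranges as given in the quoted mail (assumption). */
static int
sjis_is_lead (unsigned char c)
{
  return (c >= 0x80 && c <= 0x9f) || (c >= 0xe0 && c <= 0xfc);
}

/* Hypothetical SJIS-aware replacement for strchr for an ASCII
   character ch: bytes that are the second half of a double-byte
   character are never matched. */
static const char *
sjis_strchr (const char *s, char ch)
{
  const unsigned char *p = (const unsigned char *) s;

  while (*p)
    {
      if (sjis_is_lead (*p) && p[1] != '\0')
        {
          p += 2;          /* skip lead byte and trail byte together */
          continue;
        }
      if (*p == (unsigned char) ch)
        return (const char *) p;
      ++p;
    }
  return NULL;
}
```

For a string starting with the two bytes 0x95 0x5c (one double-byte
character), plain strchr (s, '\\') returns s + 1, in the middle of that
character, whereas sjis_strchr only returns a real backslash.
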

Right now, if you specify a locale like "ja_JP" on your machine, that
is, without specifying the charset, Cygwin will fetch the ANSI codepage
from Windows and use that as your charset. That means, LANG="ja_JP"
will result in using the charset SJIS.

The question is this: Wouldn't it be better from a usability perspective
to avoid SJIS in this case, and to switch Cygwin to EUCJP instead?

So, for a Japanese user:

LANG="C" -> UTF-8
LANG="ja" -> EUCJP
LANG="ja_JP" -> EUCJP
LANG="ja_JP.SJIS" -> SJIS

That would mean that Cygwin actually uses SJIS *only* when SJIS is
specified explicitly.
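
In code, the proposed rule could look roughly like the following. This
is just a sketch of the table above, not existing Cygwin code: the
function name proposed_charset is hypothetical, locale modifiers such as
"@euro" are ignored, and NULL stands for "no special rule, keep deriving
the charset from the ANSI codepage as before".

```c
#include <string.h>

/* Hypothetical helper: pick the charset for a LANG value according to
   the proposal.  Returns NULL if the proposal does not cover it. */
static const char *
proposed_charset (const char *lang)
{
  const char *dot = strchr (lang, '.');

  if (dot && dot[1] != '\0')
    return dot + 1;                 /* explicit charset always wins */
  if (strcmp (lang, "C") == 0)
    return "UTF-8";
  if (strcmp (lang, "ja") == 0 || strncmp (lang, "ja_", 3) == 0)
    return "EUCJP";                 /* never SJIS implicitly */
  return NULL;
}
```
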
Is that a feasible approach?
Thanks,
Corinna
--
Corinna Vinschen Please, send mails regarding Cygwin to
Cygwin Project Co-Leader cygwin AT cygwin DOT com
Red Hat