This is the mail archive of the cygwin mailing list for the Cygwin project.

Re: The C locale

On Sep  8 22:49, Andy Koppe wrote:
> ps:
> > Maximum 1.5 compatibility (what for and how long?)  vs. maximum
> > default usability in the long run (at least I hope so).
> Compatibility for users upgrading to 1.7, who are used to being able to
> use the non-ASCII chars in their ANSI codepage, which is usually all
> they care about. And who have files encoded in that codepage, while
> being blissfully unaware what stuff like "LC_CTYPE" or "CP1251" means.
> And who are therefore going to complain about Cygwin 1.7 breaking
> their files.
> Using UTF-8 throughout is a worthwhile aim of course, but it's a bumpy
> road to get there, with lots of apps not yet ready. Moreover, is there
> actually any other OS where the "C" locale uses UTF-8? Afaik, Linuxes
> just set LANG to *.UTF-8 somewhere in the startup scripts.

Back from vacation, I have now re-read this thread, and I have to say I
just don't know what the best course of action is here.

The idea behind using UTF-8 for filename and console operations by
default was to minimize problems when converting from UTF-16 to
multibyte, so that readdir() always returns a valid filename.  Since the filename is
supposed to be just a NUL-terminated stream of bytes, the application
shouldn't care what the filename looks like, it should just always use
it as is.  In contrast to Linux filesystems, where the filename actually
*is* a simple byte stream, we have to convert the filename back and
forth from and to UTF-16.
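
To make the conversion step concrete, here is a minimal sketch of a
UTF-16 to UTF-8 conversion in plain C.  The function name utf16_to_utf8
is hypothetical and this is not Cygwin's actual conversion code; it only
illustrates why UTF-8 can represent every well-formed UTF-16 name, with
an unpaired surrogate as the only failure case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not Cygwin's code: convert a NUL-terminated UTF-16
   string to UTF-8.  Returns the number of bytes written (excluding the
   terminating NUL), or -1 on an unpaired surrogate or too small a buffer. */
static long utf16_to_utf8(const uint16_t *src, char *dst, size_t dstlen)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    if (dstlen == 0)
        return -1;
    while (*src) {
        uint32_t cp = *src++;
        size_t need;

        if (cp >= 0xD800 && cp <= 0xDBFF) {          /* high surrogate */
            if (*src < 0xDC00 || *src > 0xDFFF)
                return -1;                           /* no low surrogate */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(*src++ - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return -1;                               /* stray low surrogate */
        }
        need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (n + need >= dstlen)                      /* keep room for NUL */
            return -1;
        if (need == 1) {
            dst[n++] = (char)cp;
        } else if (need == 2) {
            dst[n++] = (char)(0xC0 | (cp >> 6));
            dst[n++] = (char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            dst[n++] = (char)(0xE0 | (cp >> 12));
            dst[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            dst[n++] = (char)(0x80 | (cp & 0x3F));
        } else {
            dst[n++] = (char)(0xF0 | (cp >> 18));
            dst[n++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            dst[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            dst[n++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    dst[n] = '\0';
    return (long)n;
}
```

For the UTF-16 name { 'b', 0x00E4, 'h' } this produces the four bytes
0x62 0xC3 0xA4 0x68.  A conversion to a single-byte codepage has no
such guarantee for characters outside that codepage.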

As for the conversion of filenames, you get the same problem on Linux if
the filename contains non-ASCII bytes and these bytes are not a valid
multibyte character in the current locale.

Referring to another of your mails in this thread:

> A user with such a setup who upgrades to 1.7 will find that things
> will no longer work as before, since filenames are translated to UTF-8
> whereas the console now seems to use ISO-8859-1 (presumably via the mb
> functions) by default. Hence a file called 'b\344h' in Explorer (with
> a-umlaut in the middle), will show as 'bÃ¤h' instead.

That's because the console uses the ASCII conversion by default, which
in the newlib implementation just passes all bytes through unconverted,
even the >=0x80 ones.  Coincidentally, that's ISO-8859-1.  However, that
means the console uses the same conversion as the application.  Only the
filename conversion uses UTF-8.
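
A pass-through conversion of this kind can be sketched as follows.  The
function passthrough_mbstowcs is a hypothetical stand-in written for
illustration, not newlib's source; it just shows why passing bytes
through unchanged amounts to ISO-8859-1, since the code points U+0000
to U+00FF have the same values as the ISO-8859-1 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Illustrative stand-in for a pass-through "ASCII" conversion: every byte
   becomes the wide character with the same numeric value.  Converts at
   most max characters and returns the number converted. */
static size_t passthrough_mbstowcs(wchar_t *dst, const char *src, size_t max)
{
    size_t n = 0;

    assert(dst != NULL && src != NULL);
    while (n < max && src[n] != '\0') {
        /* Go through unsigned char so bytes >= 0x80 do not sign-extend. */
        dst[n] = (wchar_t)(unsigned char)src[n];
        n++;
    }
    if (n < max)
        dst[n] = L'\0';
    return n;
}
```

With the input "b\344h" the middle byte 0xE4 becomes the wide character
0x00E4, which is a-umlaut.  The UTF-8 bytes 0xC3 0xA4 for the same
letter would come out as two separate characters instead.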

> And if you try to create 'b\344h' in Cygwin 1.7, you actually get a file
> called 'b', because the '\344' (0xE4) in ISO-8859-1 turns into an
> encoding error when interpreted as UTF-8, and the name simply seems to
> be truncated at that point.

Yes, that *is* a problem.
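
The effect can be reproduced with a small checker that returns the
length of the leading part of a string that is well-formed UTF-8.  The
name utf8_valid_prefix is hypothetical, and the check is simplified (it
does not reject overlong three-byte forms or encoded surrogates).  It is
a sketch of the failure mode, not of how Cygwin performs the conversion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of leading bytes of s that form complete,
   structurally valid UTF-8 sequences.  Simplified on purpose. */
static size_t utf8_valid_prefix(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t i = 0;

    assert(s != NULL);
    while (p[i]) {
        size_t len, k;

        if (p[i] < 0x80)
            len = 1;                                 /* plain ASCII */
        else if (p[i] >= 0xC2 && p[i] <= 0xDF)
            len = 2;
        else if (p[i] >= 0xE0 && p[i] <= 0xEF)
            len = 3;
        else if (p[i] >= 0xF0 && p[i] <= 0xF4)
            len = 4;
        else
            break;                                   /* invalid lead byte */
        /* Each following byte must be a continuation byte 10xxxxxx.  A NUL
           fails this test, so the loop never reads past the terminator. */
        for (k = 1; k < len; k++)
            if ((p[i + k] & 0xC0) != 0x80)
                break;
        if (k < len)
            break;                                   /* sequence cut short */
        i += len;
    }
    return i;
}
```

For "b\344h" the byte 0xE4 announces a three-byte sequence, but the next
byte 'h' is not a continuation byte, so the valid prefix is just "b"
(length 1).  The UTF-8 spelling "b\303\244h" is valid over its full
length of 4 bytes.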

> I see two good solutions:
> - Use the default Windows codepage for filenames, console, and
> multibyte functions. This is what happens already if you specify a
> locale with a language but no charset, e.g. "en". Maximum 1.5
> compatibility.

Hmm, yes, that might be an option.  Allowing the C.UTF-8 locale
could work around the remaining problems.

> - Use UTF-8 throughout. Full Unicode support out-of-the box.

What does "throughout" mean?  Do you want ASCII multibyte conversion to
use UTF-8 as well?  Of course that will still result in problems if
a shell script has a filename hardcoded in, say, CP1252.

> And a cheap'n'nasty one:
> - Restrict the multibyte functions and console to 7-bit ASCII. Still
> means it's inconsistent with the filename conversions, but at least
> non-ASCII characters wouldn't show up wrongly. Instead, they wouldn't
> show at all.

I remember having seen this on Linux as well in some GUI applications.

Apart from that, a fourth solution is to stick with the current
implementation, which uses UTF-8 for filenames by default and relaxed
ASCII (ISO-8859-1) as provided by newlib for everything else.

The problem is, I don't know for sure what the best approach is, and it
seems nobody except you and Iwamuro is actually interested in discussing
this.  And the two of you hold opposing opinions on this matter.

Personally I have no problem with the current approach.  I understand
the potential problems, but, as usual, solving it one way results in
problems in another scenario and vice versa.


Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
