"C" character set (again)
Corinna Vinschen
corinna-cygwin@cygwin.com
Fri Jan 15 12:03:00 GMT 2010
On Jan 15 12:56, Corinna Vinschen wrote:
> On Jan 15 07:32, Andy Koppe wrote:
> > 2010/1/10 Corinna Vinschen:
> > > Andy Koppe wrote:
> > >> So how about leaving the initial __mbtowc and __wctomb pointers as
> > >> they are?
> > >
> > > It feels so unclean...
> >
> > Does that matter, as long as everything's cleaned up by the time the
> > actual program starts? Speaking of which, what locale context are C++
> > global constructors executed in? Is the filesystem/console charset
> > already set according to the environment by that point?
>
> Yes.
>
> >
> > Here's another concern regarding C changing to ASCII: what would a
> > user who sets LANG=C (or LANG=C.ASCII, for that matter) expect to
> > happen to filenames? Currently, anything non-ASCII would turn into
> > ^X-escaped UTF-8. However, since ASCII doesn't have anything beyond
> > 0x7F (btw, thanks for patching newlib accordingly), the ^X isn't
> > actually necessary and filenames in C(.ASCII) could just use straight
> > UTF-8 anyway.
> >
> > Therefore, would something like the patch below make sense?
>
> I've been pondering this for at least two weeks now. I'm still not sure what
> new problems we add by reverting C to ASCII. As long as the underlying
> charset is UTF-8, I don't see any problems, but that could simply be the
> result of me being too unimaginative.
>
> Anyway, I have something like your patch already in my locale code. I'm
> not setting the cygheap->locale.charset to UTF-8, though. This should
> avoid unnecessary calls to internal_setlocale in child processes. I'll
> apply that, together with setting C to ASCII by default.
>
> And a matching change to the docs.
Can you please review the patch to the docs below? I would like to
make absolutely sure that the description is comprehensive.
Thanks,
Corinna
Index: setup2.sgml
===================================================================
RCS file: /cvs/src/src/winsup/doc/setup2.sgml,v
retrieving revision 1.31
diff -u -p -r1.31 setup2.sgml
--- setup2.sgml 2 Dec 2009 09:36:54 -0000 1.31
+++ setup2.sgml 15 Jan 2010 11:59:43 -0000
@@ -201,17 +201,18 @@ manual pages on the homepage of the
<para>
At application startup, the application's locale is set to the default
-"C" or "POSIX" locale. Under Cygwin, this locale defaults to the UTF-8
-character set. If you want to stick to the "C" locale and only change to
-another charset, you can define this by setting one of the locale environment
-variables to "C.charset". For instance</para>
+"C" or "POSIX" locale. Under Cygwin 1.7.2 and later, this locale defaults
+to the ASCII character set on the application level. If you want to stick
+to the "C" locale and only change to another charset, you can define this
+by setting one of the locale environment variables to "C.charset". For
+instance</para>
<screen>
"C.ISO-8859-1"
</screen>
-<para>The default locale in the absence of the aforementioned locale
-environment variables is "C.UTF-8".</para>
+<note><para>The default locale in the absence of the aforementioned locale
+environment variables is "C.UTF-8".</para></note>
<para>Windows uses the UTF-16 charset exclusively to store the names
of any object used by the Operating System. This is especially important
@@ -244,6 +245,13 @@ lost: If the application calls setlocal
of the important locale variables set in the environment, the locale
is set to the default locale, which is "C.UTF-8".</para>
+<para>But what about applications which are not locale-aware? Per POSIX,
+they are running in the "C" or "POSIX" locale which implies the ASCII
+charset. When the charset is set to ASCII, Cygwin will still use UTF-8
+under the hood to translate filenames. This allows for easier
+interoperability with locale-aware applications running in the default
+"C.UTF-8" locale.</para>
+
<para>
Right now the language and territory, as well as the modifier, are not
important to Cygwin, except to fix a single problem. There's a class of
@@ -275,9 +283,11 @@ How does that work?</para>
<itemizedlist mark="bullet">
<listitem><para>
-The default locale is the "C" or "POSIX" locale. Under Cygwin this locale
-defaults to the UTF-8 character set.</para>
-</listitem>
+The default locale is the "C" or "POSIX" locale per the POSIX requirements.
+As described earlier, under Cygwin 1.7.2 and later this locale defaults to
+the ASCII character set on the application level and UTF-8 on the Cygwin DLL
+level for converting filenames etc.
+</para></listitem>
<listitem><para>
Assume that you've set one of the aforementioned environment variables to some
--
Corinna Vinschen Please, send mails regarding Cygwin to
Cygwin Project Co-Leader cygwin AT cygwin DOT com
Red Hat