"C" character set (again)

Corinna Vinschen corinna-cygwin@cygwin.com
Fri Jan 15 12:03:00 GMT 2010


On Jan 15 12:56, Corinna Vinschen wrote:
> On Jan 15 07:32, Andy Koppe wrote:
> > 2010/1/10 Corinna Vinschen:
> > > Andy Koppe wrote:
> > >> So how about leaving the initial __mbtowc and __wctomb pointers as
> > >> they are?
> > >
> > > It feels so unclean...
> > 
> > Does that matter, as long as everything's cleaned up by the time the
> > actual program starts? Speaking of which, what locale context are C++
> > global constructors executed in? Is the filesystem/console charset
> > already set according to the environment by that point?
> 
> Yes.
> 
> > 
> > Here's another concern regarding C changing to ASCII: what would a
> > user who sets LANG=C (or LANG=C.ASCII, for that matter) expect to
> > happen to filenames? Currently, anything non-ASCII would turn into
> > ^X-escaped UTF-8. However, since ASCII doesn't have anything beyond
> > 0x7F (btw, thanks for patching newlib accordingly), the ^X isn't
> > actually necessary and filenames in C(.ASCII) could just use straight
> > UTF-8 anyway.
> > 
> > Therefore, would something like the patch below make sense?
> 
> I've been pondering this for at least two weeks now.  I'm still not sure what
> new problems we add by reverting C to ASCII.  As long as the underlying
> charset is UTF-8, I don't see any problems, but that could simply be the
> result of me being too unimaginative.
> 
> Anyway, I have something like your patch already in my locale code.  I'm
> not setting the cygheap->locale.charset to UTF-8, though.  This should
> avoid unnecessary calls to internal_setlocale in child processes.  I'll
> apply that, together with setting C to ASCII by default.
> 
> And a matching change to the docs.

Can you please review the patch to the docs below?  I would like to
make absolutely sure that the description is comprehensive.


Thanks,
Corinna


Index: setup2.sgml
===================================================================
RCS file: /cvs/src/src/winsup/doc/setup2.sgml,v
retrieving revision 1.31
diff -u -p -r1.31 setup2.sgml
--- setup2.sgml	2 Dec 2009 09:36:54 -0000	1.31
+++ setup2.sgml	15 Jan 2010 11:59:43 -0000
@@ -201,17 +201,18 @@ manual pages on the homepage of the
 
 <para>
 At application startup, the application's locale is set to the default
-"C" or "POSIX" locale.  Under Cygwin, this locale defaults to the UTF-8
-character set.  If you want to stick to the "C" locale and only change to
-another charset, you can define this by setting one of the locale environment
-variables to "C.charset".  For instance</para>
+"C" or "POSIX" locale.  Under Cygwin 1.7.2 and later, this locale defaults
+to the ASCII character set on the application level.  If you want to stick
+to the "C" locale and only change to another charset, you can define this
+by setting one of the locale environment variables to "C.charset".  For
+instance</para>
 
 <screen>
   "C.ISO-8859-1"
 </screen>
 
-<para>The default locale in the absence of the aforementioned locale
-environment variables is "C.UTF-8".</para>
+<note><para>The default locale in the absence of the aforementioned locale
+environment variables is "C.UTF-8".</para></note>
 
 <para>Windows uses the UTF-16 charset exclusively to store the names
 of any object used by the Operating System.  This is especially important
@@ -244,6 +245,13 @@ lost:  If the application calls setlocal
 of the important locale variables set in the environment, the locale
 is set to the default locale, which is "C.UTF-8".</para>
 
+<para>But what about applications which are not locale-aware?  Per POSIX,
+they are running in the "C" or "POSIX" locale which implies the ASCII
+charset.  When the charset is set to ASCII, Cygwin will still use UTF-8
+under the hood to translate filenames.  This allows for easier
+interoperability with locale-aware applications running in the default
+"C.UTF-8" locale.</para>
+
 <para>
 Right now the language and territory, as well as the modifier, are not
 important to Cygwin, except to fix a single problem.  There's a class of
@@ -275,9 +283,11 @@ How does that work?</para>
 <itemizedlist mark="bullet">
 
 <listitem><para>
-The default locale is the "C" or "POSIX" locale.  Under Cygwin this locale
-defaults to the UTF-8 character set.</para>
-</listitem>
+The default locale is the "C" or "POSIX" locale per the POSIX requirements.
+As described earlier, under Cygwin 1.7.2 and later this locale defaults to
+the ASCII character set on the application level and UTF-8 on the Cygwin DLL
+level for converting filenames etc.
+</para></listitem>
 
 <listitem><para>
 Assume that you've set one of the aforementioned environment variables to some

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat


