"C" character set (again)

Andy Koppe andy.koppe@gmail.com
Tue Dec 29 14:11:00 GMT 2009


2009/12/29 Eric Blake:
>> Following the "printf treats differently a string constant and a
>> character array" issue at
>> http://cygwin.com/ml/cygwin/2009-12/msg01009.html, I'm wondering again
>> whether the "C" locale shouldn't go back to using ASCII rather than
>> UTF-8, to avoid surprises like that and also to fit with many people's
>> expectation that "C" means ASCII. I think that would save us a bunch
>> of trouble and pointless legal/religious discussions about the C
>> locale.
>
> Bytes with the 8th bit set are not portable in the C locale, regardless of
> whether that locale uses ASCII or UTF-8 encoding.  Yes, we will have to
> field complaints from users with non-portable programs. But I don't think
> we have to change back to ASCII - we are doing those users a service by
> making them fix their portability bugs.

Trouble is, Cygwin is currently the only significant platform where
plain "C" implies UTF-8, as far as I know. While I agree that POSIX
does allow it, this makes it look more like a Cygwin problem than a
portability problem from the user's perspective, and they are
certainly not going to thank us for that in any case.

Following the introduction of the "C.UTF-8" default locale, we no
longer need "C" to imply UTF-8, so we're causing ourselves
unnecessary pain by sticking with that. There've been several user
questions on this already, as well as problems with autoconf and gcc
test cases that assumed that C means ASCII, and complaints on
legal/philosophical grounds from Thomas Dickey and others. And if the
Debian thread discussing the introduction of C.UTF-8 is anything to go
by, there's going to be a lot more of the latter. (See
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=522776)

I'm running Cygwin with the patch posted above right now, and things
are working fine. Everything that cares about charsets uses UTF-8 as
before, as do the filesystem, the console, and also the conversion of
the initial environment. The difference is that worries about 8-bit
cleanness in programs that don't call setlocale or that explicitly set
the C locale go away.
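
To make the "8-bit cleanness" point concrete, here is a minimal sketch
of how one might check what a given locale reports as its charset and
whether a single byte converts in it. The helper names codeset_of and
convert_one are made up for illustration and are not part of any
patch; the result for bytes with the 8th bit set differs between
platforms and C library versions, so treat this as a probe rather
than a statement of what any system does.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: switch LC_CTYPE to the named locale and return
   the codeset name it reports, or NULL if the locale is unavailable. */
const char *codeset_of(const char *locale_name)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}

/* Hypothetical helper: convert one multibyte character in the current
   locale. Returns whatever mbrtowc returns: the number of bytes
   consumed, (size_t)-1 for an invalid sequence, or (size_t)-2 for an
   incomplete one. */
size_t convert_one(const char *bytes, size_t len, wchar_t *out)
{
    mbstate_t state;

    memset(&state, 0, sizeof state);   /* start in the initial shift state */
    return mbrtowc(out, bytes, len, &state);
}
```

Calling codeset_of("C") and then convert_one("\xe9", 1, &wc) shows
whether a lone high byte is accepted when the program sets the C
locale explicitly. If "C" implies UTF-8, that byte is an incomplete
or invalid sequence, which is exactly the sort of surprise at issue.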

Again, I agree that POSIX doesn't require it, but since Cygwin aims
for GNU/Linux compatibility in addition to POSIX, I think this is a
change worth making.


> On the other hand, I wonder if it may be possible to special case the
> C.UTF-8 locale to treat invalid byte sequences as pseudo-characters, such
> that we can achieve 8-bit transparency in character contexts such as
> printf rather than failing with EILSEQ.  But such special-casing should be
> reserved for C.UTF-8; locales like en_US.UTF-8 should still fail with
> EILSEQ on invalid sequences.

That seems hacky and inconsistent.
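
For reference, the EILSEQ behaviour under discussion can be observed
with a sketch along these lines. The function name
rejects_with_eilseq is hypothetical, and the locale names passed to
it are assumptions about what is installed on a given system, which
is why an unavailable locale is reported separately.

```c
#include <errno.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if the named locale rejects the given
   bytes with EILSEQ, 0 if mbrtowc does not report EILSEQ for them, and
   -1 if the locale cannot be selected on this system. */
int rejects_with_eilseq(const char *locale_name, const char *bytes, size_t len)
{
    mbstate_t state;
    wchar_t wc;
    size_t n;

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    memset(&state, 0, sizeof state);   /* initial conversion state */
    errno = 0;
    n = mbrtowc(&wc, bytes, len, &state);

    return (n == (size_t)-1 && errno == EILSEQ) ? 1 : 0;
}
```

The byte 0xFF never appears in valid UTF-8, so
rejects_with_eilseq("en_US.UTF-8", "\xff", 1) is a convenient probe
where that locale exists. The special-casing proposed above would
make the same call behave differently for C.UTF-8 only.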

Andy


