This is the mail archive of the cygwin mailing list for the Cygwin project.

Re: The C locale

Christopher Faylor:
>Andy Koppe:
>>Trying to reply to [banned]'s post about locale issues, I got
>>rather confused about the C locale. The manual and the POSIX standard
>>say that it supports ASCII only, so in theory anything above 0x7F
>>should be rejected. In practice though, both Cygwin 1.5 and 1.7 do
>>support characters above 0x7F in the C locale, which could be quite
>>useful. Trouble is, they do so rather inconsistently.
>>Both in 1.5 and 1.7, the mb conversion functions treat such characters
>>as ISO-8859-1. In other words, conversions between chars and wchars are
>>simple casts (except that wchars above 0xFF can't be converted). This
>>makes some sense.
>>Filename handling is different though. Cygwin 1.5 translates filenames
>>according to the system's ANSI codepage. I guess the inconsistency
>>with the mb functions didn't really matter, as the mb functions were
>>pretty much useless anyway, and supporting the system codepage was
>>more important.
>>So, with Cygwin 1.7, I'd have expected filename handling in the C
>>locale to either use ISO-8859-1 for consistency with the mb functions,
>>or the ANSI codepage for compatibility with 1.5. In actual fact
>>though, it uses UTF-8.
>>Is this on purpose? If so, shouldn't the multibyte conversion
>>functions in the C locale use UTF-8 as well?
>Since Cygwin has a clear system that it is supposed to be emulating,
>the real question is "What does Linux do?"

Tried it on Debian and Suse: the multibyte conversion functions are
strict ASCII, i.e. anything beyond 0x7F is considered an encoding error.

POSIX requires that ASCII is supported in the C locale, but does not
actually outlaw ASCII-compatible extensions beyond that.

Locales don't affect filenames on Linux, i.e. any sequence of bytes
passed to open() goes straight to disk (except for the path
separator). This effectively means that filenames are encoded in
whatever charset happened to be active at the time the file was
created. Hence anyone accessing it with a different charset setting
will get gibberish.
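
To illustrate the point about filenames being plain byte sequences, here is a minimal Python sketch. It does not touch the filesystem; it only models a name written in one charset and read back in another. The example name and the choice of ISO-8859-15 for the writer are assumptions made up for illustration.

```python
# Hypothetical example: the bytes of a filename are stored as-is, and each
# reader decodes them with whatever charset its own session uses.
written = "10\u20ac"                       # '10' followed by the euro sign
stored = written.encode("iso8859-15")      # writer's session used ISO-8859-15

# A reader whose session uses ISO-8859-1 sees a different character.
as_latin1 = stored.decode("iso8859-1")

# A reader whose session uses UTF-8 cannot decode the byte at all; here the
# undecodable byte is replaced with U+FFFD instead of raising an error.
as_utf8 = stored.decode("utf-8", errors="replace")

print(ascii(stored))     # the raw bytes on disk
print(ascii(as_latin1))  # currency sign U+00A4 instead of the euro sign
print(ascii(as_utf8))    # replacement character U+FFFD
```

The same three bytes yield three different names depending on who is looking, which is the gibberish effect described above.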

POSIX is impressively unhelpful on the topic of filenames. All it
guarantees for filenames is the "portable filename character set":
ASCII letters and digits, plus the hyphen, dot, and underscore.
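
As a rough sketch of how narrow that guarantee is, the following Python check accepts only names made up entirely of the portable filename character set. The function name is hypothetical, and the check covers the character set only, not any other portability rules.

```python
import re

# POSIX portable filename character set: A-Z a-z 0-9 . _ -
_PORTABLE = re.compile(r"[A-Za-z0-9._-]+")


def is_portable_filename(name: str) -> bool:
    """Return True if every character of name is in the portable set."""
    return _PORTABLE.fullmatch(name) is not None


print(is_portable_filename("read_me-1.txt"))   # True
print(is_portable_filename("b\u00e4h"))        # False: a-umlaut is not in the set
print(is_portable_filename("my file"))         # False: space is not in the set
```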

So altogether we've got no fewer than four choices here:
- strict ASCII (as with Linux mb functions)
- ISO-8859-1 (as with newlib mb functions)
- Default Windows ANSI/OEM codepage (as with Cygwin 1.5 filenames)
- UTF-8 (as with Cygwin 1.7 filenames)
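
To make the difference between these four choices concrete, this Python sketch decodes the same bytes under each of them. The codec names are Python's, and cp1252 merely stands in for "the default Windows ANSI codepage", which varies between systems; both are assumptions for illustration, not what Cygwin or newlib actually call.

```python
# Candidate interpretations of a byte string in the C locale.
CANDIDATES = {
    "strict ASCII": "ascii",
    "ISO-8859-1": "iso8859-1",
    "Windows ANSI (cp1252 assumed)": "cp1252",
    "UTF-8": "utf-8",
}


def interpret(raw: bytes) -> dict:
    """Decode raw under each candidate charset; None marks an encoding error."""
    results = {}
    for label, codec in CANDIDATES.items():
        try:
            results[label] = raw.decode(codec)
        except UnicodeDecodeError:
            results[label] = None
    return results


# 'b', byte 0xE4, 'h': a-umlaut in ISO-8859-1 and cp1252, invalid elsewhere.
for label, text in interpret(b"b\xe4h").items():
    print(f"{label:32} {ascii(text)}")
```

Note that ISO-8859-1 and cp1252 agree on 0xE4 but not everywhere: byte 0x80 is a control character in the former and the euro sign in the latter.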

In Cygwin 1.5, both file operations and the console use the default
Windows codepage, which often contains all the characters a user cares
about. If you set up readline for 8-bit I/O and change the console
font to something useful, this works reasonably well, including
Cygwin-created filenames showing up correctly in Explorer.

A rather important exception is 'ls', which seems to have its own
hardcoded limitation to 7 bits for the C locale: anything non-ASCII is
shown as '?' there. Things do work correctly elsewhere though, e.g. in
bash tab completion or Midnight Commander.

A user with such a setup who upgrades to 1.7 will find that things
will no longer work as before, since filenames are translated to UTF-8
whereas the console now seems to use ISO-8859-1 (presumably via the mb
functions) by default. Hence a file called 'bäh' in Explorer (with
a-umlaut in the middle) will show as 'bÃ¤h' instead.
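
The display problem can be reproduced outside Cygwin with a few lines of Python. This is only a model of the mismatch described above (filename bytes in UTF-8, console interpreting them as ISO-8859-1), not Cygwin's actual code path.

```python
name = "b\u00e4h"                     # 'b', a-umlaut, 'h' as seen in Explorer
on_disk = name.encode("utf-8")        # bytes handed over when filenames are UTF-8
shown = on_disk.decode("iso8859-1")   # what an ISO-8859-1 console makes of them

print(ascii(on_disk))   # b'b\xc3\xa4h': the umlaut became two bytes
print(ascii(shown))     # 'b\xc3\xa4h': two characters where one was expected
print(len(name), len(shown))
```

The single a-umlaut turns into two ISO-8859-1 characters (0xC3 and 0xA4), so the three-character name is displayed with four characters.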

And if you try to create 'bäh' in Cygwin 1.7, you actually get a file
called 'b', because the 'ä' (0xE4) in ISO-8859-1 turns into an
encoding error when interpreted as UTF-8, and the name simply seems to
be truncated at that point.
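
The observed truncation can be modelled as follows. This sketch assumes the conversion simply stops at the first byte that is not valid UTF-8, which matches the symptom but is a guess at the mechanism; the helper name is hypothetical.

```python
def truncate_at_invalid_utf8(raw: bytes) -> str:
    """Decode raw as UTF-8, keeping only the part before the first invalid byte."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return raw[:err.start].decode("utf-8")


typed = "b\u00e4h".encode("iso8859-1")   # console sends b'b\xe4h'
created = truncate_at_invalid_utf8(typed)

print(ascii(typed))
print(ascii(created))   # 'b'
```

In UTF-8, 0xE4 announces a three-byte sequence, but the following 'h' is not a continuation byte, so decoding fails at index 1 and only 'b' survives.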

I see two good solutions:
- Use the default Windows codepage for filenames, console, and
multibyte functions. This is what happens already if you specify a
locale with a language but no charset, e.g. "en". Maximum 1.5
compatibility.
- Use UTF-8 throughout. Full Unicode support out of the box.

And a cheap'n'nasty one:
- Restrict the multibyte functions and console to 7-bit ASCII. Still
means it's inconsistent with the filename conversions, but at least
non-ASCII characters wouldn't show up wrongly. Instead, they wouldn't
show at all.

