C.UTF-8 locale

1. Problem Statement

Modern systems need a modern encoding to deal with global data. The old custom of parsing data as ASCII (or ISO 8859-1) is long past and has no business in the 21st century. That people still hit mojibake today is deplorable.

However, there is no way today to select UTF-8 encoding without also picking a country/language locale. Many projects hardcode en_US.UTF-8, or maybe try one or two more (like en_GB.UTF-8 and de_DE.UTF-8), before giving up and failing. This is also why distros often do not select a UTF-8 locale by default: the country/language attributes that come with it are undesirable.
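The fallback pattern described above tends to look something like the following sketch. The function name probe_utf8_locale and the candidate list are illustrative only and are not taken from any particular project; which names succeed depends entirely on the locales installed on the host.

```python
import locale

# Illustrative candidates only; a real project's list will differ.
CANDIDATES = ("en_US.UTF-8", "en_GB.UTF-8", "de_DE.UTF-8")


def probe_utf8_locale(candidates=CANDIDATES):
    """Return the first candidate that setlocale() accepts, else None.

    The process locale is restored before returning, so this only probes.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query without changing
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue  # not available on this system; try the next one
            return name
        return None  # every guess failed: the "give up" case
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)
```

With a guaranteed C.UTF-8, a guessing loop of this kind would not be needed: a program could request that one name and expect it to exist.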

Python blazed an admirable trail here by putting encoding front and center with its 3.x series. Yet it still has to guess the encoding of stdin/stdout/stderr. Making C.UTF-8 available would let it handle this gracefully.
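To see what Python settled on for the standard streams in a given environment, a small script along these lines can help. It is only a diagnostic sketch: the helper name stdio_encodings is made up for this example, and the values it reports vary with the Python version and the locale environment variables in effect.

```python
import locale
import sys


def stdio_encodings():
    """Report the encoding Python chose for each standard stream."""
    streams = {"stdin": sys.stdin, "stdout": sys.stdout, "stderr": sys.stderr}
    report = {
        # A stream may be None or replaced by an object without .encoding.
        name: getattr(stream, "encoding", None)
        for name, stream in streams.items()
    }
    # The encoding Python derives from the current locale settings.
    report["preferred"] = locale.getpreferredencoding(False)
    return report


if __name__ == "__main__":
    for name, encoding in stdio_encodings().items():
        print(f"{name}: {encoding}")
```

Running it under different LC_ALL settings shows how much the result depends on the environment rather than on the program.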

2. Proposal

The world has largely settled on the Unicode standard with UTF-8 as the leading encoding format. Hence we will provide an amalgamation of POSIX's C locale with UTF-8 encoding.

The new locale name shall be C.UTF-8. It shall be the C locale but with UTF-8 encodings.

Setting LC_ALL=C.UTF-8 will cause LANGUAGE to be ignored, just as setting LC_ALL=C does. See guess_category_value() in intl/dcigettext.c and how it checks "C" for more details.
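The intended rule can be modelled roughly as below. This is a simplified sketch of the decision, not glibc's actual code: the names message_language and LANGUAGE_EXEMPT are hypothetical, and the real function deals with details that are left out here.

```python
# Locale names for which LANGUAGE is ignored. "C" is the existing
# behavior; "C.UTF-8" is the addition this proposal asks for.
LANGUAGE_EXEMPT = ("C", "C.UTF-8")


def message_language(locale_name, language_env):
    """Simplified model of how the message language is chosen.

    locale_name:  the locale in effect for the message category
    language_env: the value of LANGUAGE, or None/"" if unset
    """
    if locale_name in LANGUAGE_EXEMPT:
        return locale_name  # LANGUAGE is not consulted at all
    if language_env:
        return language_env  # colon-separated priority list, e.g. "de:fr"
    return locale_name
```

Under this model, message_language("C.UTF-8", "de:fr") yields "C.UTF-8", the same way "C" is treated, while any other locale defers to LANGUAGE when it is set.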

These will be the same as C (except that any _NL_xxx_CODESET fields will be UTF-8):

These will be different from C:

2.1. Builtin

Initially, for testing/speed, we can make the locale an external file for people to generate with localedef. This will also allow us to test the waters with distros/users.

However, long term we want it to be guaranteed to be available. That means it has to be compiled in like the C locale is today. This has some implications:

2.2. Defaults

POSIX does not require any specific locale be the default for programs:

Glibc shall provide a way to control this default via a configure option, and that option's own default shall be C.UTF-8.

This default shall apply when programs initialize themselves to use the default locale; e.g. setlocale(LC_ALL, "").
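The lookup that an empty locale name triggers can be sketched as follows. The precedence (LC_ALL, then the category's own variable, then LANG) follows POSIX. The function name resolve_locale and the builtin_default parameter are assumptions made for this illustration; the parameter stands in for the configure-time default proposed here.

```python
import os


def resolve_locale(category, env=None, builtin_default="C.UTF-8"):
    """Model which locale name setlocale(category, "") would end up using.

    category:        name of the category variable, e.g. "LC_CTYPE"
    env:             mapping of environment variables (os.environ if None)
    builtin_default: hypothetical stand-in for the proposed built-in default
    """
    if env is None:
        env = os.environ
    for variable in ("LC_ALL", category, "LANG"):
        value = env.get(variable)
        if value:  # unset and empty are treated the same way
            return value
    return builtin_default  # nothing set: fall back to the built-in default
```

For example, resolve_locale("LC_CTYPE", {}) returns "C.UTF-8" under this model, whereas any of the three variables, if set, takes priority over the built-in default.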

3. Other Art

3.1. POSIX

TODO

3.2. Debian

TODO

3.3. Fedora/RedHat

TODO https://lists.fedoraproject.org/pipermail/glibc/2015-September/000081.html

3.4. OS X

TODO

4. References

None: Proposals/C.UTF-8 (last edited 2015-11-11 18:24:33 by MikeFrysinger)