1. Problem Statement
Modern systems need a modern encoding system to deal with global data. The old custom of parsing data as ASCII (or ISO 8859-1) is long past and has no business in the 21st century. That people still hit mojibake today is deplorable.
However, there is no way today to select UTF-8 encoding without also picking a country/language locale. Many projects hardcode en_US.UTF-8, or maybe try one or two more (like en_GB.UTF-8 and de_DE.UTF-8), before giving up and failing. This is also why distros often do not select a UTF-8 locale by default: the locale attributes that come with it are undesirable.
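The fallback pattern described above can be sketched roughly as follows. This is only an illustration of the workaround, not code from any particular project; the helper name pick_utf8_locale and the candidate list are hypothetical, and the result depends entirely on which locales are installed on the host.

```python
import locale

# Hypothetical candidate list, mirroring the locale names mentioned above.
CANDIDATES = ("en_US.UTF-8", "en_GB.UTF-8", "de_DE.UTF-8")


def pick_utf8_locale(candidates=CANDIDATES):
    """Return the first candidate that setlocale() accepts, or None.

    The function only probes: the LC_CTYPE setting in effect before the
    call is restored before returning.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
                return name
            except locale.Error:
                # Locale not installed on this system; try the next one.
                continue
        return None
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


if __name__ == "__main__":
    chosen = pick_utf8_locale()
    if chosen is None:
        print("no UTF-8 locale found; giving up")
    else:
        print("would use", chosen)
```

Note that every name in the list drags in country/language attributes that the program never asked for. That is the gap a plain C.UTF-8 is meant to close.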
Python blazed an admirable trail here by putting encoding front and center with its 3.x series. Yet it still has to guess the encoding of stdin/stdout/stderr. By making C.UTF-8 available, this can be handled gracefully.
The new locale name shall be C.UTF-8. It shall be the C locale but with UTF-8 encodings.
Setting LC_ALL=C.UTF-8 will cause LANGUAGE to be ignored, just as it is with LC_ALL=C. See guess_category_value() in intl/dcigettext.c and how it checks "C" for more details.
These will be the same as C (except that any _NL_xxx_CODESET fields will be UTF-8):
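As a rough model of the LANGUAGE rule, the decision can be sketched like this. It is a simplified Python paraphrase, not the actual C code in intl/dcigettext.c. The treatment of "C.UTF-8" is the proposed behavior rather than something guaranteed by any given glibc release, and the parameter names are illustrative.

```python
def guess_category_value(locale_name, language_env):
    """Simplified model of how the message language is chosen.

    locale_name  -- the locale in effect for the message category
    language_env -- the value of the LANGUAGE environment variable, or None
    """
    # Assumption (the proposal): C.UTF-8 is treated exactly like C,
    # so LANGUAGE is never consulted for it.
    if locale_name in ("C", "C.UTF-8"):
        return locale_name
    # For any other locale, a non-empty LANGUAGE takes priority.
    if language_env:
        return language_env
    return locale_name


if __name__ == "__main__":
    print(guess_category_value("C.UTF-8", "de:fr"))      # LANGUAGE ignored
    print(guess_category_value("en_US.UTF-8", "de:fr"))  # LANGUAGE wins
```

The point of the model is that a user who exports LANGUAGE globally still gets untranslated messages when a script forces LC_ALL=C.UTF-8, which is the same guarantee scripts rely on today with LC_ALL=C.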
These will be different from C:
LC_IDENTIFICATION: Mentioned for completeness
LC_COLLATE: Sort using the Unicode codepoint
LC_CTYPE: UTF-8 encoding
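To illustrate what sorting by Unicode codepoint means in practice, here is a small sketch. The function names and the sample words are made up for the example. It relies on two facts: Python compares str values code point by code point, and UTF-8 preserves code point order when strings are compared byte by byte.

```python
def codepoint_sort(strings):
    """Sort by Unicode code point, the ordering proposed for LC_COLLATE."""
    # Python's default str comparison is code point by code point.
    return sorted(strings)


def utf8_byte_sort(strings):
    """Sort by the raw bytes of each string's UTF-8 encoding."""
    return sorted(strings, key=lambda s: s.encode("utf-8"))


# Sample data (hypothetical): ASCII, Latin-1 range, CJK, and a
# supplementary-plane character.
WORDS = ["zebra", "Zebra", "éclair", "apple", "日本", "\U0001F600"]

if __name__ == "__main__":
    print(codepoint_sort(WORDS))
    print(codepoint_sort(WORDS) == utf8_byte_sort(WORDS))
```

Both functions produce the same order, with uppercase ASCII before lowercase and all non-ASCII text after ASCII. A code point collation therefore needs no tailoring tables and can, in principle, be as cheap as a byte comparison.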
Initially for testing/speed, we can make the locale an external file for people to generate with localedef. This will allow us to test the waters too with distros/users.
However, long term we want it to be guaranteed to be available. That means it has to be compiled in like the C locale is today. This has some implications:
- Increased static footprint due to new data structures
POSIX does not require any specific locale be the default for programs:
All implementations shall define a locale as the default locale, to be invoked when no environment variables are set, or set to the empty string. This default locale can be the POSIX locale or any other implementation-defined locale.
Glibc shall provide a way to control this default via a configure option, and the default default shall be C.UTF-8.
This default shall apply when programs initialize themselves to use the default locale; e.g. setlocale(LC_ALL, "").
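A quick way to observe which encoding the default locale resolves to on a given system is sketched below, using Python's wrapper around setlocale(LC_ALL, ""). The helper name is hypothetical, locale.nl_langinfo is only available on Unix-like platforms, and the value returned depends on the environment and the libc in use.

```python
import locale


def default_locale_codeset():
    """Initialize the locale from the environment and report its codeset.

    Returns None if the environment names a locale that is not installed.
    """
    try:
        # Equivalent to setlocale(LC_ALL, "") in C.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        return None
    return locale.nl_langinfo(locale.CODESET)


if __name__ == "__main__":
    print(default_locale_codeset())
```

Running this with LANG, LC_ALL and the other LC_* variables unset shows the implementation-defined default. Under the proposal that default would report UTF-8.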
3. Other Art
3.4. OS X
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html POSIX locale spec
https://sourceware.org/bugzilla/show_bug.cgi?id=17318 [RFE] Provide a C.UTF-8 locale by default
https://sourceware.org/bugzilla/show_bug.cgi?id=16621 LANGUAGE handling w/C.UTF-8
https://bugzilla.redhat.com/show_bug.cgi?id=902094 [RFE] Provide a C.UTF-8 locale
https://sourceware.org/ml/libc-alpha/2015-02/msg00247.html glibc discussion
http://lists.gnu.org/archive/html/bug-gnulib/2012-01/msg00342.html Discussion of gnulib behavior under OS X
- Gentoo threads
https://archives.gentoo.org/gentoo-dev/message/2ffb7ea72e6209439600c371f6fc071d Feb 2012 (en_GB.UTF-8)
https://archives.gentoo.org/gentoo-dev/message/92db31f3e5415a74d07d01aeacb38c46 Apr 2012 (serial consoles)
https://archives.gentoo.org/gentoo-dev/message/df644446e46d7fb1a048dbf165ec7866 Jul 2012 (en_US.UTF-8)
https://docs.python.org/3/library/sys.html#sys.stdin Python handling of stdin/stdout/stderr encoding