This is the mail archive of the libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Should glibc provide a builtin C.UTF-8 locale?
- From: Rich Felker <dalias at libc dot org>
- To: Carlos O'Donell <carlos at redhat dot com>
- Cc: Joseph Myers <joseph at codesourcery dot com>, Paul Eggert <eggert at cs dot ucla dot edu>, GNU C Library <libc-alpha at sourceware dot org>
- Date: Thu, 12 Feb 2015 11:08:10 -0500
- Subject: Re: Should glibc provide a builtin C.UTF-8 locale?
- References: <54DB8243 dot 3050903 at redhat dot com> <54DBFA8D dot 8030107 at cs dot ucla dot edu> <54DC0546 dot 3080102 at redhat dot com> <alpine dot DEB dot 2 dot 10 dot 1502121207010 dot 10529 at digraph dot polyomino dot org dot uk> <54DCC3A2 dot 9050105 at redhat dot com>
On Thu, Feb 12, 2015 at 10:15:46AM -0500, Carlos O'Donell wrote:
> On 02/12/2015 07:12 AM, Joseph Myers wrote:
> > On Wed, 11 Feb 2015, Carlos O'Donell wrote:
> >> When I say "like C" I mean that setting "C.UTF8" in LC_ALL would
> >> ignore LANGUAGE, as is required when setting LC_ALL to "C".
> > "Like C" could also mean that ASCII characters (and probably all
> > characters) are collated in code-point order (so, for example, all
> > uppercase ASCII letters come before all lowercase). Or do you think the
> > right way to achieve that minimal extension of the C locale to UTF-8 is to
> > set only LC_CTYPE and not LC_COLLATE or LC_ALL?
> That's an open question. I expect that your instinct is correct and that
> we should collate in code-point order. There should be some deterministic
> ordering such that low-level sorting utilities work reliably.
I agree. C.UTF-8 should behave just like the C locale except with
regard to character set/encoding/identity. This follows existing
practice in the definition of "C.UTF-8" locales (AFAIK, at least) and
existing practice for "[localename].UTF-8" meaning "same as
[localename] but with UTF-8 encoding". In particular, LC_COLLATE
should be codepoint order (which is the same as strcmp order) and
LC_TIME should probably use the C-locale date formatting (ugly as it
is) rather than trying to adopt some nice international format even
though I'd rather see the latter.
Also, by analogy with en_US.UTF-8, which has (almost) all Unicode
letter-class characters identified as alphabetic even though they're
not in the English alphabet, I think the character class functions for
C.UTF-8 should also cover the whole character set, not just the
characters mandated by the C locale.
One more thing -- in the absence of ANY LC_*/LANG vars being set,
POSIX leaves the default locale for setlocale(x, "")
implementation-defined. Would it be justifiable to make C.UTF-8 the
default in this case instead of plain C, so that suppression of UTF-8
support never happens accidentally (e.g. when stripping the
environment for security), only by explicitly setting LC_ALL=C or