This is the mail archive of the mailing list for the glibc project.

Re: Should glibc provide a builtin C.UTF-8 locale?

On Thu, Feb 12, 2015 at 10:15:46AM -0500, Carlos O'Donell wrote:
> On 02/12/2015 07:12 AM, Joseph Myers wrote:
> > On Wed, 11 Feb 2015, Carlos O'Donell wrote:
> > 
> >> When I say "like C" I mean that setting "C.UTF8" in LC_ALL would
> >> ignore LANGUAGE, as is required when setting LC_ALL to "C".
> > 
> > "Like C" could also mean that ASCII characters (and probably all 
> > characters) are collated in code-point order (so, for example, all 
> > uppercase ASCII letters come before all lowercase).  Or do you think the 
> > right way to achieve that minimal extension of the C locale to UTF-8 is to 
> > set only LC_CTYPE and not LC_COLLATE or LC_ALL?
> That's an open question. I expect that your instinct is correct and that
> we should collate in code-point order. There should be some deterministic
> ordering such that low-level sorting utilities work reliably.

I agree. C.UTF-8 should behave just like the C locale except with
regard to character set/encoding/identity. This follows existing
practice in the definition of "C.UTF-8" locales (AFAIK, at least) and
existing practice for "[localename].UTF-8" meaning "same as
[localename] but with UTF-8 encoding". In particular, LC_COLLATE
should be codepoint order (which, for UTF-8, is the same as strcmp
order). LC_TIME should probably use the C-locale date formatting (ugly
as it is) rather than trying to adopt some nice international format,
even though I'd rather see the latter.
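
To illustrate the collation point, here is a minimal sketch, assuming
only standard C library calls. The helper names (sign,
collate_in_c_locale, compare_bytes) are made up for this example and
are not part of glibc. It shows that strcoll() under the plain "C"
locale puts uppercase ASCII before lowercase, and that a byte-wise
strcmp() on UTF-8 strings orders them by codepoint.

```c
#include <locale.h>
#include <string.h>

/* Normalize a comparison result to -1, 0 or 1. */
static int sign(int v)
{
    return (v > 0) - (v < 0);
}

/* Hypothetical helper: compare two strings with strcoll() while
 * LC_COLLATE is set to the plain "C" locale. */
static int collate_in_c_locale(const char *a, const char *b)
{
    setlocale(LC_COLLATE, "C");
    return sign(strcoll(a, b));
}

/* Hypothetical helper: compare two strings byte by byte. strcmp()
 * compares bytes as unsigned char, and UTF-8 is designed so that
 * this byte order matches Unicode codepoint order. */
static int compare_bytes(const char *a, const char *b)
{
    return sign(strcmp(a, b));
}
```

For example, collate_in_c_locale("Z", "a") is negative, and
compare_bytes("z", "\xc3\xa9") is negative because U+007A sorts
before U+00E9 (encoded as the bytes C3 A9).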

Also, by analogy with en_US.UTF-8, which has (almost) all Unicode
letter-class characters identified as alphabetic even though they're
not in the English alphabet, I think the character class functions for
C.UTF-8 should also cover the whole character set, not just the
characters mandated by the C locale.
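
A rough way to probe this on a given system is sketched below. The
function name e_acute_is_alpha is hypothetical, and the sketch assumes
that wchar_t values are Unicode codepoints (true on glibc). Whether a
locale named "C.UTF-8" is installed, and what it reports, depends on
the system, so no particular result is claimed here.

```c
#include <locale.h>
#include <wctype.h>

/* Hypothetical helper: report whether U+00E9 (LATIN SMALL LETTER E
 * WITH ACUTE) is classified as alphabetic under the named locale.
 * Returns 1 if alphabetic, 0 if not, and -1 if the locale could not
 * be selected (for example because it is not installed).
 * Assumes wchar_t holds Unicode codepoints, as it does on glibc. */
static int e_acute_is_alpha(const char *locale_name)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    return iswalpha((wint_t)0x00E9) != 0;
}
```

Calling e_acute_is_alpha("C") and e_acute_is_alpha("C.UTF-8") side by
side shows how a particular installation treats the two locales.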

One more thing -- in the absence of ANY LC_*/LANG vars being set,
POSIX leaves the default locale for setlocale(x, "")
implementation-defined. Would it be justifiable to make C.UTF-8 the
default in this case instead of plain C, so that suppression of UTF-8
support never happens accidentally (e.g. when stripping the
environment for security), only by explicitly setting LC_ALL=C or
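
The implementation-defined default mentioned above can be observed
with a small sketch like the following. The function name
default_locale_name is hypothetical. It clears only the POSIX locale
variables; glibc also honors extra categories such as LC_PAPER, which
this sketch leaves alone. The returned name depends on the
implementation, so none is asserted here.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: remove the POSIX locale variables from the
 * environment, ask for the environment's default locale with
 * setlocale(LC_ALL, ""), and return the name the implementation
 * chose. Returns NULL if setlocale() fails. */
static const char *default_locale_name(void)
{
    static const char *const vars[] = {
        "LC_ALL", "LC_COLLATE", "LC_CTYPE", "LC_MESSAGES",
        "LC_MONETARY", "LC_NUMERIC", "LC_TIME", "LANG"
    };
    size_t i;

    for (i = 0; i < sizeof vars / sizeof vars[0]; i++)
        unsetenv(vars[i]);

    if (setlocale(LC_ALL, "") == NULL)
        return NULL;

    /* A NULL locale argument queries the current setting. */
    return setlocale(LC_ALL, NULL);
}
```

Printing the result with puts(default_locale_name()) shows what a
stripped environment currently selects on the system at hand.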