This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: Should glibc provide a builtin C.UTF-8 locale?

From: Rich Felker <dalias at libc dot org>
To: Carlos O'Donell <carlos at redhat dot com>
Cc: Joseph Myers <joseph at codesourcery dot com>, Paul Eggert <eggert at cs dot ucla dot edu>, GNU C Library <libc-alpha at sourceware dot org>
Date: Thu, 12 Feb 2015 11:08:10 -0500
Subject: Re: Should glibc provide a builtin C.UTF-8 locale?
Authentication-results: sourceware.org; auth=none
References: <54DB8243 dot 3050903 at redhat dot com> <54DBFA8D dot 8030107 at cs dot ucla dot edu> <54DC0546 dot 3080102 at redhat dot com> <alpine dot DEB dot 2 dot 10 dot 1502121207010 dot 10529 at digraph dot polyomino dot org dot uk> <54DCC3A2 dot 9050105 at redhat dot com>

On Thu, Feb 12, 2015 at 10:15:46AM -0500, Carlos O'Donell wrote:
> On 02/12/2015 07:12 AM, Joseph Myers wrote:
> > On Wed, 11 Feb 2015, Carlos O'Donell wrote:
> > 
> >> When I say "like C" I mean that setting "C.UTF8" in LC_ALL would
> >> ignore LANGUAGE, as is required when setting LC_ALL to "C".
> > 
> > "Like C" could also mean that ASCII characters (and probably all 
> > characters) are collated in code-point order (so, for example, all 
> > uppercase ASCII letters come before all lowercase).  Or do you think the 
> > right way to achieve that minimal extension of the C locale to UTF-8 is to 
> > set only LC_CTYPE and not LC_COLLATE or LC_ALL?
> 
> That's an open question. I expect that your instinct is correct and that
> we should collate in code-point order. There should be some deterministic
> ordering such that low-level sorting utilities work reliably.

I agree. C.UTF-8 should behave just like the C locale except with
regard to character set/encoding/identity. This follows existing
practice in the definition of "C.UTF-8" locales (AFAIK, at least) and
existing practice for "[localename].UTF-8" meaning "same as
[localename] but with UTF-8 encoding". In particular, LC_COLLATE
should be codepoint order (which, for UTF-8, is the same as strcmp
order), and LC_TIME should probably use the C-locale date formatting
(ugly as it is) rather than trying to adopt some nice international
format, even though I'd rather see the latter.
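
To illustrate the claim that codepoint order and strcmp order coincide for
UTF-8, here is a minimal sketch. The helper names (decode_utf8,
codepoint_cmp, orders_agree) are hypothetical and not part of glibc, and the
decoder assumes well-formed UTF-8 input with no validation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode one code point from well-formed UTF-8 at s and store the number
   of bytes consumed in *len.  Assumption: input is valid UTF-8. */
static uint32_t decode_utf8(const unsigned char *s, size_t *len)
{
    if (s[0] < 0x80) { *len = 1; return s[0]; }
    if (s[0] < 0xE0) {
        *len = 2;
        return ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    }
    if (s[0] < 0xF0) {
        *len = 3;
        return ((uint32_t)(s[0] & 0x0F) << 12)
             | ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
    }
    *len = 4;
    return ((uint32_t)(s[0] & 0x07) << 18) | ((uint32_t)(s[1] & 0x3F) << 12)
         | ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
}

/* Compare two UTF-8 strings code point by code point: returns -1, 0 or 1. */
static int codepoint_cmp(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;
    assert(a != NULL && b != NULL);
    while (*p && *q) {
        size_t lp, lq;
        uint32_t cp = decode_utf8(p, &lp);
        uint32_t cq = decode_utf8(q, &lq);
        if (cp != cq)
            return cp < cq ? -1 : 1;
        p += lp;
        q += lq;
    }
    if (*p) return 1;   /* b is a proper prefix of a */
    if (*q) return -1;  /* a is a proper prefix of b */
    return 0;
}

static int sign(int x) { return (x > 0) - (x < 0); }

/* Returns 1 when bytewise strcmp and codepoint comparison order the
   two strings the same way. */
static int orders_agree(const char *a, const char *b)
{
    return sign(strcmp(a, b)) == codepoint_cmp(a, b);
}
```

Calling orders_agree("Z", "a") also shows the point Joseph raised: in
codepoint order, uppercase ASCII letters sort before lowercase ones.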

Also, by analogy with en_US.UTF-8, which has (almost) all Unicode
letter-class characters identified as alphabetic even though they're
not in the English alphabet, I think the character class functions for
C.UTF-8 should also cover the whole character set, not just the
characters mandated by the C locale.
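
A rough way to probe how a given locale classifies a character is sketched
below. The helper is_alpha_in_locale is hypothetical. It assumes that
wchar_t values are Unicode code points (true on glibc), and it reports -1
when the named locale is not installed, which may well be the case for
"C.UTF-8" on a given system.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Returns 1 if wc is alphabetic under the named LC_CTYPE locale, 0 if it
   is not, and -1 if the locale cannot be selected on this system.
   The previous LC_CTYPE setting is restored before returning. */
static int is_alpha_in_locale(const char *locale_name, wint_t wc)
{
    char saved[256];
    const char *cur = setlocale(LC_CTYPE, NULL);
    int result;

    assert(locale_name != NULL);
    /* Copy the current name: later setlocale calls may overwrite it. */
    if (cur == NULL || strlen(cur) >= sizeof saved)
        return -1;
    strcpy(saved, cur);

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    result = iswalpha(wc) != 0;
    setlocale(LC_CTYPE, saved);
    return result;
}
```

For example, is_alpha_in_locale("C.UTF-8", 0x00E9) asks whether U+00E9 is
alphabetic under C.UTF-8; the answer depends on how that locale is defined
on the system in question, which is exactly what is under discussion here.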

One more thing -- in the absence of ANY LC_*/LANG vars being set,
POSIX leaves the default locale for setlocale(x, "")
implementation-defined. Would it be justifiable to make C.UTF-8 the
default in this case instead of plain C, so that suppression of UTF-8
support never happens accidentally (e.g. when stripping the
environment for security), only by explicitly setting LC_ALL=C or
similar?
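
The stripped-environment case can be reproduced with a sketch like the
following. The helper default_codeset is hypothetical. It unsets only the
POSIX locale variables and then asks which codeset setlocale(LC_ALL, "")
selected; what it returns is implementation-defined, as noted above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>

/* Clear the POSIX locale environment variables, select the
   implementation's default locale, and return its codeset name.
   Returns NULL if setlocale fails.  Note: this modifies both the
   process environment and the process-wide locale. */
static const char *default_codeset(void)
{
    static const char *const vars[] = {
        "LC_ALL", "LC_COLLATE", "LC_CTYPE", "LC_MESSAGES",
        "LC_MONETARY", "LC_NUMERIC", "LC_TIME", "LANG"
    };
    size_t i;

    for (i = 0; i < sizeof vars / sizeof vars[0]; i++)
        unsetenv(vars[i]);

    if (setlocale(LC_ALL, "") == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}
```

With plain C as the default, the reported codeset would be an ASCII one;
with C.UTF-8 as the default, it would presumably be "UTF-8".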

Rich

Follow-Ups:
- Re: Should glibc provide a builtin C.UTF-8 locale?
  - From: Carlos O'Donell

References:
- Should glibc provide a builtin C.UTF-8 locale?
  - From: Carlos O'Donell
- Re: Should glibc provide a builtin C.UTF-8 locale?
  - From: Paul Eggert
- Re: Should glibc provide a builtin C.UTF-8 locale?
  - From: Carlos O'Donell
- Re: Should glibc provide a builtin C.UTF-8 locale?
  - From: Joseph Myers
- Re: Should glibc provide a builtin C.UTF-8 locale?
  - From: Carlos O'Donell
