This is the mail archive of the
mailing list for the glibc project.
Re: Should glibc provide a builtin C.UTF-8 locale?
- From: "Carlos O'Donell" <carlos at redhat dot com>
- To: Paul Eggert <eggert at cs dot ucla dot edu>, GNU C Library <libc-alpha at sourceware dot org>
- Date: Wed, 11 Feb 2015 20:43:34 -0500
- Subject: Re: Should glibc provide a builtin C.UTF-8 locale?
- References: <54DB8243 dot 3050903 at redhat dot com> <54DBFA8D dot 8030107 at cs dot ucla dot edu>
On 02/11/2015 07:57 PM, Paul Eggert wrote:
> Carlos O'Donell wrote:
>> Is anyone opposed to having glibc contain a builtin C.UTF-8
>> locale? This locale would have the same rules as the C locale when
>> set for LC_ALL.
> In reading followups it seems this point wasn't entirely clear. I
> took it to mean that "C.utf8" is like "C" except with UTF-8 encoding,
> so that (for example) there are only 26 alphabetic characters in
> "C.utf8". This should allow a compact implementation, which uses the
> same (small) character tables for both the C and the C.utf8 locales.
You raise a very interesting alternative solution.
However, it is not what I meant.
When I say "like C" I mean that setting LC_ALL to "C.UTF-8" would
ignore LANGUAGE, as is required when setting LC_ALL to "C".
I did not intend my comment to mean that it only supports ASCII
characters in a mostly empty UTF-8 charset. I had not considered
such an interpretation.
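
To make the LANGUAGE point concrete, here is a minimal sketch of the
existing behaviour for "C", which is what I would want "C.UTF-8" to
mirror. The helper name message_under_lc_all_c and the "de:fr" value
are made up for illustration. The sketch sets a LANGUAGE preference,
selects the "C" locale through LC_ALL, and fetches a libc message,
which should come back untranslated.

```c
#define _POSIX_C_SOURCE 200809L  /* for setenv() */
#include <errno.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: select the "C" locale via the environment while a
   LANGUAGE preference is present, then return the text of one libc message.
   Returns NULL if the environment or the locale could not be set up. */
static const char *message_under_lc_all_c(int errnum)
{
    /* A translation preference that would normally be honoured. */
    if (setenv("LANGUAGE", "de:fr", 1) != 0)
        return NULL;
    /* LC_ALL=C is expected to override it. */
    if (setenv("LC_ALL", "C", 1) != 0)
        return NULL;
    /* Pick up the locale from the environment just prepared. */
    if (setlocale(LC_ALL, "") == NULL)
        return NULL;
    return strerror(errnum);
}
```

Calling message_under_lc_all_c(ENOENT) should return the English
text even on a system with German or French catalogs installed.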
> Others, however, seem to be thinking that the new locale would use
> bigger tables that encompass all Unicode characters, so that there
> would be thousands of alphabetic characters. This also sounds
> useful, for applications that need to know whether a character is a
> Unicode letter regardless of language. Many applications, though,
> don't need this extra information, and would work well with the
> more-compact approach.
My argument is that we should provide applications with the bigger
tables as the default.
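
As a rough illustration of what the table size means to an
application, the sketch below asks whether a wide character is
alphabetic under a named locale. The helper is_alpha_in_locale is
hypothetical, and the sketch assumes nothing about whether "C.UTF-8"
exists on a given system, so it reports failure to select the locale
separately from the classification result.

```c
#include <locale.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: returns 1 if wc is alphabetic in the named locale,
   0 if it is not, and -1 if the locale could not be selected.
   Note that this changes the process-wide locale as a side effect. */
static int is_alpha_in_locale(const char *locale_name, wint_t wc)
{
    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;
    return iswalpha(wc) != 0;
}
```

With the bigger tables, is_alpha_in_locale("C.UTF-8", 0x00E9) would
be expected to report U+00E9 as alphabetic. Under the compact
interpretation only the 26 ASCII letters and their uppercase forms
would qualify.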
> This suggests that we add two locales: "C.utf8" could be a minimal
> locale that is as close as possible to the "C" locale while adding
> UTF-8, and "i18n.utf8" could be a bigger locale, basically the i18n
> locale of ISO/IEC TR 30112. The "C.utf8" locale could easily be
> built into glibc for performance, just as "C" is; the "i18n.utf8"
> locale could use tables compiled with localedef like all the other
The problem with that approach is that i18n.utf8 can be removed by
the system administrator and can't be expected to be present. Granted
this requires a priori knowledge about the target system, but it
However, we could do something about this, and make it harder to
*remove* the i18n.utf8 locale by modifying localedef, but it's still
not a hard guarantee.
We could also document that a glibc shipped without i18n.utf8 would
be an unsupported configuration that violates the intentions of the
community to provide this support.
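
Without a hard guarantee, an application has to cope with a locale
that is absent. A minimal sketch of that fallback follows. The
helper select_first_available and the preference order in
utf8_then_c are assumptions for illustration, not a proposed glibc
interface. It relies only on setlocale() returning NULL when a
locale cannot be selected.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: try each candidate locale name in order and return
   the name that setlocale() accepted, or NULL if none could be selected. */
static const char *select_first_available(const char *const *candidates,
                                          size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (setlocale(LC_ALL, candidates[i]) != NULL)
            return candidates[i];
    }
    return NULL;
}

/* Preference order is an assumption: the larger locale first, then the
   proposed builtin, then plain "C", which is always available. */
static const char *const utf8_then_c[] = { "i18n.utf8", "C.UTF-8", "C" };
```

A builtin locale would let callers drop the loop, since selecting it
could not fail.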