This is the mail archive of the
mailing list for the glibc project.
Re: Should glibc provide a builtin C.UTF-8 locale?
- From: keld at keldix dot com
- To: Carlos O'Donell <carlos at redhat dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Thu, 12 Feb 2015 00:53:04 +0100
- Subject: Re: Should glibc provide a builtin C.UTF-8 locale?
- Authentication-results: sourceware.org; auth=none
- References: <54DB8243 dot 3050903 at redhat dot com>
On Wed, Feb 11, 2015 at 11:24:35AM -0500, Carlos O'Donell wrote:
> Is anyone opposed to having glibc contain a builtin C.UTF-8 locale?
> This locale would have the same rules as the C locale when set for
> The locale would provide sensible fallback for developers that need
> UTF-8 but until C.UTF-8 was provided, could not rely upon it.
> My best guess is that it will take ~1.5MB of data to include the
> UTF-8 locale in the runtime. If you do it right this is shared
> for all processes, and gives you, in this the 21st century, a fallback
> that is sensible for all developers of all languages.
> We have had on-and-off requests for this for years as UTF-8 has become
> the de facto standard.
> The most recent request is from the Python 3 folks who want to be able
> to assume there is some kind of UTF-8 support in the system regardless
> of the installed locales.
> Is this the right way forward? Or should we tell the distributions
> that it is their responsibility to ship and always provide a C.UTF-8?
I think it is a good way forward. It should probably use the "i18n" locale
of ISO 30112 as its base; the "i18n" locale is built directly on glibc data.
A lot of optimisation could be done on the data with tables of two or more levels,
keeping explicit data only where it is not regular enough for algorithmic
handling; some case mappings, for example, are not suited to algorithmic handling.
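To make the case-mapping point concrete, here is a minimal sketch in C. It is not glibc code: the names case_exception, upper_exceptions and sketch_toupper are hypothetical, and only a few ranges and exceptions are covered. Regular ranges are handled by a fixed offset, and a small sorted table holds the mappings that do not fit the rule.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical exception entry: a code point whose uppercase form
   does not follow the fixed-offset rule used for the ranges below. */
struct case_exception {
    uint32_t from;
    uint32_t to;
};

/* Must stay sorted by 'from' so that binary search works. */
static const struct case_exception upper_exceptions[] = {
    { 0x00B5, 0x039C }, /* MICRO SIGN -> GREEK CAPITAL LETTER MU */
    { 0x00FF, 0x0178 }, /* y with diaeresis -> Y with diaeresis */
    { 0x0131, 0x0049 }, /* dotless i -> LATIN CAPITAL LETTER I */
};

/* Simple (one-to-one) uppercase mapping for a small part of Unicode.
   Code points outside the covered ranges are returned unchanged. */
uint32_t sketch_toupper(uint32_t cp)
{
    /* Step 1: look for an explicit exception. */
    size_t lo = 0;
    size_t hi = sizeof upper_exceptions / sizeof upper_exceptions[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (upper_exceptions[mid].from == cp)
            return upper_exceptions[mid].to;
        if (upper_exceptions[mid].from < cp)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* Step 2: regular ranges, where uppercase = lowercase - 0x20. */
    if (cp >= 0x0061 && cp <= 0x007A)                 /* a..z */
        return cp - 0x20;
    if (cp >= 0x00E0 && cp <= 0x00FE && cp != 0x00F7) /* Latin-1, not the division sign */
        return cp - 0x20;
    if (cp >= 0x0430 && cp <= 0x044F)                 /* basic Cyrillic */
        return cp - 0x20;

    return cp;
}
```

The exception table stays small because most of each script is regular; only the irregular entries cost storage.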
Some property tables are well suited to combined bitmap and index handling.
Collating tables could possibly also be optimised with multilevel tables.
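As an illustration of combined index and bitmap handling, here is a minimal sketch in C of a two-level property table. It is an assumption-laden example, not the layout glibc uses in its locale files: prop_table, cp_range, prop_build and prop_test are hypothetical names, the block size of 256 code points is arbitrary, and the limit of 64 distinct bitmaps is only there to keep the sketch small. The first level maps each block of code points to a bitmap number, and identical blocks share one bitmap.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BITS  8                            /* 256 code points per block */
#define BLOCK_SIZE  (1u << BLOCK_BITS)
#define WORDS       (BLOCK_SIZE / 32)            /* 32-bit words per bitmap */
#define MAX_CP      0x10FFFFu
#define NUM_BLOCKS  ((MAX_CP >> BLOCK_BITS) + 1)
#define MAX_UNIQUE  64                           /* sketch limit, an assumption */

struct cp_range { uint32_t first, last; };       /* inclusive range */

struct prop_table {
    uint16_t index[NUM_BLOCKS];                  /* level 1: block -> bitmap number */
    uint32_t bitmaps[MAX_UNIQUE][WORDS];         /* level 2: one bit per code point */
    unsigned n_bitmaps;
};

/* Build the table from a list of ranges that have the property.
   Returns 0 on success, -1 if more than MAX_UNIQUE bitmaps are needed. */
int prop_build(struct prop_table *t, const struct cp_range *r, size_t n)
{
    t->n_bitmaps = 0;
    for (uint32_t b = 0; b < NUM_BLOCKS; b++) {
        uint32_t bm[WORDS] = { 0 };
        uint32_t base = b << BLOCK_BITS;
        uint32_t top = base + BLOCK_SIZE - 1;

        /* Set a bit for every code point of every range that falls in this block. */
        for (size_t i = 0; i < n; i++) {
            uint32_t lo = r[i].first > base ? r[i].first : base;
            uint32_t hi = r[i].last < top ? r[i].last : top;
            for (uint32_t cp = lo; cp <= hi; cp++) {
                uint32_t off = cp - base;
                bm[off >> 5] |= 1u << (off & 31);
            }
        }

        /* Reuse an identical bitmap if one already exists. */
        unsigned k;
        for (k = 0; k < t->n_bitmaps; k++)
            if (memcmp(t->bitmaps[k], bm, sizeof bm) == 0)
                break;
        if (k == t->n_bitmaps) {
            if (k == MAX_UNIQUE)
                return -1;
            memcpy(t->bitmaps[k], bm, sizeof bm);
            t->n_bitmaps++;
        }
        t->index[b] = (uint16_t)k;
    }
    return 0;
}

/* Lookup: one index read, then one bit test. */
int prop_test(const struct prop_table *t, uint32_t cp)
{
    if (cp > MAX_CP)
        return 0;
    const uint32_t *bm = t->bitmaps[t->index[cp >> BLOCK_BITS]];
    uint32_t off = cp & (BLOCK_SIZE - 1);
    return (int)((bm[off >> 5] >> (off & 31)) & 1u);
}
```

With ranges such as A..Z, a..z and U+4E00..U+9FFF, the whole code space needs only three bitmaps in this sketch: one for the first block, one all-zero bitmap and one all-one bitmap, which is where the space saving comes from.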
Another pet idea of mine is to have compressed locales, which could significantly reduce
the disk footprint of a more complete locale database. That would also be good for message catalogues.