This is the mail archive of the
mailing list for the glibc project.
Re: Should glibc provide a builtin C.UTF-8 locale?
- From: keld at keldix dot com
- To: Rich Felker <dalias at libc dot org>
- Cc: Carlos O'Donell <carlos at redhat dot com>, GNU C Library <libc-alpha at sourceware dot org>
- Date: Thu, 12 Feb 2015 07:38:40 +0100
- Subject: Re: Should glibc provide a builtin C.UTF-8 locale?
- Authentication-results: sourceware.org; auth=none
- References: <54DB8243 dot 3050903 at redhat dot com> <20150211235304 dot GA20330 at www5 dot open-std dot org> <20150212023923 dot GP23507 at brightrain dot aerifal dot cx>
On Wed, Feb 11, 2015 at 09:39:23PM -0500, Rich Felker wrote:
> On Thu, Feb 12, 2015 at 12:53:04AM +0100, email@example.com wrote:
> > On Wed, Feb 11, 2015 at 11:24:35AM -0500, Carlos O'Donell wrote:
> > > Is anyone opposed to having glibc contain a builtin C.UTF-8 locale?
> > > This locale would have the same rules as the C locale when set for
> > > LC_ALL.
> > >
> > > The locale would provide a sensible fallback for developers who need
> > > UTF-8 but, until C.UTF-8 was provided, could not rely upon it.
> > >
> > > My best guess is that it will take ~1.5MB of data to include the
> > > UTF-8 locale in the runtime. If you do it right this is shared
> > > for all processes, and gives you, in this the 21st century, a fallback
> > > that is sensible for all developers of all languages.
> > >
> > > We have had on-and-off requests for this for years as UTF-8 has become
> > > the de facto standard.
> > >
> > > The most recent request is from the Python 3 folks who want to be able
> > > to assume there is some kind of UTF-8 support in the system regardless
> > > of the installed locales.
> > >
> > > Is this the right way forward? Or should we tell the distributions
> > > that it is their responsibility to ship and always provide a C.UTF-8?
> > I think it is a good way forward. It should probably be the "i18n" locale
> > of ISO 30112 that is the base; the "i18n" locale is built directly on glibc data.
> > A lot of optimisation could be done on the data with two-level or more tables,
> > giving special data where the data is not well-formed for algorithmic
> > handling, noting that some case mappings are not suited for algorithmic handling.
> > Some properties tables are well suited for combined bitmap handling and index handling.
> > Collating tables could possibly also be optimized by multilevel tables.
> > Also a pet idea of mine is to have compressed locales - that could significantly reduce
> > the disk footprint of a more complete locale database. Also good for message catalogues.
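The two-level table idea in the quoted paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not glibc's actual locale format: the names build_two_level and lookup, the 256-entry block size, and the use of Python's str.isupper() as a stand-in property are all hypothetical choices made for the example.

```python
BLOCK_BITS = 8                 # assumed block size: 256 code points per block
BLOCK_SIZE = 1 << BLOCK_BITS


def build_two_level(flags):
    """Split a flat per-code-point list into (index, blocks).

    Identical blocks are stored once; `index` maps each block number
    to the position of its (possibly shared) block in `blocks`.
    """
    index = []
    blocks = []
    seen = {}
    for start in range(0, len(flags), BLOCK_SIZE):
        block = tuple(flags[start:start + BLOCK_SIZE])
        if block not in seen:
            seen[block] = len(blocks)
            blocks.append(block)
        index.append(seen[block])
    return index, blocks


def lookup(index, blocks, cp):
    """Return the property value for code point `cp` via two table reads."""
    return blocks[index[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1)]


# Stand-in property data for the Basic Multilingual Plane (an assumption;
# real locale data would come from the locale definition files).
flat = [chr(cp).isupper() for cp in range(0x10000)]
index, blocks = build_two_level(flat)
```

Large uniform ranges (for example, blocks where no code point is uppercase) collapse to a single shared block, which is where the space saving comes from; a lookup costs two indexed reads instead of one.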
> This sounds like a bad tradeoff unless you can use the compressed data
> efficiently in-place. Disk space is cheap; requiring a decompressed
> copy in memory per-process rather than using a shared mapping is
Hmm, are you referring to a statically linked version in glibc when you talk about
a shared mapping?
I do not see the big difference between loading an uncompressed locale and loading
a compressed locale into memory; it may even be faster to read the compressed data
and uncompress it. Or what?
Message catalogues may be huge, especially if you want to carry them all.
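
One way to picture the two loading strategies being compared is the sketch below. It is only an illustration under assumptions: the file names locale.raw and locale.z, the helper names, and the use of zlib are hypothetical, and this is not how glibc loads locales. A read-only file mapping is typically backed by the page cache, so several processes mapping the same file can share the same physical pages, whereas decompressing produces a fresh buffer private to each process.

```python
import mmap
import os
import tempfile
import zlib


def write_locale_files(payload, directory):
    """Write the same payload uncompressed and zlib-compressed (hypothetical names)."""
    raw_path = os.path.join(directory, "locale.raw")
    z_path = os.path.join(directory, "locale.z")
    with open(raw_path, "wb") as f:
        f.write(payload)
    with open(z_path, "wb") as f:
        f.write(zlib.compress(payload))
    return raw_path, z_path


def load_mapped(path):
    """Map the uncompressed file read-only; the data is used in place."""
    with open(path, "rb") as f:
        # The mapping stays valid after the file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def load_compressed(path):
    """Read and decompress; the result is a new buffer owned by this process."""
    with open(path, "rb") as f:
        return zlib.decompress(f.read())
```

Both loaders yield the same bytes; the difference is in where those bytes live. Whether the smaller read of the compressed file outweighs the per-process copy depends on the workload and is not something this sketch measures.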