This is the mail archive of the libc-alpha at sourceware dot org
mailing list for the glibc project.
Re: Should glibc provide a builtin C.UTF-8 locale?
- From: Rich Felker <dalias at libc dot org>
- To: Carlos O'Donell <carlos at redhat dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Wed, 11 Feb 2015 14:40:35 -0500
- Subject: Re: Should glibc provide a builtin C.UTF-8 locale?
- References: <54DB8243 dot 3050903 at redhat dot com> <20150211173024 dot GZ23507 at brightrain dot aerifal dot cx> <54DB93E2 dot 8000106 at redhat dot com>
On Wed, Feb 11, 2015 at 12:39:46PM -0500, Carlos O'Donell wrote:
> On 02/11/2015 12:30 PM, Rich Felker wrote:
> > On Wed, Feb 11, 2015 at 11:24:35AM -0500, Carlos O'Donell wrote:
> >> Is anyone opposed to having glibc contain a builtin C.UTF-8 locale?
> >> This locale would have the same rules as the C locale when set for
> >> LC_ALL.
> >> The locale would provide a sensible fallback for developers who need
> >> UTF-8 but, until C.UTF-8 is provided, cannot rely upon it.
> >> My best guess is that it will take ~1.5MB of data to include the
> >> UTF-8 locale in the runtime. If you do it right this is shared
> >> by all processes, and gives you, in this the 21st century, a fallback
> >> that is sensible for all developers of all languages.
> >> We have had on-and-off requests for this for years as UTF-8 has become
> >> the de facto standard.
> >> The most recent request is from the Python 3 folks who want to be able
> >> to assume there is some kind of UTF-8 support in the system regardless
> >> of the installed locales.
> >> Is this the right way forward? Or should we tell the distributions
> >> that it is their responsibility to ship and always provide a C.UTF-8?
> >> Comments?
> > I'm highly in favor of this, but I wonder why it requires so much
> > data. Am I correct in assuming that's for case mappings and character
> > classes? How would static linking be affected? It's possible to
> > represent this data in a much smaller size -- it's about 8k in musl --
> > but doing so requires significantly different data structures from
> > what glibc uses, and the case-mapping is significantly slower than
> > what some users would like/expect. But perhaps there's some middle
> > ground and a way glibc could represent its C.UTF-8 locale without the
> > full weight you're looking at.
> It seems like two projects. First add the locale. Then optimize it.
> It would make statically linked executables larger, but there would be
> little performance difference at runtime except that the DSO would
> take up more disk space and thus take slightly longer to load at startup.
> This is all just rough back-of-the-envelope.
If you're looking at adding 1.5 MB of data to static executables, that
increases the typical size of a small static-linked utility from ~600k
to ~2MB. This could be extreme for people using glibc in embedded
environments. I suspect static glibc-linked busybox would increase
from ~1MB to ~2.5MB.
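As a quick check of the estimates above, the arithmetic can be sketched as follows. This is only an illustration: the helper name grown is hypothetical, sizes are treated as decimal kilobytes, and the 1500 kB figure is the rough ~1.5MB guess from the quoted message, not a measured value.

```python
# Back-of-the-envelope check of the static-binary size estimates.
# Assumption: sizes in decimal kB; 1500 kB is the rough locale-data guess.

def grown(size_kb, added_kb=1500):
    """Return (new size in kB, growth factor) after adding locale data."""
    new_size = size_kb + added_kb
    return new_size, new_size / size_kb

for label, size_kb in (("small static utility", 600), ("static busybox", 1000)):
    new_size, factor = grown(size_kb)
    print(f"{label}: {size_kb} kB -> {new_size} kB ({factor:.1f}x)")
```

Under these assumptions the small utility grows by about 3.5x and busybox by about 2.5x.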
I'm all for this project going forward, but what I don't want to see
is backlash against UTF-8 (and universal support for it, which I'm
excited to see glibc pushing) because "It made glibc 4x more bloated!"
This happened in the past (albeit wrt speed rather than size)
when GNU grep got 100x slower in UTF-8 locales.
If you do go with the "two projects" approach, perhaps you could aim
to have them both take place in the same release cycle, or else to
have "built-in C.UTF-8 locale" be an optional feature until it's
optimized in a subsequent release.
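
For illustration, the application-side fallback that the quoted Python 3 request is about can be sketched roughly as below. This is a minimal sketch and not code from glibc or CPython: the helper name pick_utf8_locale and the candidate list are hypothetical, and it assumes a Unix-like system where Python's locale.nl_langinfo is available.

```python
import locale

def pick_utf8_locale(candidates=("C.UTF-8", "en_US.UTF-8")):
    """Try each candidate for LC_CTYPE; return the first one that the C
    library accepts and that reports a UTF-8 codeset, else None."""
    # Remember the current setting so it can be restored on failure.
    previous = locale.setlocale(locale.LC_CTYPE)
    for name in candidates:
        try:
            locale.setlocale(locale.LC_CTYPE, name)
        except locale.Error:
            continue  # locale not installed or not built in; try the next
        codeset = locale.nl_langinfo(locale.CODESET)
        if codeset.upper().replace("-", "") == "UTF8":
            return name
    # No candidate worked: put LC_CTYPE back the way it was.
    locale.setlocale(locale.LC_CTYPE, previous)
    return None

if __name__ == "__main__":
    chosen = pick_utf8_locale()
    print("UTF-8 locale in use:", chosen if chosen else "none available")
```

Without a guaranteed C.UTF-8, callers need a candidate list like the one above and must still handle the "none available" case; a built-in C.UTF-8 would be expected to make the first candidate succeed regardless of which locales are installed.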