This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Should glibc provide a builtin C.UTF-8 locale?
- From: Rich Felker <dalias at libc dot org>
- To: Carlos O'Donell <carlos at redhat dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Wed, 11 Feb 2015 12:30:24 -0500
- Subject: Re: Should glibc provide a builtin C.UTF-8 locale?
- References: <54DB8243 dot 3050903 at redhat dot com>
On Wed, Feb 11, 2015 at 11:24:35AM -0500, Carlos O'Donell wrote:
> Is anyone opposed to having glibc contain a builtin C.UTF-8 locale?
> This locale would have the same rules as the C locale when set for
> LC_ALL.
>
> The locale would provide a sensible fallback for developers who need
> UTF-8 but, until C.UTF-8 is provided, cannot rely upon it.
>
> My best guess is that it will take ~1.5MB of data to include the
> UTF-8 locale in the runtime. If you do it right this is shared
> for all processes, and gives you, in this the 21st century, a fallback
> that is sensible for all developers of all languages.
>
> We have had on-and-off requests for this for years as UTF-8 has become
> the de facto standard.
>
> The most recent request is from the Python 3 folks who want to be able
> to assume there is some kind of UTF-8 support in the system regardless
> of the installed locales.
>
> Is this the right way forward? Or should we tell the distributions
> that it is their responsibility to ship and always provide a C.UTF-8?
>
> Comments?
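For context, the application-side fallback described in the quoted
message might look roughly like the sketch below. It is only an
illustration: the helper name select_utf8_locale and the candidate list
(including "C.utf8" and "en_US.UTF-8") are assumptions made for the
example, not anything glibc or Python prescribes. Only setlocale() itself
is a standard call.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: try UTF-8-capable locale names in order and
   return the first one setlocale() accepts, or NULL if none is
   installed.  The candidate list is an assumption for illustration. */
const char *
select_utf8_locale (void)
{
  static const char *const candidates[] = {
    "C.UTF-8", "C.utf8", "en_US.UTF-8"
  };
  size_t n = sizeof candidates / sizeof candidates[0];

  for (size_t i = 0; i < n; i++)
    if (setlocale (LC_ALL, candidates[i]) != NULL)
      return candidates[i];

  /* Nothing UTF-8 is available, so stay in the plain "C" locale. */
  setlocale (LC_ALL, "C");
  return NULL;
}
```

With a builtin C.UTF-8, the first candidate would always succeed and the
rest of the list would become unnecessary.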
I'm highly in favor of this, but I wonder why it requires so much
data. Am I correct in assuming that's for case mappings and character
classes? How would static linking be affected? It's possible to
represent this data in a much smaller size (it's about 8k in musl),
but doing so requires significantly different data structures from
what glibc uses, and the case-mapping is significantly slower than
what some users would like/expect. But perhaps there's some middle
ground and a way glibc could represent its C.UTF-8 locale without the
full weight you're looking at.
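
To illustrate the size/speed trade-off, here is a minimal sketch of a
range-based case-mapping table searched with binary search. This is not
musl's or glibc's actual representation; the struct case_range type, the
sketch_toupper name, and the handful of ranges shown are assumptions
chosen for the example and cover only a few scripts. Each range costs 12
bytes, but every lookup pays for a search instead of a single indexed
load from a large flat table.

```c
#include <stddef.h>
#include <stdint.h>

/* One run of consecutive lowercase code points that all map to
   uppercase by adding the same delta.  Illustrative subset only. */
struct case_range { uint32_t first, last; int32_t delta; };

/* Must stay sorted by 'first' and non-overlapping for the search. */
static const struct case_range to_upper_ranges[] = {
  { 0x0061, 0x007A, -32 },  /* ASCII a-z */
  { 0x00E0, 0x00F6, -32 },  /* Latin-1, up to but excluding U+00F7 */
  { 0x00F8, 0x00FE, -32 },  /* Latin-1, after the division sign */
  { 0x03B1, 0x03C1, -32 },  /* Greek alpha-rho */
  { 0x03C3, 0x03CB, -32 },  /* Greek sigma onward, final sigma skipped */
  { 0x0430, 0x044F, -32 },  /* Cyrillic basic lowercase */
  { 0x0450, 0x045F, -80 },  /* Cyrillic extensions */
};

/* Binary-search the ranges; code points outside every range are
   returned unchanged. */
uint32_t
sketch_toupper (uint32_t c)
{
  size_t lo = 0;
  size_t hi = sizeof to_upper_ranges / sizeof to_upper_ranges[0];

  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;
      if (c < to_upper_ranges[mid].first)
        hi = mid;
      else if (c > to_upper_ranges[mid].last)
        lo = mid + 1;
      else
        return (uint32_t) ((int32_t) c + to_upper_ranges[mid].delta);
    }
  return c;
}
```

A flat per-code-point table answers the same query in constant time,
which is roughly the speed side of the trade-off described above.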
Rich