This is the mail archive of the
mailing list for the glibc project.
Re: Should glibc provide a builtin C.UTF-8 locale?
- From: "Carlos O'Donell" <carlos at redhat dot com>
- To: Rich Felker <dalias at libc dot org>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Wed, 11 Feb 2015 15:27:10 -0500
- Subject: Re: Should glibc provide a builtin C.UTF-8 locale?
- Authentication-results: sourceware.org; auth=none
- References: <54DB8243 dot 3050903 at redhat dot com> <20150211173024 dot GZ23507 at brightrain dot aerifal dot cx> <54DB93E2 dot 8000106 at redhat dot com> <20150211194035 dot GH23507 at brightrain dot aerifal dot cx>
On 02/11/2015 02:40 PM, Rich Felker wrote:
> On Wed, Feb 11, 2015 at 12:39:46PM -0500, Carlos O'Donell wrote:
>> On 02/11/2015 12:30 PM, Rich Felker wrote:
>>> On Wed, Feb 11, 2015 at 11:24:35AM -0500, Carlos O'Donell wrote:
>>>> Is anyone opposed to having glibc contain a builtin C.UTF-8 locale?
>>>> This locale would have the same rules as the C locale when set for
>>>> The locale would provide sensible fallback for developers that need
>>>> UTF-8 but until C.UTF-8 was provided, could not rely upon it.
>>>> My best guess is that it will take ~1.5MB of data to include the
>>>> UTF-8 locale in the runtime. If you do it right this is shared
>>>> for all processes, and gives you, in this the 21st century, a fallback
>>>> that is sensible for all developers of all languages.
>>>> We have had on-and-off requests for this for years as UTF-8 has become
>>>> the de facto standard.
>>>> The most recent request is from the Python 3 folks who want to be able
>>>> to assume there is some kind of UTF-8 support in the system regardless
>>>> of the installed locales.
>>>> Is this the right way forward? Or should we tell the distributions
>>>> that it is their responsibility to ship and always provide a C.UTF-8?
>>> I'm highly in favor of this, but I wonder why it requires so much
>>> data. Am I correct in assuming that's for case mappings and character
>>> classes? How would static linking be affected? It's possible to
>>> represent this data in a much smaller size -- it's about 8k in musl --
>>> but doing so requires significantly different data structures from
>>> what glibc uses, and the case-mapping is significantly slower than
>>> what some users would like/expect. But perhaps there's some middle
>>> ground and a way glibc could represent its C.UTF-8 locale without the
>>> full weight you're looking at.
>> It seems like two projects. First add the locale. Then optimize it.
>> It would make statically linked executables larger, but there would be
>> little performance difference at runtime except that the DSO would
>> take up more disk space and thus take slightly longer to load at startup.
>> This is all just rough back-of-the-envelope.
> If you're looking at adding 1.5 MB of data to static executables, that
> increases the typical size of a small static-linked utility from ~600k
> to ~2MB. This could be extreme for people using glibc in embedded
> environments. I suspect static glibc-linked busybox would increase
> from ~1MB to ~2.5MB.
> I'm all for this project going forward, but what I don't want to see
> is backlash against UTF-8 (and universal support for it, which I'm
> excited to see glibc pushing) because "It made glibc 4x more bloated!"
> This happened in the past (albeit wrt speed rather than size)
> when GNU grep got 100x slower in UTF-8 locales.
> If you do go with the "two projects" approach, perhaps you could aim
> to have them both take place in the same release cycle, or else to
> have "built-in C.UTF-8 locale" be an optional feature until it's
> optimized in a subsequent release.
No, you raise a very good point. I'll make sure they go forward as one
project with an analysis phase that requires looking at how to reduce
the table sizes.
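
For context on the "could not rely upon it" problem raised at the top of
the thread, here is a minimal sketch of the probing an application has to
do today when it cannot assume C.UTF-8 is installed. The helper name
pick_utf8_locale and the en_US.UTF-8 second choice are assumptions made
for illustration only; they are not part of glibc or of any proposal in
this thread.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: try each candidate in order and return the first
   locale name that setlocale() accepts for LC_CTYPE, or NULL if none of
   them is installed.  The candidate list is an assumption.  */
static const char *
pick_utf8_locale (void)
{
  static const char *const candidates[] = { "C.UTF-8", "en_US.UTF-8" };
  size_t i;

  for (i = 0; i < sizeof candidates / sizeof candidates[0]; i++)
    if (setlocale (LC_CTYPE, candidates[i]) != NULL)
      return candidates[i];

  /* Nothing matched; LC_CTYPE is left as it was before the call.  */
  return NULL;
}
```

A caller would check the return value and stay in the plain C locale when
it is NULL. With a builtin C.UTF-8 the first candidate would always
succeed, so the loop and the fallback path would become unnecessary.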