This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Should glibc provide a builtin C.UTF-8 locale?
- From: "Carlos O'Donell" <carlos at redhat dot com>
- To: Rich Felker <dalias at libc dot org>
- Cc: GNU C Library <libc-alpha at sourceware dot org>
- Date: Wed, 11 Feb 2015 12:39:46 -0500
- References: <54DB8243 dot 3050903 at redhat dot com> <20150211173024 dot GZ23507 at brightrain dot aerifal dot cx>
On 02/11/2015 12:30 PM, Rich Felker wrote:
> On Wed, Feb 11, 2015 at 11:24:35AM -0500, Carlos O'Donell wrote:
>> Is anyone opposed to having glibc contain a builtin C.UTF-8 locale?
>> This locale would have the same rules as the C locale when set for
>> LC_ALL.
>>
>> The locale would provide a sensible fallback for developers who need
>> UTF-8 but, until C.UTF-8 is provided, cannot rely upon it.
>>
>> My best guess is that it will take ~1.5MB of data to include the
>> UTF-8 locale in the runtime. If you do it right this is shared
>> for all processes, and gives you, in this the 21st century, a fallback
>> that is sensible for all developers of all languages.
>>
>> We have had on-and-off requests for this for years as UTF-8 has become
>> the de facto standard.
>>
>> The most recent request is from the Python 3 folks who want to be able
>> to assume there is some kind of UTF-8 support in the system regardless
>> of the installed locales.
>>
>> Is this the right way forward? Or should we tell the distributions
>> that it is their responsibility to ship and always provide a C.UTF-8?
>>
>> Comments?
>
> I'm highly in favor of this, but I wonder why it requires so much
> data. Am I correct in assuming that's for case mappings and character
> classes? How would static linking be affected? It's possible to
> represent this data in a much smaller size -- it's about 8k in musl --
> but doing so requires significantly different data structures from
> what glibc uses, and the case-mapping is significantly slower than
> what some users would like/expect. But perhaps there's some middle
> ground and a way glibc could represent its C.UTF-8 locale without the
> full weight you're looking at.
It seems like two projects. First add the locale. Then optimize it.

It would make statically linked executables larger, but there would be
little performance difference at runtime except that the DSO would
take up more disk space and thus take slightly longer to load at startup.

This is all just a rough back-of-the-envelope estimate.

Cheers,
Carlos.
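
For illustration only, here is a minimal sketch of the situation the
thread describes: a program that wants a UTF-8 locale but cannot assume
one is installed. It is not code from either poster or from glibc. The
helper name select_utf8_ctype and the candidate list are hypothetical,
and which names are accepted depends entirely on the system's installed
locales and its C library.

```python
import locale

# Candidate locale names are assumptions for illustration; whether any
# of them exists depends on the system.
CANDIDATES = ("C.UTF-8", "en_US.UTF-8")


def select_utf8_ctype(candidates=CANDIDATES):
    """Try each name for LC_CTYPE; return the first one accepted, else None."""
    for name in candidates:
        try:
            locale.setlocale(locale.LC_CTYPE, name)
        except locale.Error:
            continue  # this locale is not available here, try the next one
        return name
    # Nothing was accepted: fall back to the plain "C" locale,
    # which is always available.
    locale.setlocale(locale.LC_CTYPE, "C")
    return None


if __name__ == "__main__":
    chosen = select_utf8_ctype()
    print("LC_CTYPE locale in use:", chosen or "C (no UTF-8 locale found)")
```

With a builtin C.UTF-8, the first candidate would always succeed and the
fallback chain would be unnecessary, which is the point of the proposal.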