This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: Should glibc provide a builtin C.UTF-8 locale?


On Thu, Feb 12, 2015 at 12:53:04AM +0100, keld@keldix.com wrote:
> On Wed, Feb 11, 2015 at 11:24:35AM -0500, Carlos O'Donell wrote:
> > Is anyone opposed to having glibc contain a builtin C.UTF-8 locale?
> > This locale would have the same rules as the C locale when set for
> > LC_ALL.
> > 
> > The locale would provide a sensible fallback for developers who need
> > UTF-8 but, until C.UTF-8 is provided, cannot rely upon it.
> > 
> > My best guess is that it will take ~1.5MB of data to include the
> > UTF-8 locale in the runtime. If you do it right this is shared
> > for all processes, and gives you, in this the 21st century, a fallback
> > that is sensible for all developers of all languages.
> > 
> > We have had on-and-off requests for this for years as UTF-8 has become
> > the de facto standard.
> > 
> > The most recent request is from the Python 3 folks who want to be able
> > to assume there is some kind of UTF-8 support in the system regardless
> > of the installed locales.
> > 
> > Is this the right way forward? Or should we tell the distributions
> > that it is their responsibility to ship and always provide a C.UTF-8?
> 
> I think it is a good way forward. It should probably be the "i18n" locale
> of ISO 30112 that is the base; the "i18n" locale is built directly on glibc data.
> 
> A lot of optimisation could be done on the data with two-level or more tables,
> giving special data where the data is not well-formed for algorithmic
> handling, noting that some case mappings are not suited for algorithmic handling.
> Some properties tables are well suited for combined bitmap handling and index handling.
> Collating tables could possibly also be optimized by multilevel tables.
> 
> Also a pet idea of mine is to have compressed locales - that could significantly reduce
> the disk footprint of a more complete locale database. Also good for message catalogues.

This sounds like a bad tradeoff unless you can use the compressed data
efficiently in-place. Disk space is cheap; requiring a decompressed
copy in memory per-process rather than using a shared mapping is
expensive.
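
To make the shared-mapping point concrete, here is a minimal sketch of
using a data file in place, assuming a POSIX system. The helper names
map_readonly and unmap_readonly are hypothetical and are not glibc
interfaces; this is not how glibc's locale loader is written, only an
illustration of the idea. Pages of a read-only file mapping are backed
by the page cache, so every process mapping the same file can share
them. A compressed file would instead have to be inflated into private
memory in each process before it could be used.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a data file read-only so it can be used in
   place. Returns NULL on failure; on success stores the mapped size in
   *len_out and returns a pointer to the file contents. */
const void *map_readonly(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    /* Read-only file mapping: no per-process copy of the data is made;
       the kernel shares the underlying pages between processes. */
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return p;
}

/* Hypothetical helper: release a mapping obtained from map_readonly. */
void unmap_readonly(const void *p, size_t len)
{
    munmap((void *)p, len);
}
```

A caller would read tables directly through the returned pointer and
call unmap_readonly when done. Nothing is copied unless the caller
copies it.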

Rich

