Bug 17318 - [RFE] Provide a C.UTF-8 locale by default
Status: NEW
Alias: None
Product: glibc
Classification: Unclassified
Component: locale
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Depends on: 21302
Blocks: 16621
Reported: 2014-08-27 12:57 UTC by Nick Coghlan
Modified: 2018-06-13 08:02 UTC
CC List: 14 users

fweimer: security-


Description Nick Coghlan 2014-08-27 12:57:32 UTC
Fedora doesn't currently provide the C.UTF-8 locale. In the RFE requesting it (https://bugzilla.redhat.com/show_bug.cgi?id=902094), it was suggested that a more appropriate approach would be for it to be provided as part of upstream glibc, at which point Fedora would inherit it by default.

Hence, this RFE to request the inclusion of a C.UTF-8 locale by default.

My personal interest relates to Python 3, where "LANG=C" misconfigures a few aspects to use ASCII, when they really should be using UTF-8. While I'd actually like to fix that on the Python side in the long run, being able to set "LANG=C.UTF-8" instead is a solution that already works for existing versions of Python 3.
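To make the difference between the two settings concrete, here is a minimal sketch that asks the C library which character encoding each locale reports. The helper name `codeset_for` is made up for illustration, and the sketch assumes a POSIX system where Python's `locale.nl_langinfo` is available. It only reports what the installed C library says; it does not show how any particular Python version reacts to `LANG=C` at startup.

```python
import locale


def codeset_for(locale_name):
    """Return the encoding LC_CTYPE reports under locale_name.

    Returns None if the C library does not provide that locale.
    (Hypothetical helper, not part of glibc or Python.)
    """
    # Calling setlocale() with only a category queries the current setting.
    saved = locale.setlocale(locale.LC_CTYPE)
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
    except locale.Error:
        # The locale is not installed; nothing was changed.
        return None
    try:
        return locale.nl_langinfo(locale.CODESET)
    finally:
        # Put the process back into the locale it started with.
        locale.setlocale(locale.LC_CTYPE, saved)


if __name__ == "__main__":
    for name in ("C", "C.UTF-8"):
        print(name, "->", codeset_for(name))
```

On a system without C.UTF-8 the second line should print None, which is the situation this RFE is about. The exact name reported for the plain C locale varies between C libraries.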

Bug #16621 suggests that C.UTF-8 may actually require special casing in glibc in order to be handled correctly. If that's accurate, then it would strengthen the case for including the locale in the upstream library.
Comment 1 Carlos O'Donell 2015-02-11 15:39:27 UTC
(In reply to Nick Coghlan from comment #0)
> Fedora doesn't currently provide the C.UTF-8 locale. In the RFE requesting
> it (https://bugzilla.redhat.com/show_bug.cgi?id=902094), it was suggested
> that a more appropriate approach would be for it to be provided as part of upstream
> glibc, at which point Fedora would inherit it by default.
> 
> Hence, this RFE to request the inclusion of a C.UTF-8 locale by default.
> 
> My personal interest relates to Python 3, where "LANG=C" misconfigures a few
> aspects to use ASCII, when they really should be using UTF-8. While I'd
> actually like to fix that on the Python side in the long run, being able to
> set "LANG=C.UTF-8" instead is a solution that already works for existing
> versions of Python 3.
> 
> Bug #16621 suggests that C.UTF-8 may actually require special casing in
> glibc in order to be handled correctly. If that's accurate, then it would
> strengthen the case for including the locale in the upstream library.

I agree that this is a good idea. Someone needs to do the work and submit it to libc-alpha. It's not all that easy, and consensus needs to be reached about the inclusion of ~1.5MB of UTF-8 data into the runtime.
Comment 2 Nick Coghlan 2015-02-25 23:02:06 UTC
Reference to the libc-alpha mailing list discussion with additional technical details: https://sourceware.org/ml/libc-alpha/2015-02/msg00247.html
Comment 3 Filipe Brandenburger 2018-03-03 23:26:12 UTC
Just wanted to point out that Fedora has included C.UTF-8 since circa 2015...

Patch used by them is here (and, in fact, seems to come from a Red Hat employee who contributes often to glibc):

https://src.fedoraproject.org/rpms/glibc/blob/0457f649e3fe6299efe384da13dfc923bbe65707/f/glibc-c-utf8-locale.patch

The discussion in the e-mail threads was somewhat about *optimizing* C.UTF-8 so that it takes less space... While I think that's great (and very advisable!) I think it's a separate step from starting to *ship* C.UTF-8 by default.

So... ship first, optimize later?

At this point, most major distros seem to be shipping it anyways, so why not include it upstream so that at some point in the near future we know we can count on it on all distros?
Comment 4 Carlos O'Donell 2018-03-05 15:56:34 UTC
(In reply to Filipe Brandenburger from comment #3)
> So... ship first, optimize later?
> 
> At this point, most major distros seem to be shipping it anyways, so why not
> include it upstream so that at some point in the near future we know we can
> count on it on all distros?

The C.UTF-8 that the major distros ship does not work for the purposes upstream requires. It fails the code-point sorting order requirement, and it's not clear why. This is what I'm trying to fix right now.
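
For anyone who wants to probe the sorting behaviour on their own system, the following is a rough sketch of one way to compare a locale's collation with plain code-point order. The helper name `collates_by_code_point` and the sample strings are made up for illustration. The sketch assumes a platform with a 32-bit `wchar_t` (such as Linux), and it is not the test upstream uses. A result of False only says that the orders differ for these samples, not why.

```python
import locale


def collates_by_code_point(locale_name, samples):
    """Compare LC_COLLATE ordering under locale_name with code-point order.

    Returns True if both orders agree on samples, False if they differ,
    and None if the locale is not available.
    (Hypothetical helper, not part of glibc or Python.)
    """
    saved = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # strxfrm() yields keys that sort the way the locale collates.
        by_locale = sorted(samples, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)
    # Python compares str objects by code point, so sorted() is the reference.
    return by_locale == sorted(samples)


if __name__ == "__main__":
    words = ["z", "a", "Z", "\u00e9", "\u20ac", "\U0001F600"]
    for name in ("C", "C.UTF-8"):
        print(name, "->", collates_by_code_point(name, words))
```

Restoring the saved locale in the `finally` block keeps the check from leaking a changed LC_COLLATE into the rest of the process.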