Bug 17318 - [RFE] Provide a C.UTF-8 locale by default
Summary: [RFE] Provide a C.UTF-8 locale by default
Status: NEW
Alias: None
Product: glibc
Classification: Unclassified
Component: locale
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Depends on: 21302
Blocks: 16621
Reported: 2014-08-27 12:57 UTC by Nick Coghlan
Modified: 2019-11-12 14:20 UTC
CC List: 17 users

fweimer: security-


Description Nick Coghlan 2014-08-27 12:57:32 UTC
Fedora doesn't currently provide the C.UTF-8 locale. In the RFE requesting it (https://bugzilla.redhat.com/show_bug.cgi?id=902094), it was suggested that a more appropriate approach would be to provide it as part of upstream glibc, at which point Fedora would inherit it by default.

Hence, this RFE to request the inclusion of a C.UTF-8 locale by default.

My personal interest relates to Python 3, where "LANG=C" misconfigures a few aspects to use ASCII, when they really should be using UTF-8. While I'd actually like to fix that on the Python side in the long run, being able to set "LANG=C.UTF-8" instead is a solution that already works for existing versions of Python 3.

Bug #16621 suggests that C.UTF-8 may actually require special casing in glibc in order to be handled correctly. If that's accurate, then it would strengthen the case for including the locale in the upstream library.
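The Python 3 misconfiguration described above can be observed directly. A minimal probe, assuming a glibc-based system (this is an editorial illustration, not part of the original report):

```python
import locale

# Switch to the plain "C" locale and ask the C library which codeset it implies.
# Python 3 derives its default I/O encoding from this.
locale.setlocale(locale.LC_ALL, "C")
codeset = locale.nl_langinfo(locale.CODESET)
print(codeset)  # on glibc: "ANSI_X3.4-1968", i.e. ASCII
```

With LANG=C.UTF-8 (on systems where that locale exists), the reported codeset is UTF-8 instead, which is exactly the behavior the RFE wants to be able to rely on everywhere.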
Comment 1 Carlos O'Donell 2015-02-11 15:39:27 UTC
(In reply to Nick Coghlan from comment #0)
> Fedora doesn't currently provide the C.UTF-8 locale. In the RFE requesting
> it (https://bugzilla.redhat.com/show_bug.cgi?id=902094), it was suggested
> that a more appropriate approach would be to provide it as part of upstream
> glibc, at which point Fedora would inherit it by default.
> 
> Hence, this RFE to request the inclusion of a C.UTF-8 locale by default.
> 
> My personal interest relates to Python 3, where "LANG=C" misconfigures a few
> aspects to use ASCII, when they really should be using UTF-8. While I'd
> actually like to fix that on the Python side in the long run, being able to
> set "LANG=C.UTF-8" instead is a solution that already works for existing
> versions of Python 3.
> 
> Bug #16621 suggests that C.UTF-8 may actually require special casing in
> glibc in order to be handled correctly. If that's accurate, then it would
> strengthen the case for including the locale in the upstream library.

I agree that this is a good idea. Someone needs to do the work and submit it to libc-alpha. It's not all that easy, and consensus needs to be reached about the inclusion of ~1.5MB of UTF-8 data into the runtime.
Comment 2 Nick Coghlan 2015-02-25 23:02:06 UTC
Reference to the libc-alpha mailing list discussion with additional technical details: https://sourceware.org/ml/libc-alpha/2015-02/msg00247.html
Comment 3 Filipe Brandenburger 2018-03-03 23:26:12 UTC
Just wanted to point out that Fedora has included C.UTF-8 since circa 2015...

The patch they use is here (and, in fact, it seems to come from a Red Hat employee who contributes often to glibc):

https://src.fedoraproject.org/rpms/glibc/blob/0457f649e3fe6299efe384da13dfc923bbe65707/f/glibc-c-utf8-locale.patch

The discussion in the e-mail threads was somewhat about *optimizing* C.UTF-8 so that it takes less space... While I think that's great (and very advisable!) I think it's a separate step from starting to *ship* C.UTF-8 by default.

So... ship first, optimize later?

At this point, most major distros seem to be shipping it anyway, so why not include it upstream, so that at some point in the near future we know we can count on it on all distros?
Comment 4 Carlos O'Donell 2018-03-05 15:56:34 UTC
(In reply to Filipe Brandenburger from comment #3)
> So... ship first, optimize later?
> 
> At this point, most major distros seem to be shipping it anyways, so why not
> include it upstream so that at some point in the near future we know we can
> count on it on all distros?

The major distros ship a C.UTF-8 that does not function correctly for the purposes required by upstream. The code-point sorting order requirement fails, and it's not clear why. This is what I'm trying to fix right now.
Comment 5 James Cloos 2019-01-09 08:17:15 UTC
Out of curiosity, what is the code-point sorting order requirement which fails?

Ideally C.UTF-8 would sort exclusively by 10646 code point.  Exactly like C sorts by ascii code point.

Things like scripts or blocks should be ignored.

Currently when using LANG=en_US.UTF-8 I have to add LC_COLLATE=C LC_TIME=C to get reasonable results.
Comment 6 Carlos O'Donell 2019-01-09 16:34:08 UTC
(In reply to James Cloos from comment #5)
> Out of curiosity, what is the code-point sorting order requirement which
> fails?

We must sort by code-point ordering across *all* code points available in Unicode, and presently we have bugs in glibc (which we're tracking down) that break this. For example, I've noted that the 3-level tables we use support only 16-bit indexes, which we may overflow.

In practical terms when I try to collate over all code-points I get errors like:

; <U5> > 􃀡 ; <U103021>

Which is wrong, because in code-point ordering it should be the other way around: U+0005 sorts before U+103021.
 
> Ideally C.UTF-8 would sort exclusively by 10646 code point.  Exactly like C
> sorts by ascii code point.

We would use the Unicode code-point.
 
> Things like scripts or blocks should be ignored.

No. We want to sort *everything* deterministically, forever, and never change the results of this collation for as long as, say, UTF-8 is used.

> Currently when using LANG=en_US.UTF-8 I have to add LC_COLLATE=C LC_TIME=C
> to get reasonable results.

The C collation will cause lots of oddities as ignored collation symbols may have arbitrary weights (really weights determined by order in the source locale files).
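As an editorial aside on the 16-bit index limitation mentioned above: the overflow is easy to check numerically (a sketch, not glibc code):

```python
# Unicode defines code points U+0000 through U+10FFFF, far more than a
# 16-bit index can address, so a table reachable only through 16-bit
# indexes cannot cover them all.
total_code_points = 0x10FFFF + 1
max_16bit_entries = 2 ** 16
print(total_code_points, max_16bit_entries)  # 1114112 65536
assert total_code_points > max_16bit_entries
```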
Comment 7 James Cloos 2019-01-12 07:55:02 UTC
The initial part of your reply seems to agree with me, but the penultimate bit seems to disagree.

Unless we wrote past each other.

To be clear, I advocate the logical equivalent of converting each utf8 to utf32 and ordering them as int32_t.

Does that match your goal?
Comment 8 Carlos O'Donell 2019-01-12 13:40:00 UTC
(In reply to James Cloos from comment #7)
> To be clear, I advocate the logical equivalent of converting each utf8 to
> utf32 and ordering them as int32_t.
> 
> Does that match your goal?

Basically yes. That's what "code-point sorting" means.

To give examples:

U0041 "A" < U0042 "B" < U0391 "Α" < U0392 "Β" < U1F08 "Ἀ" < U1D6A8 "𝚨"

etc.

Does that match your expectations?
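The ordering given above can be checked mechanically: Python's default string comparison is already by Unicode code point, so sorted() reproduces the intended "code-point sorting" (an editorial illustration, not glibc's implementation):

```python
# The sample characters from the examples above, deliberately shuffled:
# Ἀ (U+1F08), B, 𝚨 (U+1D6A8), A, Β (U+0392), Α (U+0391).
chars = ["\u1F08", "\u0042", "\U0001D6A8", "\u0041", "\u0392", "\u0391"]

# str comparison in Python is plain code-point comparison, which is
# exactly the ordering being described.
expected = ["\u0041", "\u0042", "\u0391", "\u0392", "\u1F08", "\U0001D6A8"]
assert sorted(chars) == expected
print("".join(sorted(chars)))  # ABΑΒἈ𝚨
```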
Comment 9 James Cloos 2019-01-12 17:45:37 UTC
Perfect.

Can I offer any help in sorting /apologies/ out the issues?

I just tested a file with contents:

	U+0005
􃀡	U+103021
a	U+0061
𐁄	U+10044
7	U+0037

and with:

LANG=en_US.UTF-8
LC_COLLATE=C
LC_CTYPE=en_US.UTF-8
LC_TIME=C

it sorted correctly using coreutils' sort(1).

I've never noticed any anomalies with those env settings.

(I use LC_COLLATE=C to avoid caseless collation.  I have now twice reported the bug that en_US makes 'rm [a-z]*' remove files like README and Makefile.  Ulrich said he didn't care (that should be on the mailing list from back in the '90s), and someone recently said (in 2018) that the bug has been around too long to fix....)
Comment 10 Carlos O'Donell 2019-01-14 14:43:41 UTC
(In reply to James Cloos from comment #9)
> Perfect.
> 
> Can I offer any help in sorting /apologies/ out the issues?
> 
> I just tested a file with contents:
> 
> 	U+0005
> 􃀡	U+103021
> a	U+0061
> 𐁄	U+10044
> 7	U+0037
> 
> and with:
> 
> LANG=en_US.UTF-8
> LC_COLLATE=C
> LC_CTYPE=en_US.UTF-8
> LC_TIME=C
> 
> it sorted correctly using coreutils' sort(1).
> 
> I've never noticed any anomalies with those env settings.

You should use your distribution's C.UTF-8, and attempt to use strcoll_l to sort *every* code point and look at the results. You'll see they don't sort correctly.
 
> (I use LC_COLLATE=C to avoid caseless collation.  I have now twice reported
> the bug that en_US makes 'rm [a-z]*' remove files like README and Makefile.
> Ulrich said he didn't care (that should be on the mailing list from back in
> the '90s), and someone recently said (in 2018) that the bug has been around
> too long to fix....)

The problem is that [a-z]* in most locales uses Collation Element Ordering, and for many locales this means 'aAbBcC...zZ', so README is removed. This is just the way it works; it is not a bug, it is a function of the definition in the standard. If you want to avoid this, you must use LC_COLLATE=C or LC_COLLATE=C.UTF-8 (for UTF-8 support). Right now there is no official glibc C.UTF-8 locale, and that is what we're trying to implement with this bug.
Comment 11 benjaminmoody 2019-02-08 21:03:55 UTC
Something about this doesn't make sense to me.

Sorting Unicode strings by code point is exactly the same as sorting their UTF-8 representations by unsigned byte value; that's a major part of the reasoning behind UTF-8.

So, naively, one would expect that strcoll/C.UTF-8 == strcoll/C == strcmp, and strxfrm/C.UTF-8 == strxfrm/C == strlcpy.

Yet that seems not to be the case; on Debian 9, in the C.UTF-8 locale, strxfrm of "abcd" yields "cdef".  And (unlike the C locale) strcoll is many times slower than strcmp, and strxfrm is many times slower than strlcpy.

What's the difference?  Is there a requirement somewhere that strcoll must guard against invalid multibyte sequences, or that strxfrm's output must be a valid multibyte string?  Are there particular invalid UTF-8 sequences that, for some reason, *need* to be collated in a particular way?

If not, there's no reason for collation to be non-trivial in UTF-8.
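The equivalence appealed to above, that code-point order and UTF-8 byte order agree, can be spot-checked (an editorial illustration; it exercises Python's comparisons, not glibc's strcoll):

```python
import itertools

# Code points from several planes, including ones used earlier in this bug:
# U+0005, "7", "A", "a", Α (U+0391), Ἀ (U+1F08), 𐁄 (U+10044), U+103021.
samples = ["\u0005", "7", "A", "a", "\u0391", "\u1F08", "\U00010044", "\U00103021"]

# UTF-8 was designed so that byte-wise comparison of the encodings gives
# the same order as code-point comparison of the original strings.
for a, b in itertools.permutations(samples, 2):
    assert (a < b) == (a.encode("utf-8") < b.encode("utf-8"))
print("UTF-8 byte order matches code-point order")
```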
Comment 12 Florian Weimer 2019-02-08 21:40:15 UTC
(In reply to benjaminmoody from comment #11)
> Something about this doesn't make sense to me.
> 
> Sorting Unicode strings by code point is exactly the same as sorting their
> UTF-8 representations by unsigned byte value; that's a major part of the
> reasoning behind UTF-8.
> 
> So, naively, one would expect that strcoll/C.UTF-8 == strcoll/C == strcmp,
> and strxfrm/C.UTF-8 == strxfrm/C == strlcpy.
> 
> Yet that seems not to be the case; on Debian 9, in the C.UTF-8 locale,
> strxfrm of "abcd" yields "cdef".  And (unlike the C locale) strcoll is many
> times slower than strcmp, and strxfrm is many times slower than strlcpy.
> 
> What's the difference?  Is there a requirement somewhere that strcoll must
> guard against invalid multibyte sequences, or that strxfrm's output must be
> a valid multibyte string?  Are there particular invalid UTF-8 sequences
> that, for some reason, *need* to be collated in a particular way?

Currently, there are very few fast paths for UTF-8 in glibc (if any).  The multi-byte and wide character handling uses the generic (code-driven) gconv interfaces, or tables for collation and character classification which are not particularly efficient for today's requirements.

The nl_langinfo interface also provides access to a few tables that need to reflect collation (several entries under _NL_COLLATE_*).  The format of these tables, while undocumented, is part of the ABI, which we cannot change.  Unfortunately, this format is rather hostile to representing the full 21-bit Unicode range.
Comment 13 Carlos O'Donell 2019-11-12 14:20:53 UTC
When implementing this, we should consider having C.UTF-8 as a builtin; if we don't do this, we should file another bug to make it builtin, since that would simplify a lot of code and possibly make things faster.