Bug 17318 - [RFE] Provide a C.UTF-8 locale by default
Summary: [RFE] Provide a C.UTF-8 locale by default
Status: NEW
Alias: None
Product: glibc
Classification: Unclassified
Component: locale
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Depends on: 21302
Blocks: 16621
Reported: 2014-08-27 12:57 UTC by Nick Coghlan
Modified: 2019-11-12 14:20 UTC
CC List: 17 users

fweimer: security-


Description Nick Coghlan 2014-08-27 12:57:32 UTC
Fedora doesn't currently provide the C.UTF-8 locale. In the RFE requesting it (https://bugzilla.redhat.com/show_bug.cgi?id=902094), it was suggested that a more appropriate approach would be to provide it as part of upstream glibc, at which point Fedora would inherit it by default.

Hence, this RFE to request the inclusion of a C.UTF-8 locale by default.

My personal interest relates to Python 3, where "LANG=C" misconfigures a few aspects to use ASCII, when they really should be using UTF-8. While I'd actually like to fix that on the Python side in the long run, being able to set "LANG=C.UTF-8" instead is a solution that already works for existing versions of Python 3.

Bug #16621 suggests that C.UTF-8 may actually require special casing in glibc in order to be handled correctly. If that's accurate, then it would strengthen the case for including the locale in the upstream library.
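The Python 3 misconfiguration described above can be observed directly. A minimal probe, assuming a glibc-based system (this is an editorial illustration, not part of the original report):

```python
import locale

# Switch to the plain "C" locale and ask the C library which codeset it implies.
# Python 3 derives its default I/O encoding from this.
locale.setlocale(locale.LC_ALL, "C")
codeset = locale.nl_langinfo(locale.CODESET)
print(codeset)  # on glibc: "ANSI_X3.4-1968", i.e. ASCII
```

With LANG=C.UTF-8 (on systems where that locale exists), the reported codeset is UTF-8 instead, which is exactly the behavior the RFE wants to be able to rely on everywhere.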
Comment 1 Carlos O'Donell 2015-02-11 15:39:27 UTC
(In reply to Nick Coghlan from comment #0)
> Fedora doesn't currently provide the C.UTF-8 locale. In the RFE requesting
> it (https://bugzilla.redhat.com/show_bug.cgi?id=902094), it was suggested
> that a more appropriate approach would be to provide it as part of upstream
> glibc, at which point Fedora would inherit it by default.
> 
> Hence, this RFE to request the inclusion of a C.UTF-8 locale by default.
> 
> My personal interest relates to Python 3, where "LANG=C" misconfigures a few
> aspects to use ASCII, when they really should be using UTF-8. While I'd
> actually like to fix that on the Python side in the long run, being able to
> set "LANG=C.UTF-8" instead is a solution that already works for existing
> versions of Python 3.
> 
> Bug #16621 suggests that C.UTF-8 may actually require special casing in
> glibc in order to be handled correctly. If that's accurate, then it would
> strengthen the case for including the locale in the upstream library.

I agree that this is a good idea. Someone needs to do the work and submit it to libc-alpha. It's not all that easy, and consensus needs to be reached about the inclusion of ~1.5MB of UTF-8 data into the runtime.
Comment 2 Nick Coghlan 2015-02-25 23:02:06 UTC
Reference to the libc-alpha mailing list discussion with additional technical details: https://sourceware.org/ml/libc-alpha/2015-02/msg00247.html
Comment 3 Filipe Brandenburger 2018-03-03 23:26:12 UTC
Just wanted to point out that Fedora has included C.UTF-8 since circa 2015...

The patch they use is here (and, in fact, it seems to come from a Red Hat employee who contributes often to glibc):

https://src.fedoraproject.org/rpms/glibc/blob/0457f649e3fe6299efe384da13dfc923bbe65707/f/glibc-c-utf8-locale.patch

The discussion in the e-mail threads was somewhat about *optimizing* C.UTF-8 so that it takes less space... While I think that's great (and very advisable!) I think it's a separate step from starting to *ship* C.UTF-8 by default.

So... ship first, optimize later?

At this point, most major distros seem to be shipping it anyway, so why not include it upstream, so that at some point in the near future we know we can count on it on all distros?
Comment 4 Carlos O'Donell 2018-03-05 15:56:34 UTC
(In reply to Filipe Brandenburger from comment #3)
> So... ship first, optimize later?
> 
> At this point, most major distros seem to be shipping it anyways, so why not
> include it upstream so that at some point in the near future we know we can
> count on it on all distros?

The major distros ship a C.UTF-8 that does not function correctly for the purposes required by upstream. The code-point sorting order requirement fails, and it's not clear why. This is what I'm trying to fix right now.
Comment 5 James Cloos 2019-01-09 08:17:15 UTC
Out of curiosity, what is the code-point sorting order requirement which fails?

Ideally C.UTF-8 would sort exclusively by 10646 code point.  Exactly like C sorts by ascii code point.

Things like scripts or blocks should be ignored.

Currently when using LANG=en_US.UTF-8 I have to add LC_COLLATE=C LC_TIME=C to get reasonable results.
Comment 6 Carlos O'Donell 2019-01-09 16:34:08 UTC
(In reply to James Cloos from comment #5)
> Out of curiosity, what is the code-point sorting order requirement which
> fails?

We must sort by code-point ordering across *all* code points available in Unicode, and presently we have bugs in glibc (which we're tracking down) that break this. For example, I've noted that the 3-level tables we use support only 16-bit indexes, which we may overflow.

In practical terms when I try to collate over all code-points I get errors like:

; <U5> > 􃀡 ; <U103021>

Which is wrong, because in code-point ordering it should be the other way around: U+0005 sorts before U+103021.
 
> Ideally C.UTF-8 would sort exclusively by 10646 code point.  Exactly like C
> sorts by ascii code point.

We would use the Unicode code-point.
 
> Things like scripts or blocks should be ignored.

No. We want to sort *everything* deterministically, forever, and never change the results of this collation for as long as, say, UTF-8 is used.

> Currently when using LANG=en_US.UTF-8 I have to add LC_COLLATE=C LC_TIME=C
> to get reasonable results.

The C collation will cause lots of oddities as ignored collation symbols may have arbitrary weights (really weights determined by order in the source locale files).
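As an editorial aside on the 16-bit index limitation mentioned above: the overflow is easy to check numerically (a sketch, not glibc code):

```python
# Unicode defines code points U+0000 through U+10FFFF, far more than a
# 16-bit index can address, so a table reachable only through 16-bit
# indexes cannot cover them all.
total_code_points = 0x10FFFF + 1
max_16bit_entries = 2 ** 16
print(total_code_points, max_16bit_entries)  # 1114112 65536
assert total_code_points > max_16bit_entries
```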
Comment 7 James Cloos 2019-01-12 07:55:02 UTC
The initial part of your reply seems to agree with me, but the penultimate bit seems to disagree.

Unless we wrote past each other.

To be clear, I advocate the logical equivalent of converting each utf8 to utf32 and ordering them as int32_t.

Does that match your goal?
Comment 8 Carlos O'Donell 2019-01-12 13:40:00 UTC
(In reply to James Cloos from comment #7)
> To be clear, I advocate the logical equivalent of converting each utf8 to
> utf32 and ordering them as int32_t.
> 
> Does that match your goal?

Basically yes. That's what "code-point sorting" means.

To give examples:

U0041 "A" < U0042 "B" < U0391 "Α" < U0392 "Β" < U1F08 "Ἀ" < U1D6A8 "𝚨"

etc.

Does that match your expectations?
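The ordering given above can be checked mechanically: Python's default string comparison is already by Unicode code point, so sorted() reproduces the intended "code-point sorting" (an editorial illustration, not glibc's implementation):

```python
# The sample characters from the examples above, deliberately shuffled:
# Ἀ (U+1F08), B, 𝚨 (U+1D6A8), A, Β (U+0392), Α (U+0391).
chars = ["\u1F08", "\u0042", "\U0001D6A8", "\u0041", "\u0392", "\u0391"]

# str comparison in Python is plain code-point comparison, which is
# exactly the ordering being described.
expected = ["\u0041", "\u0042", "\u0391", "\u0392", "\u1F08", "\U0001D6A8"]
assert sorted(chars) == expected
print("".join(sorted(chars)))  # ABΑΒἈ𝚨
```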
Comment 9 James Cloos 2019-01-12 17:45:37 UTC
Perfect.

Can I offer any help in sorting /apologies/ out the issues?

I just tested a file with contents:

	U+0005
􃀡	U+103021
a	U+0061
𐁄	U+10044
7	U+0037

and with:

LANG=en_US.UTF-8
LC_COLLATE=C
LC_CTYPE=en_US.UTF-8
LC_TIME=C

it sorted correctly using coreutils' sort(1).

I've never noticed any anomalies with those env settings.

(I use LC_COLLATE=C to avoid caseless collation.  I have now twice reported the bug that en_US makes 'rm [a-z]*' remove files like README and Makefile.  Ulrich said he didn't care (that should be on the mailing list from back in the '90s), and someone recently said (in 2018) that the bug has been around too long to fix....)
Comment 10 Carlos O'Donell 2019-01-14 14:43:41 UTC
(In reply to James Cloos from comment #9)
> Perfect.
> 
> Can I offer any help in sorting /apologies/ out the issues?
> 
> I just tested a file with contents:
> 
> 	U+0005
> 􃀡	U+103021
> a	U+0061
> 𐁄	U+10044
> 7	U+0037
> 
> and with:
> 
> LANG=en_US.UTF-8
> LC_COLLATE=C
> LC_CTYPE=en_US.UTF-8
> LC_TIME=C
> 
> it sorted correctly using coreutils' sort(1).
> 
> I've never noticed any anomalies with those env settings.

You should use your distribution's C.UTF-8, and attempt to use strcoll_l to sort *every* code point and look at the results. You'll see they don't sort correctly.
 
> (I use LC_COLLATE=C to avoid caseless collation.  I have now twice reported
> the bug that en_US makes 'rm [a-z]*' remove files like README and Makefile.
> Ulrich said he didn't care (that should be on the mailing list from back in
> the '90s), and someone recently said (in 2018) that the bug has been around
> too long to fix....)

The problem is that [a-z]* in most locales uses Collation Element Ordering, and for many locales this means 'aAbBcC...zZ', so README is removed. This is just the way it works; it is not a bug, it is a function of the definition in the standard. If you want to avoid this, you must use LC_COLLATE=C or LC_COLLATE=C.UTF-8 (for UTF-8 support). Right now there is no official glibc C.UTF-8 locale, and that is what we're trying to implement with this bug.
Comment 11 benjaminmoody 2019-02-08 21:03:55 UTC
Something about this doesn't make sense to me.

Sorting Unicode strings by code point is exactly the same as sorting their UTF-8 representations by unsigned byte value; that's a major part of the reasoning behind UTF-8.

So, naively, one would expect that strcoll/C.UTF-8 == strcoll/C == strcmp, and strxfrm/C.UTF-8 == strxfrm/C == strlcpy.

Yet that seems not to be the case; on Debian 9, in the C.UTF-8 locale, strxfrm of "abcd" yields "cdef".  And (unlike the C locale) strcoll is many times slower than strcmp, and strxfrm is many times slower than strlcpy.

What's the difference?  Is there a requirement somewhere that strcoll must guard against invalid multibyte sequences, or that strxfrm's output must be a valid multibyte string?  Are there particular invalid UTF-8 sequences that, for some reason, *need* to be collated in a particular way?

If not, there's no reason for collation to be non-trivial in UTF-8.
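The equivalence appealed to above, that code-point order and UTF-8 byte order agree, can be spot-checked (an editorial illustration; it exercises Python's comparisons, not glibc's strcoll):

```python
import itertools

# Code points from several planes, including ones used earlier in this bug:
# U+0005, "7", "A", "a", Α (U+0391), Ἀ (U+1F08), 𐁄 (U+10044), U+103021.
samples = ["\u0005", "7", "A", "a", "\u0391", "\u1F08", "\U00010044", "\U00103021"]

# UTF-8 was designed so that byte-wise comparison of the encodings gives
# the same order as code-point comparison of the original strings.
for a, b in itertools.permutations(samples, 2):
    assert (a < b) == (a.encode("utf-8") < b.encode("utf-8"))
print("UTF-8 byte order matches code-point order")
```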
Comment 12 Florian Weimer 2019-02-08 21:40:15 UTC
(In reply to benjaminmoody from comment #11)
> Something about this doesn't make sense to me.
> 
> Sorting Unicode strings by code point is exactly the same as sorting their
> UTF-8 representations by unsigned byte value; that's a major part of the
> reasoning behind UTF-8.
> 
> So, naively, one would expect that strcoll/C.UTF-8 == strcoll/C == strcmp,
> and strxfrm/C.UTF-8 == strxfrm/C == strlcpy.
> 
> Yet that seems not to be the case; on Debian 9, in the C.UTF-8 locale,
> strxfrm of "abcd" yields "cdef".  And (unlike the C locale) strcoll is many
> times slower than strcmp, and strxfrm is many times slower than strlcpy.
> 
> What's the difference?  Is there a requirement somewhere that strcoll must
> guard against invalid multibyte sequences, or that strxfrm's output must be
> a valid multibyte string?  Are there particular invalid UTF-8 sequences
> that, for some reason, *need* to be collated in a particular way?

Currently, there are very few fast paths for UTF-8 in glibc (if any).  The multi-byte and wide character handling uses the generic (code-driven) gconv interfaces, or tables for collation and character classification which are not particularly efficient for today's requirements.

The nl_langinfo interface also provides access to a few tables that need to reflect collation (several entries under _NL_COLLATE_*).  The format of these tables, while undocumented, is part of the ABI, which we cannot change.  Unfortunately, this format is rather hostile to representing the full 21-bit Unicode range.
Comment 13 Carlos O'Donell 2019-11-12 14:20:53 UTC
When implementing this, we should consider having C.UTF-8 as a builtin; if we don't do this, we should file another bug to make it builtin, since that would simplify a lot of code and possibly make things faster.