Bug 21302 - strcoll does not correctly follow locale-specified order in some cases
Status: ASSIGNED
Alias: None
Product: glibc
Classification: Unclassified
Component: locale
Version: 2.23
Importance: P2 normal
Target Milestone: ---
Assignee: Carlos O'Donell
URL:
Keywords:
Depends on:
Blocks: 17318
Reported: 2017-03-24 11:12 UTC by David Kamholz
Modified: 2019-01-02 08:45 UTC (History)

See Also:
Host:
Target:
Build:
Last reconfirmed: 2017-10-28 00:00:00
fweimer: security-


Attachments
test file (96 bytes, text/plain)
2017-03-24 11:12 UTC, David Kamholz
C.UTF-8 locale file (833.02 KB, text/plain)
2017-03-24 11:23 UTC, David Kamholz
C program that reproduces the issue (150 bytes, text/x-csrc)
2017-03-24 20:30 UTC, David Kamholz
swbz21302-repro.c (472 bytes, text/x-csrc)
2017-03-27 19:04 UTC, Carlos O'Donell

Description David Kamholz 2017-03-24 11:12:57 UTC
Created attachment 9939
test file

Consider the following file sorttest.txt, pre-sorted in Unicode codepoint order:

!
ズざら
セーリングボートは
モエ
¥
𐀎
𐀘
𐀛
𫛛
𫛞
𫛢
𫛭
𫛶
𫛸
𫟷
𫟼

If I run "LC_COLLATE=C sort sorttest.txt", which uses the hard-coded C locale, the output is unchanged: it is sorted in codepoint order, as expected. However, if I run "LC_COLLATE=C.UTF-8 sort sorttest.txt" on Ubuntu, whose C.UTF-8 locale file defines collation simply as the list of codepoints in order, I get the following unexpected result:

𐀎
𐀘
𐀛
𫛛
𫛞
𫛢
𫛭
𫛶
𫛸
𫟷
𫟼
!
ズざら
セーリングボートは
モエ
¥

To get more detail on what's going on, one can run:

$ LC_ALL=C.UTF-8 sort sorttest.txt | perl -CSAD -ne 'chomp; printf "%s\tU+%05X\n", $_, ord'
𐀎	U+1000E
𐀘	U+10018
𐀛	U+1001B
𫛛	U+2B6DB
𫛞	U+2B6DE
𫛢	U+2B6E2
𫛭	U+2B6ED
𫛶	U+2B6F6
𫛸	U+2B6F8
𫟷	U+2B7F7
𫟼	U+2B7FC
!	U+00021
ズざら	U+0FF7D
セーリングボートは	U+0FF7E
モエ	U+0FF93
¥	U+0FFE5

Another example:

$ perl -CSAD -E 'for my $b (0, 0xF000, 0x10000) { for my $c (0x00, 0x01, 0x21) { $_ = $b + $c; printf "%s\tU+%05X\n", chr, $_} }' | LC_COLLATE=C.UTF-8 sort

	U+00000
𐀀	U+10000
𐀁	U+10001
𐀡	U+10021
	U+00001
!	U+00021
	U+0F000
	U+0F001
	U+0F021

It appears that codepoints above U+FFFF sort before all others, except that U+0000 somehow always comes first.
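
For reference, the expected order can be computed without involving any locale. The following is a minimal sketch, not part of the original report: the helper names utf8_encode and expected_order are hypothetical. It relies on the fact that comparing valid UTF-8 strings byte by byte with strcmp gives the same result as comparing their codepoints.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: encode one codepoint as a NUL-terminated UTF-8
   string. buf must hold at least 5 bytes. Returns the number of bytes
   written (excluding the NUL), or 0 for U+0000, surrogates, and values
   above U+10FFFF, in which case buf holds the empty string. */
size_t utf8_encode(unsigned long cp, char *buf)
{
    size_t n;

    assert(buf != NULL);
    if (cp == 0 || cp > 0x10FFFFUL || (cp >= 0xD800UL && cp <= 0xDFFFUL)) {
        n = 0;
    } else if (cp < 0x80UL) {
        buf[0] = (char)cp;
        n = 1;
    } else if (cp < 0x800UL) {
        buf[0] = (char)(0xC0 | (cp >> 6));
        buf[1] = (char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000UL) {
        buf[0] = (char)(0xE0 | (cp >> 12));
        buf[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (char)(0x80 | (cp & 0x3F));
        n = 3;
    } else {
        buf[0] = (char)(0xF0 | (cp >> 18));
        buf[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (char)(0x80 | (cp & 0x3F));
        n = 4;
    }
    buf[n] = '\0';
    return n;
}

/* Hypothetical helper: the order two single-codepoint strings should have
   under codepoint collation. Returns -1, 0 or 1. strcmp compares bytes as
   unsigned char, so byte order on valid UTF-8 equals codepoint order. */
int expected_order(unsigned long a, unsigned long b)
{
    char sa[5], sb[5];
    int r;

    utf8_encode(a, sa);
    utf8_encode(b, sb);
    r = strcmp(sa, sb);
    return (r > 0) - (r < 0);
}
```

With these helpers, expected_order(0x21, 0x1000E) is negative, i.e. "!" should sort before U+1000E, which is the opposite of what the C.UTF-8 output above shows.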

It's definitely not just the "sort" command that's broken. I first noticed this issue in a PostgreSQL database that was using the C.UTF-8 locale's collation order. Given the straightforwardness of the locale file in question (/usr/share/i18n/locales/C on Ubuntu), it's hard to believe the fault lies outside glibc. 

The above commands were tested on Ubuntu 16.04 with glibc 2.23, but the same issue has been reproduced on earlier and later versions of glibc (2.19, 2.24, 2.25).
Comment 1 David Kamholz 2017-03-24 11:15:37 UTC
The Ubuntu C.UTF-8 locale can be downloaded from http://packages.ubuntu.com/yakkety/locales.
Comment 2 David Kamholz 2017-03-24 11:23:33 UTC
Created attachment 9940
C.UTF-8 locale file
Comment 3 Carlos O'Donell 2017-03-24 14:25:05 UTC
(In reply to David Kamholz from comment #0)
> It's definitely not just the "sort" command that's broken. I first noticed
> this issue in a PostgreSQL database that was using the C.UTF-8 locale's
> collation order. Given the straightforwardness of the locale file in
> question (/usr/share/i18n/locales/C on Ubuntu), it's hard to believe the
> fault lies outside glibc. 
> 
> The above commands were tested on Ubuntu 16.04 with glibc 2.23, but the same
> issue has been reproduced on earlier and later versions of glibc (2.19,
> 2.24, 2.25).

Could you please put together a strcoll-based test case that shows the issue? That way I can take this upstream to discuss.
Comment 4 David Kamholz 2017-03-24 19:20:03 UTC
Isn't using the sort command already "strcoll-based"? I've checked its source, and it relies on strcoll for sorting. The fact that sort and PostgreSQL produce identical results suggests it's not a quirk or bug of either implementation, but rather a result of their shared reliance on strcoll.

By upstream do you mean Ubuntu, or what? It's hard to imagine there's an issue with how Ubuntu defined the collation order, since it's literally just a list of thousands of codepoints in order. Conceivably there's something wrong with the locale generated from it, but since they just use locale-gen, I suspect that the attached C.UTF-8 locale definition and test file are enough to reproduce the issue with any glibc.
Comment 5 Carlos O'Donell 2017-03-24 19:44:58 UTC
(In reply to David Kamholz from comment #4)
> Isn't using the sort command already "strcoll-based"? I've checked its
> source, and it relies on strcoll for sorting. The fact that sort and
> PostgreSQL produce identical results suggests it's not a quirk or bug of
> either implementation, but rather a result of their shared reliance on
> strcoll.

It's easier to analyze and pass the test around if it's a single C source file that can be compiled and used to verify the problem. In this case I'm asking for your help to reduce the problem down to the smallest possible test. And the answer is "No": some versions of sort don't use strcoll and instead had custom collation code. Modern sort should use strcoll, but may not, depending on distro patches.
 
> By upstream do you mean Ubuntu, or what? It's hard to imagine there's an
> issue with how Ubuntu defined the collation order, since it's literally just
> a list of thousands of codepoints in order. Conceivably there's something
> wrong with the locale generated from it, but since they just use locale-gen,
> I suspect that the attached C.UTF-8 locale definition and test file are
> enough to reproduce the issue with any glibc.

By upstream I mean libc-alpha@sourceware.org. There might be a problem in the forward sorting.
Comment 6 David Kamholz 2017-03-24 20:30:37 UTC
Created attachment 9943
C program that reproduces the issue

OK, I've attached a short C program that reproduces the issue.
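
For readers without access to the attachment, a strcoll-based check could look roughly like the sketch below. This is not the attached program. The function name collate_sign and its return convention are assumptions made for illustration. It selects a collation locale, compares two strings with strcoll, and restores the previous setting.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper (not the attached reproducer): switch LC_COLLATE to
   locale_name, compare a and b with strcoll, then restore the previous
   locale. Returns -1, 0 or 1 for the collation order, or 2 if the locale
   could not be selected. */
int collate_sign(const char *locale_name, const char *a, const char *b)
{
    char saved[256];
    const char *cur;
    int r;

    assert(locale_name != NULL && a != NULL && b != NULL);

    /* The string returned by setlocale may be overwritten by later
       calls, so keep a private copy of the current setting. */
    cur = setlocale(LC_COLLATE, NULL);
    if (cur == NULL || strlen(cur) >= sizeof saved)
        return 2;
    strcpy(saved, cur);

    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return 2;

    r = strcoll(a, b);
    setlocale(LC_COLLATE, saved);
    return (r > 0) - (r < 0);
}
```

Calling collate_sign("C.UTF-8", "!", "\xF0\x90\x80\x8E") compares "!" with U+1000E. If the locale really collates in codepoint order, the result should be -1. On the affected systems described in this report, the comparison comes out the other way round.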
Comment 7 David Kamholz 2017-03-27 16:38:07 UTC
Any update on this? Did you receive the program and were you able to reproduce the issue? I just want to make sure it doesn't get lost. :-)
Comment 8 Carlos O'Donell 2017-03-27 19:03:28 UTC
(In reply to David Kamholz from comment #7)
> Any update on this? Did you receive the program and were you able to
> reproduce the issue? I just want to make sure it doesn't get lost. :-)

I have the reproducer, and I agree that we don't appear to be sorting according to the rules in the locale.

New locale: C.UTF-8
FAIL: ! >= U+100E (32)
FAIL: ¥ >= U+100E (236)
PASS: ! < ¥ (-204)
PASS: ! < ¡ (-159)
PASS: ! < U+800 (-188)
FAIL ! >= U+10000 (32)
Comment 9 Carlos O'Donell 2017-03-27 19:04:49 UTC
Created attachment 9951
swbz21302-repro.c

#!/bin/bash
set -x
set -e
BUILD=/home/carlos/build/glibc
gcc -Wl,--dynamic-linker=$BUILD/elf/ld.so -Wl,-rpath=$BUILD:$BUILD/elf -Wall -pedantic -O0 -g3 -o swbz21302-repro swbz21302-repro.c

[carlos@athas swbz21302]$ ./swbz21302-repro 
New locale: C.UTF-8
FAIL: ! >= U+100E (32)
FAIL: ¥ >= U+100E (236)
PASS: ! < ¥ (-204)
PASS: ! < ¡ (-159)
PASS: ! < U+800 (-188)
FAIL ! >= U+10000 (32)
Comment 10 Carlos O'Donell 2017-10-12 07:36:18 UTC
While working on C.UTF-8 for upstream (one that does no transliteration, but should sort via code points) I added a new collation test that shows this problem. I'll have to work out what's going on in string/strcoll_l.c that is causing collation issues for the 4-byte UTF-8 characters.
Comment 11 Carlos O'Donell 2017-10-28 02:26:30 UTC
OK, I have fixed the code-point collation sorting issue.

There are 2 problems:

(a) The collation table builder ignores characters in the collation specification, and so assigns them no weights, if they do not exactly match the hash of the symbolic name from the charmap. This is arguably a QoI issue, but it needs an explicit warning for all UTF-8 locales to catch typos in the collation tables.

(b) Since the UTF-8 charmap uses 4 or 8 character code point names, the collation must also use *identically* matching symbols, or those symbols are silently ignored and have no weights. This is where the Debian and Fedora collations got it wrong: effectively, we have giant ranges of typos (and ellipses generating typos in the thousands) that do not have correct weights.

Once I added the new warnings for (a), I could find all the problems with the locale file and fix (b).
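
To make (b) concrete, here is a toy sketch of exact-name matching. It is not localedef code. The names ucs_symbol, toy_charmap and toy_charmap_has are hypothetical, and the three-entry table stands in for the real UTF-8 charmap. It formats a code point using the 4-or-8 hex digit naming described above, then looks the name up by exact string comparison, so any other spelling of the same code point is simply not found.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the symbolic name for a code point into buf,
   using 4 hex digits up to U+FFFF and 8 hex digits above, as described in
   this comment. Returns the name length, or -1 on error or truncation. */
int ucs_symbol(unsigned long cp, char *buf, size_t size)
{
    int n;

    assert(buf != NULL);
    if (cp > 0x10FFFFUL)
        return -1;
    if (cp <= 0xFFFFUL)
        n = snprintf(buf, size, "<U%04lX>", cp);
    else
        n = snprintf(buf, size, "<U%08lX>", cp);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Toy stand-in for the charmap: a few names in the 4-or-8 digit form. */
static const char *const toy_charmap[] = {
    "<U0021>", "<UFFE5>", "<U0001000E>"
};

/* Hypothetical lookup: a collation symbol is only found if its spelling
   matches a charmap name exactly. Returns 1 if found, 0 otherwise. */
int toy_charmap_has(const char *symbol)
{
    size_t i;

    assert(symbol != NULL);
    for (i = 0; i < sizeof toy_charmap / sizeof toy_charmap[0]; i++)
        if (strcmp(toy_charmap[i], symbol) == 0)
            return 1;
    return 0;
}
```

In this toy model, the 8-digit name produced by ucs_symbol for U+1000E is found, while a shorter spelling such as "<U1000E>" is not, even though it refers to the same code point. The shorter spelling is only an illustrative example of a mismatch; the thread does not say which spellings the distro locale files actually used.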

To solve this I'm adding a new --warning=missingcollchar warning, which I plan to turn on for all locales compiled with UTF-8. It will also be turned on by verbose, so that users can see these warnings when developing a locale. We cannot turn them on by default because a collation sequence is entirely allowed to contain characters that do not exist in the charmap you are using, and those can be safely ignored.

After that I'm going to send my C.UTF-8 patch upstream for review so all the distros can have a harmonized C.UTF-8 to use with correct collation.
Comment 12 Zbigniew Jędrzejewski-Szmek 2019-01-02 08:45:38 UTC
What's the status here?