20664 – Unexpected collation in en_US.UTF-8, different to ICU CLDR

Status:	RESOLVED FIXED

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	localedata
Version:	2.23

Importance:	P2 normal
Target Milestone:	2.33
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2016-10-03 22:19 UTC by mathew
Modified:	2021-10-11 21:00 UTC
CC List:	3 users

See Also:
Host:
Target:
Build:
Last reconfirmed:	2016-10-03 00:00:00

Flags:	fweimer: security-

Description mathew 2016-10-03 22:19:03 UTC

On Fedora 24 with glibc-2.23.1 I get the following interesting sort behavior:

% echo -e "+00\n-0c\n+02\n-02" | LC_ALL=en_US.UTF-8 sort
+00
-02
+02
-0c

On Mac OS X 10.11 I get less surprising behavior:

% echo -e "+00\n-0c\n+02\n-02" | LC_ALL=en_US.UTF-8 sort
+00
+02
-02
-0c

I've tried to reproduce the first result using <http://demo.icu-project.org/icu-bin/collation.html> but have not managed to find a set of options that will do so.

So I'm not sure if it is technically a bug, but I would say that it's at least unexpected and apparently diverges from ICU & CLDR.

Comment 1 Carlos O'Donell 2016-10-03 23:10:56 UTC

Going forward we want glibc to track CLDR more closely. Therefore, if you can find a glibc version that exhibits a meaningful difference from CLDR, then please file a report like this one.

However, you have too many moving pieces for us to validate this. For example, sort is not a good test case because it might not itself use glibc's collation tables for sorting.

Can you construct a test case with strcoll that exhibits this problem?

Comment 2 mathew 2016-10-03 23:47:20 UTC

I originally filed a bug against GNU coreutils, and was told that this is the behavior of glibc's strcoll, which coreutils uses for collation. See:

<http://debbugs.gnu.org/cgi/bugreport.cgi?bug=24601>

Comment 3 mathew 2016-10-04 16:23:44 UTC

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <locale.h>

int main() {
  char *str[4], *temp;
  int i, j, n, c;

  setlocale(LC_ALL, "en_US.UTF-8");

  str[0] = "+00";
  str[1] = "-0c";
  str[2] = "+02";
  str[3] = "-02";

  n = 4;
  for (i = 0; i < n; i++) {
    for (j = 0; j < n - 1; j++) {
      c = strcoll(str[j], str[j + 1]) > 0;
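      /* Note: this prints str[i] and str[j], not the pair being compared
         (str[j] and str[j + 1]), so the trace below is misleading. */
      printf("i = %d j = %d strcoll %s %s = %d\n", i, j, str[i], str[j], c);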
      if (c > 0) {
        temp = str[j];
        str[j] = str[j+1];
        str[j+1] = temp;
      }
    }
  }

  printf("\nSorted List:\n");
  for (i = 0; i < n; i++) {
    puts(str[i]);
  }

  return (0);
}

% ./a.out 
i = 0 j = 0 strcoll +00 +00 = 0
i = 0 j = 1 strcoll +00 -0c = 1
i = 0 j = 2 strcoll +00 -0c = 1
i = 1 j = 0 strcoll +02 +00 = 0
i = 1 j = 1 strcoll +02 +02 = 1
i = 1 j = 2 strcoll -02 +02 = 0
i = 2 j = 0 strcoll +02 +00 = 0
i = 2 j = 1 strcoll +02 -02 = 0
i = 2 j = 2 strcoll +02 +02 = 0
i = 3 j = 0 strcoll -0c +00 = 0
i = 3 j = 1 strcoll -0c -02 = 0
i = 3 j = 2 strcoll -0c +02 = 0

Sorted List:
+00
-02
+02
-0c

Comment 4 keld@keldix.com 2016-12-20 16:01:07 UTC

On Mon, Oct 03, 2016 at 11:10:56PM +0000, carlos at redhat dot com wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=20664
> 
> Carlos O'Donell <carlos at redhat dot com> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|UNCONFIRMED                 |WAITING
>    Last reconfirmed|                            |2016-10-03
>                  CC|                            |carlos at redhat dot com
>      Ever confirmed|0                           |1
> 
> --- Comment #1 from Carlos O'Donell <carlos at redhat dot com> ---
> Going forward we want glibc to track CLDR more closely. Therefore if you can
> find a glibc version that exhibits meaningful difference between CLDR, then
> please file a report, like this one.
> 
> However, you have too many moving pieces for us to validate this, for example
> sort is not a good test case because it might itself not use glibc's collation
> tables for sorting.
> 
> Can you construct a test case with strcoll that exhibits this problem?

I do not think we should aim at following CLDR closely, but we should minimize
differences. I actually think we should get CLDR to follow us more closely:-)

Best regards
keld

Comment 5 Carlos O'Donell 2016-12-21 19:15:50 UTC

(In reply to keld@keldix.com from comment #4)
> On Mon, Oct 03, 2016 at 11:10:56PM +0000, carlos at redhat dot com wrote:
> > https://sourceware.org/bugzilla/show_bug.cgi?id=20664
> > Can you construct a test case with strcoll that exhibits this problem?
> 
> I do not think we should aim at following CLDR closely, but we should
> minimize
> differences. I actually think we should get CLDR to follow us more closely:-)

I certainly agree that harmonization between both projects would be a great goal. Having the best of both projects would be great. When I say "following CLDR", it is probably more accurate to say "harmonized with CLDR", so I will endeavour to use such language in the future.

Comment 6 Kirill Elagin 2021-10-11 20:18:56 UTC

I am getting collation results as expected (meaning, no difference between en_US.UTF-8 and POSIX) for the example strings with glibc 2.32.

Is this issue safe to close?

Comment 7 Carlos O'Donell 2021-10-11 20:51:47 UTC

(In reply to Kirill Elagin from comment #6)
> I am getting collation results as expected (meaning, no difference between
> en_US.UTF-8 and POSIX) for the example strings with glibc 2.32.
> 
> Is this issue safe to close?

In glibc 2.32 we upgraded to Unicode 13.0.0, and glibc 2.35 (Feb 2, 2022) will include Unicode 14.0.0 support. Neither of these updates substantially changed collation (which sort relies on). However, I agree with you that on Fedora 34 with glibc 2.33 we get matching results:

echo -e "+00\n-0c\n+02\n-02" | LC_ALL=en_US.UTF-8 sort
+00
+02
-02
-0c

The collation data always had <U002B> < <U002D> which results in + < -. I'm marking this as RESOLVED/FIXED in glibc 2.33. We can reopen if we run into this again to determine what is the root cause of the original mis-ordering in 2.32.

Comment 8 Kirill Elagin 2021-10-11 21:00:37 UTC

Just FTR, the original issue was reported against 2.23 (not 2.32).