On Fedora 24 with glibc-2.23.1 I get the following interesting sort behavior:

% echo -e "+00\n-0c\n+02\n-02" | LC_ALL=en_US.UTF-8 sort
+00
-02
+02
-0c

On Mac OS X 10.11 I get less surprising behavior:

% echo -e "+00\n-0c\n+02\n-02" | LC_ALL=en_US.UTF-8 sort
+00
+02
-02
-0c

I've tried to reproduce the first result using <http://demo.icu-project.org/icu-bin/collation.html> but have not managed to find a set of options that will do so. So I'm not sure if it is technically a bug, but I would say that it's at least unexpected and apparently diverges from ICU & CLDR.
Going forward we want glibc to track CLDR more closely. Therefore, if you can find a glibc version that exhibits a meaningful difference from CLDR, then please file a report, like this one.

However, you have too many moving pieces for us to validate this. For example, sort is not a good test case because it might itself not use glibc's collation tables for sorting.

Can you construct a test case with strcoll that exhibits this problem?
I originally filed a bug against GNU coreutils, and was told that this is the behavior of strcoll from glibc, which coreutils uses for collation. See: <http://debbugs.gnu.org/cgi/bugreport.cgi?bug=24601>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <locale.h>

int main()
{
    char *str[4], *temp;
    int i, j, n, c;

    setlocale(LC_ALL, "en_US.UTF-8");
    str[0] = "+00";
    str[1] = "-0c";
    str[2] = "+02";
    str[3] = "-02";
    n = 4;

    /* Bubble sort using strcoll on adjacent elements. */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n - 1; j++) {
            c = strcoll(str[j], str[j + 1]) > 0;
            /* Note: this prints str[i] and str[j], not the pair actually
               compared (str[j] and str[j + 1]); the log below reflects that. */
            printf("i = %d j = %d strcoll %s %s = %d\n", i, j, str[i], str[j], c);
            if (c > 0) {
                temp = str[j];
                str[j] = str[j+1];
                str[j+1] = temp;
            }
        }
    }

    printf("\nSorted List:\n");
    for (i = 0; i < n; i++) {
        puts(str[i]);
    }
    return (0);
}

% ./a.out
i = 0 j = 0 strcoll +00 +00 = 0
i = 0 j = 1 strcoll +00 -0c = 1
i = 0 j = 2 strcoll +00 -0c = 1
i = 1 j = 0 strcoll +02 +00 = 0
i = 1 j = 1 strcoll +02 +02 = 1
i = 1 j = 2 strcoll -02 +02 = 0
i = 2 j = 0 strcoll +02 +00 = 0
i = 2 j = 1 strcoll +02 -02 = 0
i = 2 j = 2 strcoll +02 +02 = 0
i = 3 j = 0 strcoll -0c +00 = 0
i = 3 j = 1 strcoll -0c -02 = 0
i = 3 j = 2 strcoll -0c +02 = 0

Sorted List:
+00
-02
+02
-0c
On Mon, Oct 03, 2016 at 11:10:56PM +0000, carlos at redhat dot com wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=20664
>
> Carlos O'Donell <carlos at redhat dot com> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|UNCONFIRMED                 |WAITING
>    Last reconfirmed|                            |2016-10-03
>                  CC|                            |carlos at redhat dot com
>      Ever confirmed|0                           |1
>
> --- Comment #1 from Carlos O'Donell <carlos at redhat dot com> ---
> Going forward we want glibc to track CLDR more closely. Therefore if you can
> find a glibc version that exhibits meaningful difference between CLDR, then
> please file a report, like this one.
>
> However, you have too many moving pieces for us to validate this, for example
> sort is not a good test case because it might itself not use glibc's collation
> tables for sorting.
>
> Can you construct a test case with strcoll that exhibits this problem?

I do not think we should aim at following CLDR closely, but we should minimize differences. I actually think we should get CLDR to follow us more closely:-)

Best regards
keld
(In reply to keld@keldix.com from comment #4)
> On Mon, Oct 03, 2016 at 11:10:56PM +0000, carlos at redhat dot com wrote:
> > https://sourceware.org/bugzilla/show_bug.cgi?id=20664
> > Can you construct a test case with strcoll that exhibits this problem?
>
> I do not think we should aim at following CLDR closely, but we should
> minimize differences. I actually think we should get CLDR to follow us
> more closely:-)

I certainly agree that harmonization between both projects would be a great goal. Having the best of both projects would be great.

When I say "following CLDR", it is probably more accurate to say "harmonized with CLDR." So I will endeavour to use such language in the future.
I am getting collation results as expected (meaning, no difference between en_US.UTF-8 and POSIX) for the example strings with glibc 2.32. Is this issue safe to close?
(In reply to Kirill Elagin from comment #6)
> I am getting collation results as expected (meaning, no difference between
> en_US.UTF-8 and POSIX) for the example strings with glibc 2.32.
>
> Is this issue safe to close?

In glibc 2.32 we upgraded to Unicode 13.0.0, and glibc 2.35 (Feb 2, 2022) will include Unicode 14.0.0 support. Neither of these updates substantially changed collation (involved in sort).

However, I agree with you: on Fedora 34 with glibc 2.33 we get matching results:

echo -e "+00\n-0c\n+02\n-02" | LC_ALL=en_US.UTF-8 sort
+00
+02
-02
-0c

The collation data always had <U002B> < <U002D>, which results in + < -.

I'm marking this as RESOLVED/FIXED in glibc 2.33. We can reopen if we run into this again to determine what is the root cause of the original mis-ordering in 2.32.
Just FTR, the original issue was reported against 2.23 (not 2.32).