This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.



[Bug locale/21302] New: strcoll does not correctly follow locale-specified order in some cases


https://sourceware.org/bugzilla/show_bug.cgi?id=21302

            Bug ID: 21302
           Summary: strcoll does not correctly follow locale-specified
                    order in some cases
           Product: glibc
           Version: 2.23
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: locale
          Assignee: unassigned at sourceware dot org
          Reporter: lautgesetz at gmail dot com
  Target Milestone: ---

Created attachment 9939
  --> https://sourceware.org/bugzilla/attachment.cgi?id=9939&action=edit
test file

Consider the following file sorttest.txt, pre-sorted in Unicode codepoint
order:

!
ｽﾞざら
ｾｰﾘﾝｸﾞﾎﾞｰﾄは
ﾓｴ
￥
𐀎
𐀘
𐀛
𫛛
𫛞
𫛢
𫛭
𫛶
𫛸
𫟷
𫟼

If I run "LC_COLLATE=C sort sorttest.txt", using the hard-coded C locale, the
output is unchanged -- that is, it is sorted in codepoint order as expected.
However, if I run "LC_COLLATE=C.UTF-8 sort sorttest.txt" on Ubuntu, whose
C.UTF-8 locale file defines collation simply as codepoint order, I get the
following unexpected result:

𐀎
𐀘
𐀛
𫛛
𫛞
𫛢
𫛭
𫛶
𫛸
𫟷
𫟼
!
ｽﾞざら
ｾｰﾘﾝｸﾞﾎﾞｰﾄは
ﾓｴ
￥

To get more detail on what's going on, one can run:

$ LC_ALL=C.UTF-8 sort sorttest.txt | perl -CSAD -ne 'chomp; printf "%s\tU+%05X\n", $_, ord'
𐀎       U+1000E
𐀘       U+10018
𐀛       U+1001B
𫛛       U+2B6DB
𫛞       U+2B6DE
𫛢       U+2B6E2
𫛭       U+2B6ED
𫛶       U+2B6F6
𫛸       U+2B6F8
𫟷       U+2B7F7
𫟼       U+2B7FC
!       U+00021
ｽﾞざら    U+0FF7D
ｾｰﾘﾝｸﾞﾎﾞｰﾄは     U+0FF7E
ﾓｴ      U+0FF93
￥       U+0FFE5

Another example:

$ perl -CSAD -E 'for my $b (0, 0xF000, 0x10000) { for my $c (0x00, 0x01, 0x21) { $_ = $b + $c; printf "%s\tU+%05X\n", chr, $_} }' | LC_COLLATE=C.UTF-8 sort

        U+00000
𐀀       U+10000
𐀁       U+10001
𐀡       U+10021
        U+00001
!       U+00021
       U+0F000
       U+0F001
       U+0F021

The issue appears to be that codepoints above U+FFFF sort before all others,
except that U+0000 somehow still sorts first.
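
To make that hypothesis concrete, here is a small sketch of a sort key that
reproduces the order seen in both outputs above. It is only a descriptive
model of the observed results, not a claim about how glibc arrives at them;
the name observed_key is made up for illustration.

```python
def observed_key(cp):
    """Sort key modelling the *observed* C.UTF-8 order (hypothetical model).

    U+0000 sorts first, then codepoints above U+FFFF, then everything else,
    each group in ascending codepoint order.
    """
    return (cp != 0, cp <= 0xFFFF, cp)


if __name__ == "__main__":
    # Same nine codepoints as the second example above.
    cps = [b + c for b in (0, 0xF000, 0x10000) for c in (0x00, 0x01, 0x21)]
    for cp in sorted(cps, key=observed_key):
        print(f"U+{cp:05X}")
```

Sorting the first character of each line of sorttest.txt with the same key
gives the order shown in the first output as well.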

It's definitely not just the "sort" command that's broken. I first noticed this
issue in a PostgreSQL database that was using the C.UTF-8 locale's collation
order. Given the straightforwardness of the locale file in question
(/usr/share/i18n/locales/C on Ubuntu), it's hard to believe the fault lies
outside glibc. 
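
For anyone who wants to check this without going through sort or PostgreSQL,
the following is a minimal sketch that calls the C library's collation
directly via Python's locale module and compares it with plain codepoint
order. The helper names are made up for illustration, the chosen pairs are
just examples, and the result will depend on the installed glibc and on
whether a C.UTF-8 locale is available at all (the helper returns None if not).

```python
import locale


def sign(n):
    """Reduce a comparison result to -1, 0 or 1."""
    return (n > 0) - (n < 0)


def codepoint_cmp(a, b):
    """Compare two strings in Unicode codepoint order."""
    return (a > b) - (a < b)


def collate_cmp(a, b, name="C.UTF-8"):
    """Compare using the C library's collation for locale `name`.

    Returns -1, 0 or 1, or None if the locale cannot be selected.
    """
    old = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return None
    try:
        return sign(locale.strcoll(a, b))
    finally:
        locale.setlocale(locale.LC_COLLATE, old)  # restore previous locale


if __name__ == "__main__":
    # Each pair is (BMP character, character above U+FFFF).
    pairs = [("!", "\U00010000"), ("\uFFE5", "\U0001000E"), ("\x01", "\U00010000")]
    for a, b in pairs:
        print(f"U+{ord(a):05X} vs U+{ord(b):05X}: "
              f"codepoint={codepoint_cmp(a, b)} collate={collate_cmp(a, b)}")
```

If the two columns disagree for a pair, the collation order for that locale
differs from codepoint order on that system.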

The above commands were tested on Ubuntu 16.04 with glibc 2.23, but the same
issue has been reproduced on earlier and later versions of glibc (2.19, 2.24,
2.25).

