This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.


[Bug libc/23232] New: glibc fails to follow collate order in ranges like [a-z]

From: "binaryzebra at gmail dot com" <sourceware-bugzilla at sourceware dot org>
To: glibc-bugs at sourceware dot org
Date: Fri, 25 May 2018 05:43:28 +0000
Subject: [Bug libc/23232] New: glibc fails to follow collate order in ranges like [a-z]
Auto-submitted: auto-generated

https://sourceware.org/bugzilla/show_bug.cgi?id=23232

            Bug ID: 23232
           Summary: glibc fails to follow collate order in ranges like
                    [a-z]
           Product: glibc
           Version: unspecified
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: libc
          Assignee: unassigned at sourceware dot org
          Reporter: binaryzebra at gmail dot com
                CC: drepper.fsp at gmail dot com, eggert at gnu dot org,
                    meyering at gmail dot com
  Target Milestone: ---

First: Basic assumptions:

    All the 95 ASCII printable characters (32 to 126, or 0x20 to 0x7e) are:

        $ a=' !"#$%&'\''()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ'
        $ a=$a'[\]^_`abcdefghijklmnopqrstuvwxyz{|}~'

    And are sorted in the collate order given by the locale in use with:

        $ echo "$a" | grep -o . | sort | paste -sd ''

    If the locale is C, then:

        $ ( export LC_ALL=C; echo "$a" | grep -o . | sort | paste -sd '' )
         !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ
        [\]^_`abcdefghijklmnopqrstuvwxyz{|}~
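
    As a cross-check of the list above, the following sketch builds the same
    characters from their code points instead of hand-quoting them, and then
    confirms that sorting them in the C locale leaves the order unchanged.
    The variable names are illustrative, and `tr -d '\n'` stands in for
    `paste -sd ''` only for portability; `grep -o` is assumed to be available
    as in the commands above.

```shell
# Build the printable ASCII characters (codes 32..126) from their code points.
a=''
i=32
while [ "$i" -le 126 ]; do
    # printf '%o' gives the octal code; '%b' turns \0NNN into the character.
    a=$a$(printf '%b' "\\0$(printf '%o' "$i")")
    i=$((i + 1))
done

# Sort one character per line in the C locale and join the lines again.
sorted=$(export LC_ALL=C; printf '%s\n' "$a" | grep -o . | sort | tr -d '\n')

printf '%s characters\n' "${#a}"
[ "$sorted" = "$a" ] && echo "C collation matches code-point order"
```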

    In the C locale, glibc (sed) follows the collate order:

        $  ( export LC_ALL=C         ; echo "$a" | sed 's/[^a-z]//g' )
        abcdefghijklmnopqrstuvwxyz

    In another locale, such as en_US.utf8, the collate order (which seems to be
    documented here: http://collation-charts.org/icu442/icu442-en.html) is:

        $ ( export LC_ALL=en_US.utf8; echo "$a" | grep -o . | sort | paste -sd "" )
        `^~<=>| _-,;:!?/.'"()[]{}@$*\&#%+0123456789
        aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ

The problem seems to be that glibc does not use this collate order in ranges:

        $ ( export LC_ALL=en_US.utf8; echo "$a" | sed 's/[^a-d]//g' )
        abcd

    The result should have been aAbBcCd if the same collate order as above
    were used.
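
    To make the expected result concrete, here is a minimal sketch that
    derives a range from the order `sort` reports, rather than from the regex
    engine. The helper name `collate_range` is hypothetical, and the approach
    assumes that `sort` reflects the locale's collation order for single
    characters.

```shell
# Hypothetical helper: print the characters of $3 that fall between $1 and $2
# (inclusive) in the order that sort reports for the current locale.
collate_range() {
    printf '%s\n' "$3" | grep -o . | sort |
        awk -v lo="$1" -v hi="$2" '
            $0 == lo { on = 1 }          # start printing at the low end
            on       { printf "%s", $0 }
            $0 == hi { on = 0 }          # stop after the high end
            END      { print "" }'
}

# In the C locale the result agrees with what sed gives for [a-d].
( export LC_ALL=C; collate_range a d 'aAbBcCdDeE' )
```

    Run under en_US.utf8 with the sort order quoted above, the same call
    would be expected to print aAbBcCd, which is the result this report
    argues [a-d] should produce.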

It seems that glibc also limits the match to lowercase characters:

        $ a='789aAáÁàÀâÂåÅäÄãÃªæÆbBcCçÇdDðÐeEéÉèÈêÊëËf'
        $ ( export LC_ALL=en_US.utf8; echo "$a" | sed 's/[^a-d]//g' )
        aáàâåäãªæbcçd

And that only "Latin" characters are allowed (not Greek, not Cyrillic):

        $ a='aAbBcCdDeEfFαβγδεζηθικλμνΰабвгдежзи'
        $ ( export LC_ALL=en_US.utf8; echo "$a" | sed 's/[^a-d]//g' )
        abcd

Mixed lowercase-uppercase ranges seem to allow all lowercase and then some
uppercase:

        $ a='aAáÁàÀâÂåÅäÄãÃªæÆbBcCçÇdDðÐeEéÉèÈêÊëËfFgGhHiIíÍìÌîÎïÏjJ'
        $ a=$a'kKlLmMnNñÑoOóÓòÒôÔöÖõÕøØºpPqQrRsSßtTuUúÚùÙûÛüÜvVwWxXyYýÝÿzZ'
        $ ( export LC_ALL=en_US.utf8; echo "$a" | sed 's/[^a-D]//g' )
        aAáÁàÀâÂåÅäÄãÃªæÆbBcCçÇdDðeéèêëfghiíìîïjklmnñoóòôöõøºpqrsßtuúùûüvwxyýÿz
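
The experiments above can be repeated across locales with a small wrapper
around the same sed command. This is only a sketch: the helper name
`probe_range` is hypothetical, and en_US.utf8 is assumed to be installed (if
it is not, the second line of output will not be meaningful).

```shell
# Hypothetical helper: keep only the characters of $3 that the bracket
# range $2 matches under locale $1, using the same sed command as above.
probe_range() {
    ( export LC_ALL="$1"; printf '%s\n' "$3" | sed "s/[^$2]//g" )
}

# Compare locales side by side for the same range and candidate characters.
for loc in C en_US.utf8; do
    printf '%-12s %s\n' "$loc" "$(probe_range "$loc" a-d 'aAbBcCdDeE')"
done
```

Note that a mixed range such as a-D cannot be probed this way in the C
locale, where its end point sorts before its start point and sed may reject
it as an invalid range.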




If this is the expected way in which glibc ranges should work:

       - Where is it clearly documented?

If it is not:

       - What should be the correct collation order to use?


Follow-Ups:
- [Bug libc/23232] glibc fails to follow collate order in ranges like [a-z]
  - From: fweimer at redhat dot com
- [Bug libc/23232] glibc fails to follow collate order in ranges like [a-z]
  - From: fweimer at redhat dot com
- [Bug regex/23232] glibc fails to follow collate order in ranges like [a-z]
  - From: jsm28 at gcc dot gnu.org
