This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.



[Bug libc/23232] New: glibc fails to follow collate order in ranges like [a-z]


https://sourceware.org/bugzilla/show_bug.cgi?id=23232

            Bug ID: 23232
           Summary: glibc fails to follow collate order in ranges like
                    [a-z]
           Product: glibc
           Version: unspecified
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: libc
          Assignee: unassigned at sourceware dot org
          Reporter: binaryzebra at gmail dot com
                CC: drepper.fsp at gmail dot com, eggert at gnu dot org,
                    meyering at gmail dot com
  Target Milestone: ---

First, some basic assumptions:

    The 95 ASCII printable characters (32 to 126, or 0x20 to 0x7e) are:

        $ a=' !"#$%&'\''()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ'
        $ a=$a'[\]^_`abcdefghijklmnopqrstuvwxyz{|}~'

    They can be sorted in the collation order of the locale in use with:

        $ echo "$a" | grep -o . | sort | paste -sd ''

    If the locale is C, then:

        $ ( export LC_ALL=C         ; echo "$a" | grep -o . | sort | paste -sd '' )
         !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ
        [\]^_`abcdefghijklmnopqrstuvwxyz{|}~

    In the C locale, glibc (sed) follows the collate order:

        $  ( export LC_ALL=C         ; echo "$a" | sed 's/[^a-z]//g' )
        abcdefghijklmnopqrstuvwxyz
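
The same check can be wrapped in a small function so that different ranges and locales are easy to compare. This is only a sketch: the name range_members is hypothetical, and it assumes a sed that uses the regex engine under discussion.

```shell
# Hypothetical helper: print the characters of "$2" that match the
# bracket expression "$1" under the locale given in "$3" (default: C).
range_members() {
    ( LC_ALL=${3:-C}; export LC_ALL
      printf '%s\n' "$2" | sed "s/[^$1]//g" )
}

a='789aAbBcCdDeEfF'
range_members a-d "$a"    # C locale: abcd
range_members A-D "$a"    # C locale: ABCD
```

Passing a locale name as the third argument repeats the test under that locale, provided it is installed on the system.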

    In another locale, such as en_US.utf8, the collation order (which seems to be
    documented at http://collation-charts.org/icu442/icu442-en.html) is:

        $ ( export LC_ALL=en_US.utf8; echo "$a" | grep -o . | sort | paste -sd "" )
        `^~<=>| _-,;:!?/.'"()[]{}@$*\&#%+0123456789
        aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ

The problem seems to be that glibc does not use this collation order for ranges:

        $ ( export LC_ALL=en_US.utf8; echo "$a" | sed 's/[^a-d]//g' )
        abcd

     It should have been aAbBcCd if the same collate order as above were used.
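
To make the expected result explicit, the members of a range can be derived from the output of sort(1) by cutting out the slice between the two endpoints. The helper below is a hypothetical sketch, not part of any tool; it assumes both endpoints are plain letters that occur in the input string.

```shell
# Hypothetical helper: list the characters of "$3" that lie between "$1" and
# "$2" in the order produced by sort(1) under locale "$4" (default: C).
# Assumes "$1" and "$2" are plain letters that both occur in "$3".
collation_slice() {
    ( LC_ALL=${4:-C}; export LC_ALL
      { printf '%s\n' "$3" | grep -o . | sort | tr -d '\n'; echo; } |
          sed "s/.*\($1.*$2\).*/\1/" )
}

a='aAbBcCdDeEfF'
collation_slice a d "$a"              # C locale: abcd
collation_slice a d "$a" en_US.utf8   # aAbBcCd, if sort orders letters as shown above
```

Comparing this slice with what sed keeps for [a-d] under the same locale shows directly whether the range follows the collation order.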

It seems that glibc also limits the characters to be lowercase:

        $ a='789aAáÁàÀâÂåÅäÄãÃªæÆbBcCçÇdDðÐeEéÉèÈêÊëËf'
        $ ( export LC_ALL=en_US.utf8; echo "$a" | sed 's/[^a-d]//g' )
        aáàâåäãªæbcçd

And that only "Latin" characters are allowed (not Greek, not Cyrillic):

        $ a='aAbBcCdDeEfFαβγδεζηθικλμνΰабвгдежзи'
        $ ( export LC_ALL=en_US.utf8; echo "$a" | sed 's/[^a-d]//g' )
        abcd

Mixed lowercase-uppercase ranges seem to allow all lowercase and then some
uppercase:

        $ a='aAáÁàÀâÂåÅäÄãÃªæÆbBcCçÇdDðÐeEéÉèÈêÊëËfFgGhHiIíÍìÌîÎïÏjJ'
        $ a=$a'kKlLmMnNñÑoOóÓòÒôÔöÖõÕøØºpPqQrRsSßtTuUúÚùÙûÛüÜvVwWxXyYýÝÿzZ'
        $ ( export LC_ALL=en_US.utf8; echo "$a" | sed 's/[^a-D]//g' )
        aAáÁàÀâÂåÅäÄãÃªæÆbBcCçÇdDðeéèêëfghiíìîïjklmnñoóòôöõøºpqrsßtuúùûüvwxyýÿz




If this is the expected way in which glibc ranges should work:

       - Where is it clearly documented?

If it is not:

       - What should be the correct collation order to use?

