This is the mail archive of the
glibc-bugs@sourceware.org
mailing list for the glibc project.
[Bug libc/23232] New: glibc fails to follow collate order in ranges like [a-z]
- From: "binaryzebra at gmail dot com" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sourceware dot org
- Date: Fri, 25 May 2018 05:43:28 +0000
- Subject: [Bug libc/23232] New: glibc fails to follow collate order in ranges like [a-z]
- Auto-submitted: auto-generated
https://sourceware.org/bugzilla/show_bug.cgi?id=23232
Bug ID: 23232
Summary: glibc fails to follow collate order in ranges like [a-z]
Product: glibc
Version: unspecified
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: libc
Assignee: unassigned at sourceware dot org
Reporter: binaryzebra at gmail dot com
CC: drepper.fsp at gmail dot com, eggert at gnu dot org,
meyering at gmail dot com
Target Milestone: ---
First: Basic assumptions:
All 95 printable ASCII characters (32 to 126, or 0x20 to 0x7e) are:
$ a=' !"#$%&'\''()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ'
$ a=$a'[\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
And are sorted in the collate order given by the locale in use with:
$ echo "$a" | grep -o . | sort | paste -sd ''
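The input string can also be generated from the code points instead of being typed by hand, which avoids quoting mistakes. This is only a minimal sketch: `printable_ascii` is a hypothetical helper name, not something from the report.

```shell
# Hypothetical helper: emit every ASCII character from 0x20 to 0x7e.
printable_ascii() {
    i=32
    while [ "$i" -le 126 ]; do
        # The inner printf produces the octal code (e.g. 101 for "A");
        # the outer printf expands the resulting \ooo escape.
        printf "\\$(printf '%03o' "$i")"
        i=$((i + 1))
    done
}

a=$(printable_ascii)
# Print the number of characters collected.
printf '%s\n' "${#a}"
```

The resulting `$a` can be fed to the same `grep -o . | sort | paste` pipeline as above.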
If the locale is C, then:
$ ( export LC_ALL=C ; echo "$a" | grep -o . | sort | paste -sd '' )
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ
[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
In the C locale, glibc (sed) follows the collate order:
$ ( export LC_ALL=C ; echo "$a" | sed 's/[^a-z]//g' )
abcdefghijklmnopqrstuvwxyz
In another locale, such as en_US.utf8, the collate order (which seems to be
documented here: http://collation-charts.org/icu442/icu442-en.html) is:
$ ( export LC_ALL=en_US.utf8; echo "$a" | grep -o . | sort | paste -sd "" )
`^~<=>| _-,;:!?/.'"()[]{}@$*\&#%+0123456789
aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
The problem seems to be that glibc does not use this collate order:
$ ( export LC_ALL=en_US.utf8; echo "$a" | sed 's/[^a-d]//g' )
abcd
It should have been aAbBcCd if the same collate order as above were used.
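The expected result can be derived mechanically from the `sort` output rather than read off by eye. The following is a sketch under stated assumptions: `collate_range` is a hypothetical helper, it relies on `grep -o` as used above, and it is only meant for plain letters as range endpoints.

```shell
# Hypothetical helper: print the characters of "$3" that lie between "$1" and
# "$2" (inclusive) in the sort order of the current locale.
collate_range() {
    printf '%s\n' "$3" | grep -o . | sort |
        awk -v lo="$1" -v hi="$2" '
            $0 == lo { inside = 1 }        # start of the range
            inside   { printf "%s", $0 }   # emit while inside the range
            $0 == hi { exit }              # end of the range (END still runs)
            END      { print "" }'
}

# C locale: uppercase sorts before lowercase, so only a, b, c, d remain.
( export LC_ALL=C; collate_range a d 'aAbBcCdD' )
```

Running it as `( export LC_ALL=en_US.utf8; collate_range a d "$a" )` would be expected to give aAbBcCd, assuming the locale is installed and sorts as shown in the output quoted above.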
It seems that glibc also limits the matched characters to lowercase:
$ a='789aAáÁàÀâÂåÅäÄãªæÆbBcCçÇdDðÐeEéÉèÈêÊëËf'
$ ( export LC_ALL=en_US.utf8; echo "$a" | sed 's/[^a-d]//g' )
aáàâåäãªæbcçd
And that only "Latin" characters are allowed (not Greek, not Cyrillic):
$ a='aAbBcCdDeEfFαβγδεζηθικλμνΰабвгдежзи'
$ ( export LC_ALL=en_US.utf8; echo "$a" | sed 's/[^a-d]//g' )
abcd
Mixed lowercase-uppercase ranges seem to allow all lowercase and then some
uppercase:
$ a='aAáÁàÀâÂåÅäÄãêæÆbBcCçÇdDðÐeEéÉèÈêÊëËfFgGhHiIíÍìÌîÎïÏjJ'
$ a=$a'kKlLmMnNñÑoOóÓòÒôÔöÖõÕøØºpPqQrRsSßtTuUúÚùÙûÛüÜvVwWxXyYýÝÿzZ'
$ ( export LC_ALL=en_US.utf8; echo "$a" | sed 's/[^a-D]//g' )
aAáÁàÀâÂåÅäÄãêæÆbBcCçÇdDðeéèêëfghiíìîïjklmnñoóòôöõøºpqrsßtuúùûüvwxyýÿz
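To compare locales side by side, the `sed` tests above can be wrapped in a small function. This is only a sketch: `range_matches` is a hypothetical helper name, and en_US.utf8 is assumed to be installed (if it is not, glibc may silently fall back to the C locale).

```shell
# Hypothetical helper: keep only the characters of "$3" that sed's bracket
# range "$2" matches under locale "$1".
range_matches() {
    printf '%s\n' "$3" | LC_ALL="$1" sed "s/[^$2]//g"
}

s='aAbBcCdDeEfF'
for loc in C en_US.utf8; do   # en_US.utf8 is assumed to be installed
    printf '%-12s %s\n' "$loc" "$(range_matches "$loc" a-d "$s")"
done
```

The same loop can be repeated with `a-D` or any other range to see where the results of the two locales diverge.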
If this is the expected way in which glibc ranges should work:
- Where is it clearly documented?
If it is not:
- What should be the correct collation order to use?