Bug 10290 - using REG_ICASE can break ranges
Status: NEW
Alias: None
Product: glibc
Classification: Unclassified
Component: regex
Version: 2.9
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-06-17 15:47 UTC by Jeffrey Bastian
Modified: 2014-07-01 16:36 UTC

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments
test case (301 bytes, text/plain)
2009-06-17 15:48 UTC, Jeffrey Bastian

Description Jeffrey Bastian 2009-06-17 15:47:05 UTC
Using a regular expression range like [C-a] works fine if compiled with
regcomp() with just the REG_EXTENDED flag, but if the REG_ICASE flag is added
too, regcomp() returns an error "Invalid range end".

Testing other ranges with REG_ICASE reveals:
    [A-Z^-z] is invalid: Invalid range end (11)
    [A-Z^_`a-z] is ok
    [C-a] is invalid: Invalid range end (11)
    [C-f] is ok
    [_-a] is invalid: Invalid range end (11)
    [<-a] is ok
    [z-{] is ok
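
A minimal sketch of how results like those above can be produced. This is not the attached test case; report_range() is a hypothetical helper name, and the output format only imitates the list above. It uses only the POSIX regcomp(), regerror() and regfree() calls, and the message text and error number depend on the C library in use.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: compile `pattern` with `cflags`, print the outcome
   in the style of the list above, and return regcomp()'s result code
   (0 on success, nonzero on error). */
static int report_range(const char *pattern, int cflags)
{
    regex_t re;
    char msg[128];
    int rc = regcomp(&re, pattern, cflags);

    if (rc != 0) {
        /* Translate the error code into a human-readable message. */
        regerror(rc, &re, msg, sizeof msg);
        printf("    %s is invalid: %s (%d)\n", pattern, msg, rc);
    } else {
        printf("    %s is ok\n", pattern);
        regfree(&re);   /* only a successfully compiled regex is freed */
    }
    return rc;
}
```

Calling report_range("[C-a]", REG_EXTENDED) and then report_range("[C-a]", REG_EXTENDED | REG_ICASE) from main() should show whether adding REG_ICASE changes the outcome on a given system.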

It appears that regcomp() upper-cases the range endpoints when the REG_ICASE
flag is used, so [C-a] becomes [C-A], and since A comes before C, the range is
invalid. Likewise, in locales whose ordering matches ASCII, ^ comes before z
but after Z, so [A-Z^-z] becomes invalid, and _ comes after A but before a, so
[_-a] becomes invalid.
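
The hypothesis above can be modelled in a few lines. This is a sketch of the suspected behaviour only, not glibc's actual code: range_valid() and range_valid_icase() are hypothetical names, and the model assumes an ASCII encoding in the C locale, where endpoints are ordered by code point.

```c
#include <ctype.h>
#include <stdbool.h>

/* Assumed model: a range lo-hi is valid when lo does not sort after hi
   (code point order, as in the C locale with ASCII). */
static bool range_valid(unsigned char lo, unsigned char hi)
{
    return lo <= hi;
}

/* Hypothesised REG_ICASE behaviour: both endpoints are upper-cased
   before the same ordering check is applied. */
static bool range_valid_icase(unsigned char lo, unsigned char hi)
{
    return range_valid((unsigned char)toupper(lo),
                       (unsigned char)toupper(hi));
}
```

Under this model range_valid('C', 'a') holds but range_valid_icase('C', 'a') does not, and the other ranges listed above ([C-f], [_-a], [<-a], [z-{], and the ^-z part of [A-Z^-z]) give the same ok/invalid results as reported, which is consistent with the hypothesis but does not prove it.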

If this is not considered a bug, then at the very least, the regex(3) man page
should note the side-effects of using REG_ICASE.
Comment 1 Jeffrey Bastian 2009-06-17 15:48:14 UTC
Created attachment 4004
test case
Comment 2 Paolo Bonzini 2010-09-21 15:10:51 UTC
Note [C-a] is invalid anyway:
$ sed -n '/[C-a]/p' /dev/null
sed: -e expression #1, char 7: Invalid range end

However [c-A] is not and shows the bug:
$ sed -n '/[c-A]/p' /dev/null
$ sed -n '/[c-A]/I p'
sed: -e expression #1, char 9: Invalid range end
Comment 3 Eric Blake 2010-09-21 15:30:07 UTC
In which locale?  In the POSIX locale with an ASCII (or similar) encoding, [C-a]
is well defined:

$ LC_ALL=C sed -n '/[C-a]/p' /dev/null
$ LC_ALL=en_US.UTF-8 sed -n '/[C-a]/p' /dev/null
sed: -e expression #1, char 7: Invalid range end

And since range expressions are only well-defined in the POSIX locale, the point
still remains that the case-insensitive flag is messing things up:

$ LC_ALL=C sed -n '/[C-a]/I p' /dev/null
sed: -e expression #1, char 9: Invalid range end

Also, the resolution of this bug should consider
http://sources.redhat.com/bugzilla/show_bug.cgi?id=12045, which is unrelated to
the REG_ICASE flag.
Comment 4 Paolo Bonzini 2010-09-21 15:58:26 UTC
I was using LC_ALL=en_US.UTF-8 in comment #2.