10290 – using REG_ICASE can break ranges

Status:	NEW

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	regex
Version:	2.9

Importance:	P2 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

Reported:	2009-06-17 15:47 UTC by Jeffrey Bastian
Modified:	2014-07-01 16:36 UTC
CC List:	2 users

Flags:	fweimer: security-

Attachments
test case (301 bytes, text/plain) 2009-06-17 15:48 UTC, Jeffrey Bastian

Description Jeffrey Bastian 2009-06-17 15:47:05 UTC

Using a regular expression range like [C-a] works fine if compiled with
regcomp() with just the REG_EXTENDED flag, but if the REG_ICASE flag is added
too, regcomp() returns an error "Invalid range end".

Testing other ranges with REG_ICASE reveals:
    [A-Z^-z] is invalid: Invalid range end (11)
    [A-Z^_`a-z] is ok
    [C-a] is invalid: Invalid range end (11)
    [C-f] is ok
    [_-a] is invalid: Invalid range end (11)
    [<-a] is ok
    [z-{] is ok
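
The comparison above can be reproduced with a small program along the following lines. This is only a sketch, not the attached test case; the helper names try_compile and report_patterns are made up for illustration, and since setlocale() is never called the patterns are compiled in the default "C" locale.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: compile `pattern` with `cflags` and return regcomp()'s
   result. On failure the regerror() text is copied into `msg`; on success
   `msg` is set to the empty string. */
int try_compile(const char *pattern, int cflags, char *msg, size_t msglen)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && msg != NULL && msglen > 0);
    msg[0] = '\0';
    rc = regcomp(&re, pattern, cflags);
    if (rc != 0)
        regerror(rc, &re, msg, msglen);
    else
        regfree(&re);
    return rc;
}

/* Hypothetical driver: try each bracket expression from the report with and
   without REG_ICASE, print one line per pattern, and return how many
   patterns compile without REG_ICASE but are rejected with it. */
int report_patterns(FILE *out)
{
    static const char *const patterns[] = {
        "[A-Z^-z]", "[A-Z^_`a-z]", "[C-a]", "[C-f]", "[_-a]", "[<-a]", "[z-{]"
    };
    char plain_msg[128], icase_msg[128];
    int broken = 0;
    size_t i;

    for (i = 0; i < sizeof patterns / sizeof patterns[0]; i++) {
        int plain = try_compile(patterns[i], REG_EXTENDED,
                                plain_msg, sizeof plain_msg);
        int icase = try_compile(patterns[i], REG_EXTENDED | REG_ICASE,
                                icase_msg, sizeof icase_msg);
        fprintf(out, "%-12s plain: %-20s icase: %s\n", patterns[i],
                plain == 0 ? "ok" : plain_msg,
                icase == 0 ? "ok" : icase_msg);
        if (plain == 0 && icase != 0)
            broken++;
    }
    return broken;
}
```

Calling report_patterns(stdout) from main() prints one line per pattern. A non-zero return value means at least one range was accepted without REG_ICASE and rejected with it.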

It appears that regcomp() converts the range endpoints to upper case when the
REG_ICASE flag is used. Thus [C-a] becomes [C-A], and since A comes before C,
the range is invalid. Likewise, in locales that match ASCII, ^ comes before z
but after Z, so [A-Z^-z] becomes invalid, and _ comes after A but before a, so
[_-a] becomes invalid.
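
The hypothesis can be checked against the results listed earlier with a tiny model. This is a sketch of the suspected behaviour only, not glibc's actual code; the function names are invented, and it assumes an ASCII-compatible encoding with byte-order collation as in the "C" locale.

```c
#include <ctype.h>

/* A range lo-hi is acceptable when lo does not sort after hi
   (plain byte order, as in the "C" locale). */
int range_ok(unsigned char lo, unsigned char hi)
{
    return lo <= hi;
}

/* Model of the suspected REG_ICASE behaviour: both endpoints are
   upper-cased before the same ordering check is applied. */
int range_ok_upcased(unsigned char lo, unsigned char hi)
{
    return toupper(lo) <= toupper(hi);
}
```

Under this model C-a, _-a and ^-z are rejected once upper-cased while C-f, <-a and z-{ stay valid, which agrees with the observed results.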

If this is not considered a bug, then at the very least, the regex(3) man page
should note the side-effects of using REG_ICASE.

Comment 1 Jeffrey Bastian 2009-06-17 15:48:14 UTC

Created attachment 4004
test case

Comment 2 Paolo Bonzini 2010-09-21 15:10:51 UTC

Note that [C-a] is invalid anyway:
$ sed -n '/[C-a]/p' /dev/null
sed: -e expression #1, char 7: Invalid range end

However, [c-A] is not, and it shows the bug:
$ sed -n '/[c-A]/p' /dev/null
$ sed -n '/[c-A]/I p'
sed: -e expression #1, char 9: Invalid range end

Comment 3 Eric Blake 2010-09-21 15:30:07 UTC

In which locale?  In the POSIX locale with an ASCII (or similar) encoding, [C-a]
is well defined:

$ LC_ALL=C sed -n '/[C-a]/p' /dev/null
$ LC_ALL=en_US.UTF-8 sed -n '/[C-a]/p' /dev/null
sed: -e expression #1, char 7: Invalid range end

And since range expressions are only well-defined in the POSIX locale, the point
still remains that the case-insensitive flag is messing things up:

$ LC_ALL=C sed -n '/[C-a]/I p' /dev/null
sed: -e expression #1, char 9: Invalid range end

Also, the resolution of this bug should consider
http://sources.redhat.com/bugzilla/show_bug.cgi?id=12045, which is unrelated to
the REG_ICASE flag.
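
The same locale comparison can be made directly against regcomp() instead of through sed. The following is a rough sketch; compile_in_locale is a made-up helper, and using -1 as the "locale unavailable" value assumes that regcomp() never returns a negative code.

```c
#include <locale.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: switch to `locale`, compile `pattern` with `cflags`,
   restore the previous locale, and return regcomp()'s result.
   Returns -1 if the locale cannot be selected (assumed not to collide
   with any regcomp() error code). */
int compile_in_locale(const char *locale, const char *pattern, int cflags)
{
    char saved[512];
    const char *current = setlocale(LC_ALL, NULL);
    regex_t re;
    int rc;

    /* Copy the current locale name; a later setlocale() call may
       overwrite the string that `current` points to. */
    if (current == NULL || strlen(current) >= sizeof saved)
        return -1;
    strcpy(saved, current);

    if (setlocale(LC_ALL, locale) == NULL)
        return -1;

    rc = regcomp(&re, pattern, cflags);
    if (rc == 0)
        regfree(&re);

    setlocale(LC_ALL, saved);
    return rc;
}
```

For example, comparing compile_in_locale("C", "[C-a]", REG_EXTENDED) with compile_in_locale("C", "[C-a]", REG_EXTENDED | REG_ICASE) isolates the effect of the flag, and passing "en_US.UTF-8" instead (if that locale is installed) shows the locale dependence.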

Comment 4 Paolo Bonzini 2010-09-21 15:58:26 UTC

I was using LC_ALL=en_US.UTF-8 in comment #2.
