Bug 52

Summary: Repeated and nested subexpressions (reproducible in most other engines)
Product: glibc Reporter: Paolo Bonzini <bonzini>
Component: regexAssignee: GOTO Masanori <gotom>
Status: NEW ---    
Severity: minor CC: afaq.ahmed, bugdal, carlos, eblake, eggert, glibc-bugs
Priority: P3    
Version: unspecified   
Target Milestone: ---   
Host: Target:
Build: Last reconfirmed:
Attachments: Patch taken from bug comments.

Description Paolo Bonzini 2004-03-01 15:48:08 UTC
Most engines, including glibc's have problems with the following regexes:

(b(c)|d(e))*    match against bcde  -->  \1=de, \3=e, but erroneously \2=c
(a(b)*)*        match against aba   -->  \1=a, but erroneously \2=b

This is just a nit, because most other regex matching engines, either DFA-based 
or syntax-directed, have problems with this kind of regex.

Note that while the second can be fixed by looking ahead of a OP_DUP_ASTERISK 
and preventively clearing an OP_OPEN_SUBEXP to which the OP_DUP_ASTERISK 
epsilon transits, the first is much harder and requires storing or building 
a "tree" of how the subexpressions nest (which would fix the second one as 
well).

Paolo
Comment 1 Paolo Bonzini 2004-11-13 12:55:29 UTC
I doubt this will be fixed.
Comment 2 Dwayne Grant McConnell 2006-02-21 15:54:26 UTC
Do you have a small testcase demonstrating this problem?
Comment 3 Dwayne Grant McConnell 2006-02-21 16:28:01 UTC
Created attachment 875 [details]
Patch taken from bug comments.
Comment 4 Dwayne Grant McConnell 2006-02-21 16:29:11 UTC
Sorry. That attachment was meant for bug 14.
Comment 5 Petr Baudis 2010-06-01 02:06:41 UTC
no feedback to testcase query (echo aba | sed -e 's/\(a\(b\)*\)*/\2/' does
produce b, but it's not clear to me what should it print instead)
Comment 6 Paolo Bonzini 2010-06-01 06:26:33 UTC
echo aba | sed -e 's/\(a\(b\)*\)*/\2/'

should print nothing because the final match of \1 is "a" and \2 must be a substring of \1.

Reopening...
Comment 7 Paolo Bonzini 2010-06-01 06:27:16 UTC
... and leaving suspended.
Comment 8 Eric Blake 2013-07-03 21:25:43 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=876766 is another report of this bug.
Comment 9 Rich Felker 2013-07-04 00:28:23 UTC
I think the newly linked Red Hat bug report is more convincing that this needs to be fixed. What is the justification for the "SUSPENDED" status?
Comment 10 Carlos O'Donell 2013-07-04 03:38:29 UTC
(In reply to Rich Felker from comment #9)
> I think the newly linked Red Hat bug report is more convincing that this
> needs to be fixed. What is the justification for the "SUSPENDED" status?

There is no justification, moved to NEW.
Comment 11 Paolo Bonzini 2013-07-04 07:21:56 UTC
The justification for the "suspended" state is that this would be very complicated to fix and wouldn't really add anything to the quality of the implementation.

Even if the "(a(b)*)*" case would not be hard to fix, I'm not sure we can say the same of the backreference testcase in the RH bug ('(a(b)*)*\2' matched against 'abab') or the more complicated '(a(b)*)*x\1\2' matched against 'abaxa'.

The RH bugzilla was opened by Eric doesn't really add anything to the urgency of this bug; Fedora bugs that also exist upstream can be closed liberally, and that's what I did.

The grep bug on Savannah (https://savannah.gnu.org/bugs/?37737) might add something, but it is not clear if the user actually encountered it in a real-world usecase.  The same "bug" is present in hardly every regular expression matcher, and I would suggest that the Austin group gives more leeway to implementations.  For example, the following rules could work:

- it is undefined _which_ occurrence of the sub-RE is captured by a parenthesized group (and matched in backreferences) if the subgroup, or any of its parents, is quantified with + * {};

- the backreference should only match the empty string only if the corresponding sub-RE can be empty, or if the corresponding parenthesized group, or any of its parents, is quantified with * or {0,...}.
Comment 12 Rich Felker 2013-07-04 07:34:49 UTC
> The same "bug" is present in hardly every regular expression matcher,

Citation needed. GNU regex is the only implementation I've seen with this bug that purports to be an implementation of POSIX RE.

The implementation we have in musl, which is based on TRE, passes the tests in this bug report just fine.

As for the RH bug report, the reason I saw it as adding urgency is that it shows how this bug can affect whether or not the regex matches at all, as opposed to "just" a pedantic matter of extraneous data in the resulting subexpression match array.
Comment 13 Paolo Bonzini 2013-07-04 07:42:51 UTC
At the time I reported the bug, I'm pretty sure I checked a few proprietary Unices.  Also, even though it is not POSIX RE, Perl regular expressions have the same behavior.  While it is true that backreferences can make this bug affect the overall outcome of the match, it is still a very weird regular expression.

Thanks for mentioning musl's matcher, I'll check it out.  Matching backreferences at a decent speed _and_ obeying the POSIX leftmost/longest rules is very hard to do.  The glibc matcher unfortunately is very badly documented and the backreference part of it is basically black magic. :(
Comment 14 Rich Felker 2013-07-04 07:51:28 UTC
On Thu, Jul 04, 2013 at 07:42:51AM +0000, bonzini at gnu dot org wrote:
> Unices.  Also, even though it is not POSIX RE, Perl regular expressions have
> the same behavior.

Perl RE's are so different from POSIX that I would not consider them
in this discussion.

> Thanks for mentioning musl's matcher, I'll check it out.

The upstream source we got it from, TRE, is where the magic came from.
The author has some good documentation on how it works.

> Matching backreferences at a decent speed _and_ obeying the POSIX
> leftmost/longest rules is very hard to do.

Currently the speed is somewhat less than glibc, but not horrible by
any means. A large part of the cost is probably mbtowc. I'm planning
to eventually adapt it to compile the UTF-8 RE directly to a
byte-based state machine with actual character identity decoding only
when matching against character classes, but it's low priority right
now.
Comment 15 Paolo Bonzini 2013-07-04 08:04:52 UTC
I'm curious about the speed on testcases with backreferences.  The non-backreference part of the bug would be relatively easy to fix in glibc too.

The musl source says backreferences "can be spectacularly expensive", and looking at the code I expect that to be true.

Unfortunately, GNU sed uses the GNU regex API, not the POSIX one, but it has some "interesting" examples in its testsuite.  Perhaps you can try those tests on another sed, compiled against both glibc and musl.
Comment 16 Paul Eggert 2013-07-04 15:05:02 UTC
The same issue is present in Solaris 9 and Solaris 11.1 POSIX grep (/usr/xpg4/bin/grep), the only non-GNU greps I have easy access to.

$ echo abab | /usr/xpg4/bin/grep '\(a\(b\)*\)*\2'
abab

Perhaps someone should file a bug report with POSIX folks?  It's hard to believe that the current requirement is intended, if so few implementations conform.