Bug 52

Summary: Repeated and nested subexpressions (reproducible in most other engines)
Product: glibc Reporter: Paolo Bonzini <bonzini>
Component: regexAssignee: GOTO Masanori <gotom>
Status: SUSPENDED ---    
Severity: minor CC: carlos, glibc-bugs
Priority: P3    
Version: unspecified   
Target Milestone: ---   
Host: Target:
Build: Last reconfirmed:
Attachments: Patch taken from bug comments.

Description Paolo Bonzini 2004-03-01 15:48:08 UTC
Most engines, including glibc's have problems with the following regexes:

(b(c)|d(e))*    match against bcde  -->  \1=de, \3=e, but erroneously \2=c
(a(b)*)*        match against aba   -->  \1=a, but erroneously \2=b

This is just a nit, because most other regex matching engines, either DFA-based 
or syntax-directed, have problems with this kind of regex.

Note that while the second can be fixed by looking ahead of a OP_DUP_ASTERISK 
and preventively clearing an OP_OPEN_SUBEXP to which the OP_DUP_ASTERISK 
epsilon transits, the first is much harder and requires storing or building 
a "tree" of how the subexpressions nest (which would fix the second one as 
well).

Paolo
Comment 1 Paolo Bonzini 2004-11-13 12:55:29 UTC
I doubt this will be fixed.
Comment 2 Dwayne Grant McConnell 2006-02-21 15:54:26 UTC
Do you have a small testcase demonstrating this problem?
Comment 3 Dwayne Grant McConnell 2006-02-21 16:28:01 UTC
Created attachment 875 [details]
Patch taken from bug comments.
Comment 4 Dwayne Grant McConnell 2006-02-21 16:29:11 UTC
Sorry. That attachment was meant for bug 14.
Comment 5 Petr Baudis 2010-06-01 02:06:41 UTC
no feedback to testcase query (echo aba | sed -e 's/\(a\(b\)*\)*/\2/' does
produce b, but it's not clear to me what should it print instead)
Comment 6 Paolo Bonzini 2010-06-01 06:26:33 UTC
echo aba | sed -e 's/\(a\(b\)*\)*/\2/'

should print nothing because the final match of \1 is "a" and \2 must be a substring of \1.

Reopening...
Comment 7 Paolo Bonzini 2010-06-01 06:27:16 UTC
... and leaving suspended.