Sources Bugzilla – Bug 52
Repeated and nested subexpressions (reproducible in most other engines)
Last modified: 2012-03-08 04:36:24 UTC
Most engines, including glibc's have problems with the following regexes:
(b(c)|d(e))* match against bcde --> \1=de, \3=e, but erroneously \2=c
(a(b)*)* match against aba --> \1=a, but erroneously \2=b
This is just a nit, because most other regex matching engines, either DFA-based
or syntax-directed, have problems with this kind of regex.
Note that while the second can be fixed by looking ahead of a OP_DUP_ASTERISK
and preventively clearing an OP_OPEN_SUBEXP to which the OP_DUP_ASTERISK
epsilon transits, the first is much harder and requires storing or building
a "tree" of how the subexpressions nest (which would fix the second one as
I doubt this will be fixed.
Do you have a small testcase demonstrating this problem?
Created attachment 875 [details]
Patch taken from bug comments.
Sorry. That attachment was meant for bug 14.
no feedback to testcase query (echo aba | sed -e 's/\(a\(b\)*\)*/\2/' does
produce b, but it's not clear to me what should it print instead)
echo aba | sed -e 's/\(a\(b\)*\)*/\2/'
should print nothing because the final match of \1 is "a" and \2 must be a substring of \1.
... and leaving suspended.