Sources Bugzilla – Bug 52
Repeated and nested subexpressions (reproducible in most other engines)
Last modified: 2012-03-08 04:36:24 UTC
Most engines, including glibc's have problems with the following regexes: (b(c)|d(e))* match against bcde --> \1=de, \3=e, but erroneously \2=c (a(b)*)* match against aba --> \1=a, but erroneously \2=b This is just a nit, because most other regex matching engines, either DFA-based or syntax-directed, have problems with this kind of regex. Note that while the second can be fixed by looking ahead of a OP_DUP_ASTERISK and preventively clearing an OP_OPEN_SUBEXP to which the OP_DUP_ASTERISK epsilon transits, the first is much harder and requires storing or building a "tree" of how the subexpressions nest (which would fix the second one as well). Paolo
I doubt this will be fixed.
Do you have a small testcase demonstrating this problem?
Created attachment 875 [details] Patch taken from bug comments.
Sorry. That attachment was meant for bug 14.
no feedback to testcase query (echo aba | sed -e 's/\(a\(b\)*\)*/\2/' does produce b, but it's not clear to me what should it print instead)
echo aba | sed -e 's/\(a\(b\)*\)*/\2/' should print nothing because the final match of \1 is "a" and \2 must be a substring of \1. Reopening...
... and leaving suspended.