Most engines, including glibc's have problems with the following regexes: (b(c)|d(e))* match against bcde --> \1=de, \3=e, but erroneously \2=c (a(b)*)* match against aba --> \1=a, but erroneously \2=b This is just a nit, because most other regex matching engines, either DFA-based or syntax-directed, have problems with this kind of regex. Note that while the second can be fixed by looking ahead of a OP_DUP_ASTERISK and preventively clearing an OP_OPEN_SUBEXP to which the OP_DUP_ASTERISK epsilon transits, the first is much harder and requires storing or building a "tree" of how the subexpressions nest (which would fix the second one as well). Paolo
I doubt this will be fixed.
Do you have a small testcase demonstrating this problem?
Created attachment 875 [details] Patch taken from bug comments.
Sorry. That attachment was meant for bug 14.
no feedback to testcase query (echo aba | sed -e 's/\(a\(b\)*\)*/\2/' does produce b, but it's not clear to me what should it print instead)
echo aba | sed -e 's/\(a\(b\)*\)*/\2/' should print nothing because the final match of \1 is "a" and \2 must be a substring of \1. Reopening...
... and leaving suspended.
https://bugzilla.redhat.com/show_bug.cgi?id=876766 is another report of this bug.
I think the newly linked Red Hat bug report is more convincing that this needs to be fixed. What is the justification for the "SUSPENDED" status?
(In reply to Rich Felker from comment #9) > I think the newly linked Red Hat bug report is more convincing that this > needs to be fixed. What is the justification for the "SUSPENDED" status? There is no justification, moved to NEW.
The justification for the "suspended" state is that this would be very complicated to fix and wouldn't really add anything to the quality of the implementation. Even if the "(a(b)*)*" case would not be hard to fix, I'm not sure we can say the same of the backreference testcase in the RH bug ('(a(b)*)*\2' matched against 'abab') or the more complicated '(a(b)*)*x\1\2' matched against 'abaxa'. The RH bugzilla was opened by Eric doesn't really add anything to the urgency of this bug; Fedora bugs that also exist upstream can be closed liberally, and that's what I did. The grep bug on Savannah (https://savannah.gnu.org/bugs/?37737) might add something, but it is not clear if the user actually encountered it in a real-world usecase. The same "bug" is present in hardly every regular expression matcher, and I would suggest that the Austin group gives more leeway to implementations. For example, the following rules could work: - it is undefined _which_ occurrence of the sub-RE is captured by a parenthesized group (and matched in backreferences) if the subgroup, or any of its parents, is quantified with + * {}; - the backreference should only match the empty string only if the corresponding sub-RE can be empty, or if the corresponding parenthesized group, or any of its parents, is quantified with * or {0,...}.
> The same "bug" is present in hardly every regular expression matcher, Citation needed. GNU regex is the only implementation I've seen with this bug that purports to be an implementation of POSIX RE. The implementation we have in musl, which is based on TRE, passes the tests in this bug report just fine. As for the RH bug report, the reason I saw it as adding urgency is that it shows how this bug can affect whether or not the regex matches at all, as opposed to "just" a pedantic matter of extraneous data in the resulting subexpression match array.
At the time I reported the bug, I'm pretty sure I checked a few proprietary Unices. Also, even though it is not POSIX RE, Perl regular expressions have the same behavior. While it is true that backreferences can make this bug affect the overall outcome of the match, it is still a very weird regular expression. Thanks for mentioning musl's matcher, I'll check it out. Matching backreferences at a decent speed _and_ obeying the POSIX leftmost/longest rules is very hard to do. The glibc matcher unfortunately is very badly documented and the backreference part of it is basically black magic. :(
On Thu, Jul 04, 2013 at 07:42:51AM +0000, bonzini at gnu dot org wrote: > Unices. Also, even though it is not POSIX RE, Perl regular expressions have > the same behavior. Perl RE's are so different from POSIX that I would not consider them in this discussion. > Thanks for mentioning musl's matcher, I'll check it out. The upstream source we got it from, TRE, is where the magic came from. The author has some good documentation on how it works. > Matching backreferences at a decent speed _and_ obeying the POSIX > leftmost/longest rules is very hard to do. Currently the speed is somewhat less than glibc, but not horrible by any means. A large part of the cost is probably mbtowc. I'm planning to eventually adapt it to compile the UTF-8 RE directly to a byte-based state machine with actual character identity decoding only when matching against character classes, but it's low priority right now.
I'm curious about the speed on testcases with backreferences. The non-backreference part of the bug would be relatively easy to fix in glibc too. The musl source says backreferences "can be spectacularly expensive", and looking at the code I expect that to be true. Unfortunately, GNU sed uses the GNU regex API, not the POSIX one, but it has some "interesting" examples in its testsuite. Perhaps you can try those tests on another sed, compiled against both glibc and musl.
The same issue is present in Solaris 9 and Solaris 11.1 POSIX grep (/usr/xpg4/bin/grep), the only non-GNU greps I have easy access to. $ echo abab | /usr/xpg4/bin/grep '\(a\(b\)*\)*\2' abab Perhaps someone should file a bug report with POSIX folks? It's hard to believe that the current requirement is intended, if so few implementations conform.