Bug 429 - regex hangs on backreferences
Summary: regex hangs on backreferences
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: regex (show other bugs)
Version: unspecified
: P2 normal
Target Milestone: ---
Assignee: Jakub Jelinek
URL:
Keywords:
Depends on: 508
Blocks:
  Show dependency treegraph
 
Reported: 2004-10-07 11:00 UTC by Jakub Jelinek
Modified: 2004-11-12 12:22 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
Fixes exponential behavior in check_dst_limits_calc_pos_1 (1.07 KB, patch)
2004-11-10 10:53 UTC, Paolo Bonzini
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Jakub Jelinek 2004-10-07 11:00:32 UTC
Just so that it isn't forgotten.
bug-regex11.c still has a few tests commented out that hang glibc regexec. E.g.

#include <regex.h>
#include <stdio.h>

int main ()
{
  regex_t rbuf;
  const char *p;
  int err;
  p = "^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?"
      "\\9\\8\\7\\6\\5\\4\\3\\2\\1$";
  if ((err = regcomp  (&rbuf, p, REG_NOSUB | REG_EXTENDED))) {
    char errstr[300];
    regerror (err, &rbuf, errstr, sizeof (errstr));
    puts (errstr);
    return err;
  }
  return regexec (&rbuf, "civic", 0, NULL, 0);
}

takes really too long.
Comment 1 GOTO Masanori 2004-10-07 14:17:10 UTC
memorandum for Jakub.
Comment 2 Paolo Bonzini 2004-11-04 08:43:46 UTC
It does not hang, it has an incredibly high complexity, because a lot of
OP_BACKREF nodes have to be walked on the epsilon closure in
check_dst_limits_calc_pos, before reaching an OP_CLOSE_SUBEXP or OP_OPEN_SUBEXP.

Some simple-minded optimization can be done in check_dst_limits_calc_pos, see
patch at http://sources.redhat.com/ml/libc-alpha/2004-11/msg00018.html (applying
on top of other patches from me from late October and early November).
Comment 3 Jakub Jelinek 2004-11-09 15:24:26 UTC
But even with your patch bug-regex11.c with s/#if 0/#if 1/ eats certainly
more than 10 minutes of CPU time (until I killed it).

Either there is a better algorithm for many backreferences, or we should
consider using NFA for patterns where DFA with backtracing is known to take
too long.
Comment 4 paolo.bonzini@lu.unisi.ch 2004-11-09 15:51:38 UTC
Subject: Re:  regex hangs on backreferences

> But even with your patch bug-regex11.c with s/#if 0/#if 1/ eats certainly
> more than 10 minutes of CPU time (until I killed it).

Yes.  I managed to run a 7-backreference version, which took a few minutes
before my patch.  True, it does not make the full testcase feasible yet, but
I'll work on it.

> Either there is a better algorithm for many backreferences, or we should
> consider using NFA for patterns where DFA with backtracing is known to
take
> too long.

I think you can do some kind of caching to cut the number of invocations of
calc_dst_limits_pos.  Its implementation is naive.  Complexity is
exponential in N because every epsilon closure visits N backreferences
without any remote hope of succeeding, because no OP_{OPEN,CLOSE}_SUBEXP is
on the epsilon closure.

I have to figure out exactly how the backref cache enters the game, because
adding something ad hoc in regcomp.c does not seem the right way to fix it.
And also, I want to understand which cases are common both in practice (to
avoid slowing down the common case) and in the worst case: I'd like
factor.sed and dc.sed to be sped up by 10% while fixing this bug.  I
certainly hope to bring it down to O(N) in the number of backreferences,
albeit with a pretty big constant in front of it.

Paolo

Comment 5 Paolo Bonzini 2004-11-10 10:53:24 UTC
Created attachment 272 [details]
Fixes exponential behavior in check_dst_limits_calc_pos_1

As promised, some caching does most of the job.  The tests in bug-regex11.c now
complete in a sane amount of time (11 seconds each on a G4 PowerMac), but
unluckily this does not speed up other testcases.
Comment 6 Paolo Bonzini 2004-11-12 12:22:28 UTC
Patch applied.