This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.



[Bug runtime/11334] regular expression string matching


http://sourceware.org/bugzilla/show_bug.cgi?id=11334

--- Comment #2 from Serguei Makarov <smakarov at redhat dot com> 2012-10-11 19:58:37 UTC ---
Okay, IRC discussions seem to favour an "adapt and assimilate" approach for the
existing re2c code. That leaves the SystemTap side of things to figure out.

Here's a rough sketch of the proposed in-language regex interface.

str =~ pat -- regex matching operator
str !~ pat -- regex non-matching operator

matched:str(n:long) -- nth subgroup of the most recent match

matched_start:long(n:long)
matched_end:long(n:long)   -- start and end indices of the nth subgroup

These can be ordinary tapset functions in string.stp, implemented in
embedded C, that read the match information stored in the probe
context.
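
To make the intended semantics concrete, here is a small Python model of
the interface. It is only a sketch: the ProbeContext class and its method
names are invented for illustration, Python's re module stands in for the
re2c-generated matcher, and two behaviours the sketch picks are
assumptions rather than decisions (matching is unanchored, and a failed
match clears the saved match data).

```python
import re

class ProbeContext:
    """Hypothetical stand-in for the per-probe context holding match data."""

    def __init__(self):
        self.last_match = None

    def match(self, s, pat):
        """Model of `s =~ pat`: returns 1 or 0 and records the result."""
        # Assumption: unanchored search; a failed match clears the old data.
        self.last_match = re.search(pat, s)
        return 1 if self.last_match else 0

    def no_match(self, s, pat):
        """Model of `s !~ pat`."""
        return 1 - self.match(s, pat)

    def matched(self, n):
        """n-th subgroup of the most recent match ("" if there is none)."""
        m = self.last_match
        return (m.group(n) or "") if m else ""

    def matched_start(self, n):
        """Start index of the n-th subgroup, or -1 without a match."""
        m = self.last_match
        return m.start(n) if m else -1

    def matched_end(self, n):
        """End index (exclusive) of the n-th subgroup, or -1 without a match."""
        m = self.last_match
        return m.end(n) if m else -1


ctx = ProbeContext()
if ctx.match("pid=1234 comm=bash", r"pid=([0-9]+)"):
    # Prints: 1234 4 8
    print(ctx.matched(1), ctx.matched_start(1), ctx.matched_end(1))
```

Subgroup 0 is the whole match here, following the usual regex convention;
the sketch above does not settle whether the SystemTap functions should
do the same.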

* * *

We want to avoid generating several copies of the same code for the
same regex used in more than one place. One reasonable solution might
be to allow definition of globally available regexes, as in:

global ident_re = /[_a-zA-Z][_a-zA-Z0-9]*/

All uses of ident_re then refer to the same matching code.

Another strategy (both might be worth using) is to keep a table of
declared regex patterns to detect when the same regex appears more
than once.
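
A minimal sketch of such a table, in Python for brevity (the real one
would live in the translator's C++ code). The RegexTable class and its
methods are hypothetical names; the point is only that registering the
same pattern text twice returns the same id, so one matcher is generated
per distinct pattern.

```python
class RegexTable:
    """Hypothetical table mapping each distinct pattern to one matcher id."""

    def __init__(self):
        self._ids = {}  # pattern text -> matcher id

    def register(self, pattern):
        """Return the id for `pattern`, allocating one on first sight."""
        if pattern not in self._ids:
            self._ids[pattern] = len(self._ids)
        return self._ids[pattern]

    def patterns(self):
        """Distinct patterns in id order, i.e. the matchers to generate."""
        return sorted(self._ids, key=self._ids.get)


table = RegexTable()
first = table.register("[_a-zA-Z][_a-zA-Z0-9]*")
other = table.register("[0-9]+")
again = table.register("[_a-zA-Z][_a-zA-Z0-9]*")  # same id as `first`
```

Comparing pattern text only catches literal duplicates; two differently
written but equivalent regexes would still get separate matchers.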

Another (more involved) optimization is to check whether a given match
needs to save subgroup locations at all, or whether a plain yes/no answer
is enough. This might be handled by something as simple as a pragma on the
matched_* tapset functions signalling that subgroup saving is needed,
somewhat similar to how the existing /* pragma:unwind */ works.
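
A rough sketch of that check, again in Python. The pragma name
"regex_subgroups", the TAPSET_PRAGMAS table, and the function name are
all made up for illustration, and the decision is made once per script
rather than per match, which is a simplification of the idea above.

```python
# Hypothetical: pragmas attached to tapset functions, keyed by function name.
TAPSET_PRAGMAS = {
    "matched": {"regex_subgroups"},
    "matched_start": {"regex_subgroups"},
    "matched_end": {"regex_subgroups"},
}

def needs_subgroup_saving(called_functions, pragmas=TAPSET_PRAGMAS):
    """True if any function the script calls is marked as reading subgroups."""
    return any("regex_subgroups" in pragmas.get(name, set())
               for name in called_functions)


# A script that only tests `str =~ pat` can use the cheaper yes/no matcher.
cheap = not needs_subgroup_saving({"printf", "execname"})
# A script that calls matched() needs the subgroup-saving matcher.
full = needs_subgroup_saving({"printf", "matched"})
```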

* * *

In the long term, if we want other features such as substitution, we
might either use the Perl str =~ s/pat/sub/ syntax conventions, or we
might hack up the translator to detect some function-call-like construct
such as str_replace(prnt_str,/pat/,/rplc_pat/) and substitute the appropriate
generated code for that.

* * *

Short and incomplete list of files that will be affected:

* regcomp.cxx -- NEW file interfacing with (and eventually assimilating)
  the re2c DFA compiler
  - defines a table of regular expressions used by the program (stored
    in the session object)
  - action: register a new regular expression (save it in the table and
    compile to internal DFA representation)
  - action: emit C code for given DFA (called at the appropriate point
    from translate.cxx)

* staptree.cxx
* parse.cxx
  - support for parsing and internally representing the =~ operator
* elaborate.cxx
  - after appropriate semantic checks & code elision are done, register
    all of the surviving regexes using the regcomp.cxx interface
* translate.cxx
  - call regcomp.cxx code to emit regex matching subroutines
  - replace =~ with appropriate regex invocations

* string.stp
  - access the regex match data after the fact
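
To illustrate the "emit C code for given DFA" action, here is a toy
emitter in Python. Everything in it is an assumption made for the sake of
the example: the DFA is written by hand rather than produced by re2c, the
dictionary layout and the names emit_matcher and stap_re_0 are invented,
and the generated function matches the whole string and returns only
yes/no, with no subgroup tracking. The code re2c actually generates looks
quite different.

```python
# Hand-written DFA for /[_a-zA-Z][_a-zA-Z0-9]*/ (whole-string match).
# Each state maps to a list of (C condition on `c`, next state) edges.
IDENT_DFA = {
    "start": 0,
    "accepting": {1},
    "edges": {
        0: [("c == '_' || isalpha((unsigned char)c)", 1)],
        1: [("c == '_' || isalnum((unsigned char)c)", 1)],
    },
}

def emit_matcher(name, dfa):
    """Return C source for a function running `dfa` over a C string."""
    lines = [
        f"static int {name}(const char *s)",
        "{",
        f"  int state = {dfa['start']};",
        "  for (; *s; s++) {",
        "    char c = *s;",
        "    switch (state) {",
    ]
    for state in sorted(dfa["edges"]):
        lines.append(f"    case {state}:")
        for cond, target in dfa["edges"][state]:
            # `continue` advances the enclosing for loop to the next char.
            lines.append(f"      if ({cond}) {{ state = {target}; continue; }}")
        lines.append("      return 0;  /* no edge for this character */")
    lines += [
        "    default:",
        "      return 0;",
        "    }",
        "  }",
    ]
    accept = " || ".join(f"state == {s}" for s in sorted(dfa["accepting"]))
    lines.append(f"  return {accept or '0'};")
    lines.append("}")
    return "\n".join(lines)


source = emit_matcher("stap_re_0", IDENT_DFA)
print(source)
```

In the scheme above, translate.cxx would emit one such subroutine per
registered regex and turn each =~ into a call to it.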


