This is the mail archive of the
systemtap@sourceware.org
mailing list for the systemtap project.
[Bug runtime/11334] regular expression string matching
- From: "smakarov at redhat dot com" <sourceware-bugzilla at sourceware dot org>
- To: systemtap at sourceware dot org
- Date: Wed, 10 Oct 2012 16:21:50 +0000
- Subject: [Bug runtime/11334] regular expression string matching
- Auto-submitted: auto-generated
- References: <bug-11334-6586@http.sourceware.org/bugzilla/>
http://sourceware.org/bugzilla/show_bug.cgi?id=11334
Serguei Makarov <smakarov at redhat dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |smakarov at redhat dot com
--- Comment #1 from Serguei Makarov <smakarov at redhat dot com> 2012-10-10 16:21:50 UTC ---
Evaluating use of re2c as the basis for our regex translator, according to a
mailing list recommendation. (see http://re2c.org) Some rough thoughts as to
its suitability follow:
Difficulty integrating into systemtap:
There are some fiddly points here. First of all, re2c has a flex-like interface
where it simply preprocesses a C file searching for re2c invocations. Each
invocation consists of a set of alternate regular expressions with associated
actions at the end. The actions don't access numbered subexpressions; to get
the actual matched content we need to do a bunch of hackery where we save the
cursor position at the start of the match, and use that plus the position at
the end to get the entire matched string. The lessons/ and examples/ folders of
the re2c distribution give a good idea of how this works in practice.
Thus we need to hack the engine to instead generate code that can support a
scheme where we run a match EXPR =~ REGEXP and perhaps then extract numbered
subexpressions as substrings from the original string. This is possible to code
into re2c if we figure out how we want to keep track of subexpression indices
during matching, and define a new action type for the DFA engine, to be
associated with ( ) brackets, whose generated code saves the indices at the
appropriate point.
Robustness as kernel code:
Fairly optimistic about this part. As it stands now, re2c simply emits the DFA
in the form of a series of case statements. (This relies on the magic of gcc to
optimize the resulting spaghetti down into a compact form, which probably
stands up well next to any hand-coded optimizations we might feel inclined to
implement on a DFA table for a project of this scope.)
Actual manipulation / reading of the string is done by calling out to
user-provided macros:
YYCTYPE -- character type, e.g. char or unsigned char or whatnot
YYCURSOR -- gives pointer to current character in buffer
YYLIMIT -- gives the position of the end of the buffer
YYMARKER -- stores (limited) backtracking info for the special case where the
regex has an optional postfix that may or may not be matched.
YYFILL() -- used when re2c runs out of buffer data, to request the buffer to be
refilled/updated. If we are matching the string in-place, this is a no-op, and
(unlike other re2c use cases) we do not have to worry about YYMARKER remaining
consistent with the state of the buffer.
The only thing re2c assumes is that there is a buffer of consecutive characters
used for the matching. Hence, by carefully coding the hooks above, the
resulting "runtime" should be robust enough for in-kernel use.
{unresolved questions -- while robust, can the matching go on for too long?
given that strings are limited by MAXSTRINGLEN, for instance?}
Conformity to a standard such as POSIX ERE:
Almost, but not quite there. As far as I can tell, re2c does not support named
character classes, and does something weird and non-standard with single and
double quotes. (Double quotes quote stuff, but single quotes quote stuff and
make it case insensitive?)
This is not at all difficult to remedy as long as the rest of re2c is compliant
(though we'd need someone else's testsuite to be sure of that -- see "testing
testing testing" below).
License/maintenance issues:
The project is public domain, so I don't think there will be a problem
incorporating it into SystemTap. I am more uncertain about how to proceed with
contributing or separately maintaining the inevitable changes and extensions
needed for SystemTap use.
The project is fairly stagnant but not completely abandoned. (The home page
mentions that it had been abandoned once and re-adopted.) While there are very
occasional mailing list inquiries, the last commit to the repositories was in
2011. I notice that several people using the project seem to have made
disconnected GitHub forks to hack on it rather than going to the maintainers,
which is also not an encouraging sign.
Difficulty of rewriting almost-completely-from-scratch if we decide that doing
so would better fit our needs:
Not too high, as far as I can tell. sloccount gives ~10,000 lines for the whole
project, a decided plurality of which is either fluff we don't need (e.g. their
C preprocessor), or we must replace with our own stuff anyways (e.g. the
libre2c component is completely irrelevant to our needs), or stuff that in the
event of a rewrite would be redundant infrastructure that SystemTap already
provides.
Either way, testing testing testing:
Probably the most valuable component of any regex engine to us, at least the
one I'd have most difficulty recreating on my own, would be the testsuite :)
Note that re2c's testsuite is not ideal, since its method of invocation
diverges from the EXPR=~REGEXP mechanism, and it doesn't implement POSIX ERE.
We'd probably need to look into cribbing a testsuite from GNU for full
satisfaction and coding re2c (or our own roughly-equivalent engine) to
successfully test against it.
--
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.