This is the mail archive of the
systemtap@sourceware.org
mailing list for the systemtap project.
Re: [re2c-devel] interest in a tagged-DFA implementation?
- From: Serguei Makarov <smakarov at redhat dot com>
- To: Dan Nuffer <danielnuffer at gmail dot com>, re2c-devel at lists dot sourceforge dot net
- Cc: systemtap at sourceware dot org
- Date: Fri, 21 Jun 2013 10:59:38 -0400 (EDT)
- Subject: Re: [re2c-devel] interest in a tagged-DFA implementation?
- References: <1400681454 dot 10880648 dot 1371671131989 dot JavaMail dot root at redhat dot com> <1062272056 dot 10883225 dot 1371671617359 dot JavaMail dot root at redhat dot com> <CACoZW28YgO2kf=i075oh0uvwfP=+8ifqv0J8jguWMAPapQpiVA at mail dot gmail dot com>
> Hi,
> That looks like a useful feature for some uses. Do you know how much
> overhead it imposes for the generated scanner? I'm also curious how big of
> a change it would be to the re2c code? Would it be better to have more than
> one generator or just an option to do something slightly different?
Hi Dan,
Thank you for the quick reply.
Questions of complexity are discussed in Section 4 of the paper I linked (http://laurikari.net/ville/spire2000-tnfa.pdf).
The simplest way to implement the TDFA algorithm involves keeping a table of (# of tag operations used) x (# of states in the original NFA) position values. Each DFA transition is then marked with a number of assignment operations that modify this table, reordering some existing tag values, and assigning the current position in the string to some others. (A tag operation is an instruction to save the current position in the string, so one is used to mark the start and end of each subexpression.) Thus the overhead is very low for a simple application such as 'just retrieve the entirety of the matched string', but piles up a bit if we use many subexpressions inside complicated constructs, though the algorithm *always* remains linear in the actual length of the input string.
(For this calculation, "# of states in the original NFA" is not the full number of Ins instructions, but something more like the number of states that would exist if a more standard NFA representation was used -- i.e. with fewer implicit e-transitions. The algorithm is not difficult to implement in a way that generates the same results from an NFA no matter how many untagged e-transitions are used to effectively split states into simpler Ins nodes.)
Enabling tags makes re2c's performance somewhat more uneven depending on the regexp, with rare pathological cases that involve rewriting the entire table during one transition, though this is still not as bad as some standard regexp implementations (which have features that can produce nonlinear behaviour in the length of the input).
---
What I think would fit most closely with the re2c philosophy is having an _optional_ YYTAG() primitive of some sort which allows the user to specify how the storage of tag values is implemented. I am not very familiar with the libre2c side of things, since that portion of the codebase was irrelevant for the project I'm doing, but presumably a 'standard' (good enough for most regexps) implementation of YYTAG can then be added there.
The tags have to be kept track of throughout the algorithm, but this does not preclude making them an entirely optional feature along the lines of YYFILL() where returning to the non-YYFILL algorithm is mostly a matter of omitting statements from the generated DFA. Thus code changes would perhaps consist in defining a new TAG opcode for Ins, the addition of fields to the DFA class, addition of tag calculations throughout the algorithm (guarded by if(TFlag) { ... } or such), which doesn't change how the algorithm works if we don't enable tags, but does touch a lot of places in the engine.
All the best,
Serhei Makarov