This is the mail archive of the mailing list for the glibc project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Three regex speedups, one of which is actually a bugfix

On Thu, Jan 01, 2004 at 02:01:28PM +0100, Paolo Bonzini wrote:
> This is a set of three patches.
> The first, regex-speedup-uniq-sed.patch, fixes a bug in the definition
> of re_dfastate_t.  I noticed that the uniq.sed testcase in the sed
> testsuite had a different profile than the other backreference
> stress tests, with re_acquire_state_context accounting for 50% of
> the run time due to a quadratic bottleneck.  The reason was that
> the searched state's context did not fit in the hash table's context
> field, so each call created a new state!  Fixing the definition of
> re_dfa_state_t trims 97% of re_acquire_state_context's run time,
> and 50% of the total run time, for this particular testcase.

Why context : 10? 4 bits are enough IMHO.

> For the third, regex-cache-word-char.patch, I noticed an unused word_char
> bitset in regex_t; instead of zapping it, I tried caching the result of

??? dfa->word_char is definitely not unused.  Initialized by init_word_char (),
used in group_nodes_into_DFAstates.
It is not initialized always, because in the common case there is no
\<, \>, \b, \B, \w and \W in regular expression and so differentiating
between word and non-word characters is not needed at all.
I actually had in my tree a patch to only call IS_WIDE_WORD_CHAR (wc) in
re_string_context_at if dfa->word_char != NULL, but it didn't show up high
in my profiles (note that iswalnum is way more expensive than isalnum), so I
haven't submitted it.

> IS_WORD_CHAR, and it resulted in a consistent 3-6% speedup with
> __ctype_b_loc disappearing from the profile.  The patch also moves the fetch of
> "preg->newline_anchor" in re_string_context_at rather than in the caller, since
> "preg" had to be passed nevertheless, and this contributes to the speedup.

If fetching preg->newline_anchor contributes to the speedup, then that
argument should be removed, not changed.
I see no problem in adding unsigned char newline_anchor field to
re_string_t and initialize it in re_search_internal and similarly
re_bitset_ptr_t word_char (this would be copied from dfa->word_char
at re_string_allocate), there is just one re_string_t object for regcomp etc.
and one for regexec etc. and neither preg->newline_anchor nor
dfa->word_char is ever changing after string compilation.

Now, either we can have word_char == NULL mean word/non-word doesn't matter,
thus IS_WORD_CHAR in code called by regexec would be:
input->word_char == NULL ? 0 : bitset_contain (input->word_char, ch)
input->word_char == NULL ? 0 : IS_WIDE_WORD_CHAR (wc)
Or we can allocate the word_char bitmap always (then make the actual bitset
a field of re_dfa_t to avoid the additional malloc), but initialize it only
when needed with init_word_char.  Thus, in the usual case the bitset would
be all zeros.  Then the tests would be:
bitset_contain (input->word_char, ch)
This is bad for mb though, because iswalnum is expensive.  But perhaps we
could add another boolean field to re_dfa_t (there it can be a bitfield)
and re_string_t (better at least a char for speed reasons), which would be
false initially and set to true in init_word_char ().
Then re_string_context_at can do:
bitset_contain (input->word_char, ch)
iinput->word_char_set ? IS_WIDE_WORD_CHAR (wc) : 0


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]