Bug 12811 - regexec/re_search consumes huge amounts of memory
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: regex
Version: 2.13
Importance: P2 normal
Target Milestone: ---
Assignee: Ulrich Drepper
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-05-26 15:29 UTC by Emil Wojak
Modified: 2014-06-13 10:53 UTC
CC List: 2 users

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments
Fix for huge memory usage (652 bytes, patch)
2011-05-26 15:29 UTC, Emil Wojak
A patch for optimal size of internal buffers. (444 bytes, patch)
2011-05-26 15:30 UTC, Emil Wojak

Description Emil Wojak 2011-05-26 15:29:15 UTC
Created attachment 5753
Fix for huge memory usage

The bug is triggered under the following circumstances:
- a multibyte character encoding is in use, e.g. pl_PL.UTF-8
- either a translation table is used or the RE_ICASE flag is set
- the input buffer ends with a UTF-8 character cut in the middle, e.g. aaaaaaaaaaaa\xc4
- the regex does not match the input buffer and is of a kind that re_search would apply starting at each position of the input buffer, e.g. [^b]*ab or simply .*ab

Here's a sample program that consumes 1.4 GB on a 32-bit architecture and 5.2 GB on a 64-bit machine (measured with valgrind --tool=massif).

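#include <regex.h>
#include <locale.h>

int main(void) {
        regex_t preg;
        /* A multibyte (UTF-8) locale is required to trigger the bug. */
        setlocale(LC_CTYPE, "en_US.UTF-8");
        /* Case-insensitive pattern that is retried at every start position. */
        regcomp(&preg, ".*ab", REG_ICASE);
        /* The input ends with \xc4, the first byte of an incomplete UTF-8 character. */
        regexec(&preg, "aaaaaaaaaaaa\xc4", 0, NULL, 0);
        regfree(&preg);
        return 0;
}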


The excessive memory usage is caused by calling extend_buffers on each re_search_internal iteration, even though the internal buffers are already long enough to hold the whole string. When the matching procedure reaches mctx->input.valid_len, the internal buffer size is doubled and the rest of the input buffer is converted to wchar_t, except for the last byte, which is a UTF-8 character cut in the middle. This last character is never converted, because its continuation never comes, but the internal buffers are still needlessly doubled.
A patch solving this problem is attached.
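
To illustrate the mechanism described above, here is a toy model. It is not glibc code and not the attached patch: the struct, the function toy_final_capacity and the "guarded" flag are all hypothetical, and it assumes one extension request per attempted start position. It only shows why doubling without checking whether the buffers already cover the input grows the capacity exponentially, while a simple length check keeps it constant.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only; every name here is hypothetical, not a glibc identifier. */
struct toy_input {
    size_t len;       /* total bytes in the input string */
    size_t valid_len; /* bytes converted so far; stalls at len - 1 because
                         the trailing byte is an incomplete character */
    size_t bufs_len;  /* current capacity of the internal buffers */
};

/* Returns the buffer capacity after one matching attempt per start
   position.  With guarded == 0 the capacity is doubled on every attempt;
   with guarded != 0 it is doubled only if it is shorter than the input. */
size_t toy_final_capacity(size_t len, size_t initial_capacity, int guarded)
{
    assert(len > 0 && initial_capacity >= len);
    struct toy_input in = { len, len - 1, initial_capacity };

    for (size_t start = 0; start < in.len; start++) {
        /* The matcher reaches valid_len and asks for more input. */
        int needs_room = in.bufs_len < in.len;
        if (!guarded || needs_room)
            in.bufs_len *= 2;  /* stands in for the doubling step */
        /* valid_len does not advance: the last character never completes. */
    }
    return in.bufs_len;
}
```

For the 13-byte input from the sample program and an assumed initial capacity of 14, the unguarded variant ends at 14 * 2^13 elements, while the guarded variant stays at 14.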

There's another issue. Once the internal buffers are long enough to hold at least half of the input buffer, they shouldn't be doubled, because that wastes memory. Instead it's enough to extend them to the actual length of the input buffer. This can save significant amounts of memory for long input buffers.
A patch for this issue is attached as well.
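
The growth policy proposed above can be sketched as follows. This is a minimal illustration under my own naming: next_capacity is a hypothetical helper, not the function changed by the attached patch. It doubles the capacity only while the doubled size stays below the input length, and otherwise grows to exactly the input length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: picks the next buffer capacity for an input of
   input_len elements, given the current capacity cur_len. */
size_t next_capacity(size_t cur_len, size_t input_len)
{
    assert(cur_len > 0);

    /* Already large enough: no growth needed. */
    if (cur_len >= input_len)
        return cur_len;

    /* Doubling would reach or overshoot the input (or overflow size_t):
       grow to exactly the input length instead. */
    if (cur_len > SIZE_MAX / 2 || cur_len * 2 >= input_len)
        return input_len;

    /* Otherwise keep the usual doubling strategy. */
    return cur_len * 2;
}
```

For a 100-element input, a capacity of 8 grows to 16, but a capacity of 64 grows to 100 rather than 128.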
Comment 1 Emil Wojak 2011-05-26 15:30:14 UTC
Created attachment 5754
A patch for optimal size of internal buffers.
Comment 2 Paolo Bonzini 2011-05-26 15:32:45 UTC
The patches look good apart from extra-long lines.  Thanks.
Comment 3 Ulrich Drepper 2011-05-28 21:17:21 UTC
The patches missed the crucial change which fixed this specific problem.  I added the patches and then this one additional test.