This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
[PATCH] fix false multi-byte matches in some regular expressions
- From: Stanislav Brabec <sbrabec at suse dot cz>
- To: libc-alpha at sourceware dot org
- Date: Fri, 10 Feb 2012 21:16:16 +0100
- Subject: [PATCH] fix false multi-byte matches in some regular expressions
In some charsets (e.g. EUC-JP), a byte sequence that straddles the boundary
between two adjacent characters can look like a match. The regexp code that
evaluates such false multi-byte matches contains a bug that can cause a
false match to be accepted as real. This can cause a wrong replacement or
even trigger an infinite loop in sed (due to another bug in sed).
re_search_internal(), inside switch(match_kind) in case 6, finds a
possible match. For our false match, the verification that should reject
a match not respecting multi-byte characters fails, and match_regex()
returns the index of the false match.
Going deeper, re_search_internal() calls re_string_reconstruct() and
that calls re_string_skip_chars().
re_string_skip_chars() is an I18N-specific function that steps character
by character up to the indexed character. It operates on whole
multi-byte characters.
In a correct run, it returns the correct index of the next character
to inspect. When the bug occurs, __mbrtowc called from there
returns -2 (incomplete multi-byte character). Why? It seems to be caused
by remain_len being equal to 1, even though there are still 7 bytes to
inspect ("\267\357a\277\267\275\350").
I believe that remain_len is computed incorrectly:
sed-4.2.1/lib/regex_internal.c:502 re_string_skip_chars()
remain_len = pstr->len - rawbuf_idx;
pstr->len seems to be the length of the remaining part of the string,
while rawbuf_idx is the index of that remaining part within the
original (raw) string, so the two are not measured from the same origin.
I am not very familiar with the code, but I believe that the expression
should be:
remain_len = pstr->raw_len - rawbuf_idx;
Example:
Stopped in the first iteration of the loop in re_string_skip_chars().
Correct case (two leading "a" characters):
rawbuf_idx = 5
*pstr = {
raw_mbs = 0x6479b0 "aa\267\357a\277\267\275", <incomplete sequence \350>, mbs = 0x6479b2 "\267\357a\277\267\275", <incomplete sequence \350>,
wcs = 0x648190, offsets = 0x0, cur_state = {__count = 0, __value = {
__wch = 0, __wchb = "\000\000\000"}}, raw_mbs_idx = 2,
valid_len = 0, valid_raw_len = 3, bufs_len = 4, cur_idx = 2,
raw_len = 9, len = 7, raw_stop = 9, stop = 7, tip_context = 0,
trans = 0x0, word_char = 0x647d88, icase = 0 '\000',
is_utf8 = 0 '\000', map_notascii = 0 '\000', mbs_allocated = 0 '\000',
offsets_needed = 0 '\000', newline_anchor = 0 '\000',
word_ops_used = 0 '\000', mb_cur_max = 3}
Buggy case (three leading "a" characters):
rawbuf_idx = 6
*pstr = {
raw_mbs = 0x6479b0 "aaa\267\357a\277\267\275", <incomplete sequence \350>, mbs = 0x6479b3 "\267\357a\277\267\275", <incomplete sequence \350>,
wcs = 0x648190, offsets = 0x0, cur_state = {__count = 0, __value = {
__wch = 0, __wchb = "\000\000\000"}}, raw_mbs_idx = 3,
valid_len = 0, valid_raw_len = 3, bufs_len = 4, cur_idx = 2,
raw_len = 10, len = 7, raw_stop = 10, stop = 7, tip_context = 0,
trans = 0x0, word_char = 0x647d88, icase = 0 '\000',
is_utf8 = 0 '\000', map_notascii = 0 '\000', mbs_allocated = 0 '\000',
offsets_needed = 0 '\000', newline_anchor = 0 '\000',
word_ops_used = 0 '\000', mb_cur_max = 3}
If my observation is correct, the bug is not EUC-JP specific.
Bug triggers:
- The charset must be able to form a false match on the boundary of
two characters. EUC-JP meets this requirement, UTF-8 probably does not.
- There is a match that is genuine when the string is read as ASCII
but false in the locale-specific charset.
- This false match must appear at an exact place about two thirds of
the way through the string.
References:
glibc bugzilla:
http://sourceware.org/bugzilla/show_bug.cgi?id=13637
sed+grep:
http://lists.gnu.org/archive/html/bug-gnu-utils/2012-02/msg00016.html
http://lists.gnu.org/archive/html/bug-gnu-utils/2012-02/msg00017.html
(This one contains sed testsuite.)
http://lists.gnu.org/archive/html/bug-gnu-utils/2012-02/msg00018.html
Index: glibc-2.15/posix/regex_internal.c
===================================================================
--- glibc-2.15.orig/posix/regex_internal.c
+++ glibc-2.15/posix/regex_internal.c
@@ -500,7 +500,7 @@ re_string_skip_chars (re_string_t *pstr,
rawbuf_idx < new_raw_idx;)
{
wchar_t wc2;
- int remain_len = pstr->len - rawbuf_idx;
+ int remain_len = pstr->raw_len - rawbuf_idx;
prev_st = pstr->cur_state;
mbclen = __mbrtowc (&wc2, (const char *) pstr->raw_mbs + rawbuf_idx,
remain_len, &pstr->cur_state);
--
Best Regards / S pozdravem,
Stanislav Brabec
software developer
---------------------------------------------------------------------
SUSE LINUX, s. r. o. e-mail: sbrabec@suse.cz
Lihovarská 1060/12 tel: +49 911 7405384547
190 00 Praha 9 fax: +420 284 028 951
Czech Republic http://www.suse.cz/