Bug 13637 - incorrect match in multi-byte (non-UTF8) string
Summary: incorrect match in multi-byte (non-UTF8) string
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: regex (show other bugs)
Version: 2.15
: P2 normal
Target Milestone: ---
Assignee: Ulrich Drepper
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-01-31 12:47 UTC by Leonardo Chiquitto
Modified: 2019-04-01 10:34 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Attachments
reg.sh: a script to reproduce the problem (587 bytes, application/x-shellscript)
2012-01-31 12:47 UTC, Leonardo Chiquitto
Details
glibc-regex-incomplete-char.patch (251 bytes, patch)
2012-02-10 19:20 UTC, Stanislav Brabec
Details | Diff
glibc-regex-incomplete-char.patch (2.17 KB, patch)
2012-02-24 20:26 UTC, Stanislav Brabec
Details | Diff
glibc-regex-incomplete-char.patch (2.25 KB, patch)
2012-02-27 15:13 UTC, Stanislav Brabec
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Leonardo Chiquitto 2012-01-31 12:47:55 UTC
Created attachment 6186 [details]
reg.sh: a script to reproduce the problem

When a special string composed of single and multi-byte characters is passed to re_search(), the function seems to lose track of which characters are multi-byte and returns an incorrect match. This seems to be exclusive to the ja_JP.eucjp locale.

The problem can be reproduced when the following string:

  aaa\xb7\xefa\xbf\xb7\xbd\xe8

... is matched against the pattern:

  \xb7\xbd

The two bytes in the pattern are respectively "the last byte of the second multi-byte char" and "the first byte of the third multi-byte char" in the original string.

The number of "a"s prefixed in the original string seems to make all the difference here. I could only reproduce the problem when exactly 3 or 4 "a"s are prefixed. I.e., if you remove one "a" from the prefix of the original string:

  aa\xb7\xefa\xbf\xb7\xbd\xe8

... the problem no longer happens.

I'm attaching a script that reproduces the problem. The 'sed' version I'm using is compiled with "--without-included-regex", so it should use glibc's regex functions. Unfortunately I can't affirm yet that the bug is not in sed, but I'm trying to create a self contained program to demonstrate the problem.
Comment 1 Stanislav Brabec 2012-02-10 19:20:28 UTC
Created attachment 6207 [details]
glibc-regex-incomplete-char.patch

Proposed fix. There is another bug in sed that triggers infinite loop.

Description:

re_search_internal() inside switch(match_kind) in case 6 finds a possible match. In case of our false match, verification of match not respecting multi-byte characters fails and match_regex() returns index of such false match.

Going deeper, re_search_internal() calls re_string_reconstruct() and that calls re_string_skip_chars().

re_string_skip_chars() is a I18N specific function that jumps by characters up to the indexed character. It is a multi-byte character wise function.

In case of correct run, it returns correct index to the next character to inspect. In case of bug occurrence, __mbrtowc called from there returns -2 (incomplete multi-byte character). Why? It seems to be caused by remain_len being equal 1, even if there is still 6 bytes to inspect ("\267\357a\277\267\275").

I believe, that remain_len is computed incorrectly:

sed-4.2.1/lib/regex_internal.c:502 re_string_skip_chars()

      remain_len = pstr->len - rawbuf_idx;

pstr->len seems to be length of the remaining part of the string, rawbuf_idx is the index of the remaining part of the string in the original (raw) string.

I am not quite familiar with the code, but I believe that the expression should be:
remain_len = pstr->raw_len - rawbuf_idx;


Example:

stop in the first iteration of the re_string_skip_chars()

Correct case (two leading "a" characters):
rawbuf_idx = 5
*pstr = {
  raw_mbs = 0x6479b0 "aa\267\357a\277\267\275", <incomplete sequence \350>, mbs
= 0x6479b2 "\267\357a\277\267\275", <incomplete sequence \350>, 
  wcs = 0x648190, offsets = 0x0, cur_state = {__count = 0, __value = {
      __wch = 0, __wchb = "\000\000\000"}}, raw_mbs_idx = 2, 
  valid_len = 0, valid_raw_len = 3, bufs_len = 4, cur_idx = 2, 
  raw_len = 9, len = 7, raw_stop = 9, stop = 7, tip_context = 0, 
  trans = 0x0, word_char = 0x647d88, icase = 0 '\000', 
  is_utf8 = 0 '\000', map_notascii = 0 '\000', mbs_allocated = 0 '\000', 
  offsets_needed = 0 '\000', newline_anchor = 0 '\000', 
  word_ops_used = 0 '\000', mb_cur_max = 3}

Buggy case (three leading "a" characters):
rawbuf_idx = 6
*pstr = {
  raw_mbs = 0x6479b0 "aaa\267\357a\277\267\275", <incomplete sequence \350>,
mbs = 0x6479b3 "\267\357a\277\267\275", <incomplete sequence \350>, 
  wcs = 0x648190, offsets = 0x0, cur_state = {__count = 0, __value = {
      __wch = 0, __wchb = "\000\000\000"}}, raw_mbs_idx = 3, 
  valid_len = 0, valid_raw_len = 3, bufs_len = 4, cur_idx = 2, 
  raw_len = 10, len = 7, raw_stop = 10, stop = 7, tip_context = 0, 
  trans = 0x0, word_char = 0x647d88, icase = 0 '\000', 
  is_utf8 = 0 '\000', map_notascii = 0 '\000', mbs_allocated = 0 '\000', 
  offsets_needed = 0 '\000', newline_anchor = 0 '\000', 
  word_ops_used = 0 '\000', mb_cur_max = 3}


If my observation is correct, the bug is not EUC-JP specific.

Bug triggers:
- Charset must be capable to constitute false match on the boundary of two characters. EUC-JP fits this requirement, UTF-8 probably does not.
- There is a true ASCII match that is false match in locale specific charset.
- This false match must appear in an exact place near two thirds of the string.
Comment 3 Stanislav Brabec 2012-02-24 20:26:54 UTC
Created attachment 6244 [details]
glibc-regex-incomplete-char.patch

New patch with fixed ChangeLog entry and testcase.
Comment 4 Stanislav Brabec 2012-02-27 15:13:21 UTC
Created attachment 6252 [details]
glibc-regex-incomplete-char.patch

Updated patch.

Changes since comment 3:
- Testcase uses test-skeleton.c.
- Uses SBC_MAX and includes "regex_internal.h".
- Setup fastmap before call to re_compile_pattern.
- bug-regex33.6 comment updated: There is one true and one false match.
Comment 5 Stanislav Brabec 2012-02-28 16:42:34 UTC
Fixed by commit 71b5d1c (patch from the comment 4).