13637 – incorrect match in multi-byte (non-UTF8) string

Bug 13637 - incorrect match in multi-byte (non-UTF8) string

Summary: incorrect match in multi-byte (non-UTF8) string

Status:	RESOLVED FIXED

Alias:	None

Product:	glibc
Classification:	Unclassified
Component:	regex (show other bugs)
Version:	2.15

Importance:	P2 normal
Target Milestone:	---
Assignee:	Ulrich Drepper

URL:
Keywords:

Depends on:
Blocks:

Reported:	2012-01-31 12:47 UTC by Leonardo Chiquitto
Modified:	2019-04-01 10:34 UTC (History)
CC List:	2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:

Flags:	fweimer: security-

Attachments
reg.sh: a script to reproduce the problem (587 bytes, application/x-shellscript) 2012-01-31 12:47 UTC, Leonardo Chiquitto	Details
glibc-regex-incomplete-char.patch (251 bytes, patch) 2012-02-10 19:20 UTC, Stanislav Brabec	Details \| Diff
glibc-regex-incomplete-char.patch (2.17 KB, patch) 2012-02-24 20:26 UTC, Stanislav Brabec	Details \| Diff
glibc-regex-incomplete-char.patch (2.25 KB, patch) 2012-02-27 15:13 UTC, Stanislav Brabec	Details \| Diff
Show Obsolete (2) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Leonardo Chiquitto 2012-01-31 12:47:55 UTC

Created attachment 6186 [details]
reg.sh: a script to reproduce the problem

When a special string composed of single and multi-byte characters is passed to re_search(), the function seems to lose track of which characters are multi-byte and returns an incorrect match. This seems to be exclusive to the ja_JP.eucjp locale.

The problem can be reproduced when the following string:

  aaa\xb7\xefa\xbf\xb7\xbd\xe8

... is matched against the pattern:

  \xb7\xbd

The two bytes in the pattern are respectively "the last byte of the second multi-byte char" and "the first byte of the third multi-byte char" in the original string.

The number of "a"s prefixed in the original string seems to make all the difference here. I could only reproduce the problem when exactly 3 or 4 "a"s are prefixed. I.e., if you remove one "a" from the prefix of the original string:

  aa\xb7\xefa\xbf\xb7\xbd\xe8

... the problem no longer happens.

I'm attaching a script that reproduces the problem. The 'sed' version I'm using is compiled with "--without-included-regex", so it should use glibc's regex functions. Unfortunately I can't affirm yet that the bug is not in sed, but I'm trying to create a self contained program to demonstrate the problem.

Comment 1 Stanislav Brabec 2012-02-10 19:20:28 UTC

Created attachment 6207 [details]
glibc-regex-incomplete-char.patch

Proposed fix. There is another bug in sed that triggers infinite loop.

Description:

re_search_internal() inside switch(match_kind) in case 6 finds a possible match. In case of our false match, verification of match not respecting multi-byte characters fails and match_regex() returns index of such false match.

Going deeper, re_search_internal() calls re_string_reconstruct() and that calls re_string_skip_chars().

re_string_skip_chars() is a I18N specific function that jumps by characters up to the indexed character. It is a multi-byte character wise function.

In case of correct run, it returns correct index to the next character to inspect. In case of bug occurrence, __mbrtowc called from there returns -2 (incomplete multi-byte character). Why? It seems to be caused by remain_len being equal 1, even if there is still 6 bytes to inspect ("\267\357a\277\267\275").

I believe, that remain_len is computed incorrectly:

sed-4.2.1/lib/regex_internal.c:502 re_string_skip_chars()

      remain_len = pstr->len - rawbuf_idx;

pstr->len seems to be length of the remaining part of the string, rawbuf_idx is the index of the remaining part of the string in the original (raw) string.

I am not quite familiar with the code, but I believe that the expression should be:
remain_len = pstr->raw_len - rawbuf_idx;


Example:

stop in the first iteration of the re_string_skip_chars()

Correct case (two leading "a" characters):
rawbuf_idx = 5
*pstr = {
  raw_mbs = 0x6479b0 "aa\267\357a\277\267\275", <incomplete sequence \350>, mbs
= 0x6479b2 "\267\357a\277\267\275", <incomplete sequence \350>, 
  wcs = 0x648190, offsets = 0x0, cur_state = {__count = 0, __value = {
      __wch = 0, __wchb = "\000\000\000"}}, raw_mbs_idx = 2, 
  valid_len = 0, valid_raw_len = 3, bufs_len = 4, cur_idx = 2, 
  raw_len = 9, len = 7, raw_stop = 9, stop = 7, tip_context = 0, 
  trans = 0x0, word_char = 0x647d88, icase = 0 '\000', 
  is_utf8 = 0 '\000', map_notascii = 0 '\000', mbs_allocated = 0 '\000', 
  offsets_needed = 0 '\000', newline_anchor = 0 '\000', 
  word_ops_used = 0 '\000', mb_cur_max = 3}

Buggy case (three leading "a" characters):
rawbuf_idx = 6
*pstr = {
  raw_mbs = 0x6479b0 "aaa\267\357a\277\267\275", <incomplete sequence \350>,
mbs = 0x6479b3 "\267\357a\277\267\275", <incomplete sequence \350>, 
  wcs = 0x648190, offsets = 0x0, cur_state = {__count = 0, __value = {
      __wch = 0, __wchb = "\000\000\000"}}, raw_mbs_idx = 3, 
  valid_len = 0, valid_raw_len = 3, bufs_len = 4, cur_idx = 2, 
  raw_len = 10, len = 7, raw_stop = 10, stop = 7, tip_context = 0, 
  trans = 0x0, word_char = 0x647d88, icase = 0 '\000', 
  is_utf8 = 0 '\000', map_notascii = 0 '\000', mbs_allocated = 0 '\000', 
  offsets_needed = 0 '\000', newline_anchor = 0 '\000', 
  word_ops_used = 0 '\000', mb_cur_max = 3}


If my observation is correct, the bug is not EUC-JP specific.

Bug triggers:
- Charset must be capable to constitute false match on the boundary of two characters. EUC-JP fits this requirement, UTF-8 probably does not.
- There is a true ASCII match that is false match in locale specific charset.
- This false match must appear in an exact place near two thirds of the string.

Comment 2 Stanislav Brabec 2012-02-24 20:24:38 UTC

Cross references:

libc-alpha mailing list:
  http://sourceware.org/ml/libc-alpha/2012-02/msg00206.html
  http://sourceware.org/ml/libc-alpha/2012-02/msg00267.html

sed+grep:
  sed: http://lists.gnu.org/archive/html/bug-gnu-utils/2012-02/msg00016.html
  sed testcase: http://lists.gnu.org/archive/html/bug-gnu-utils/2012-02/msg00017.html
  grep: http://lists.gnu.org/archive/html/bug-gnu-utils/2012-02/msg00018.html

Comment 3 Stanislav Brabec 2012-02-24 20:26:54 UTC

Created attachment 6244 [details]
glibc-regex-incomplete-char.patch

New patch with fixed ChangeLog entry and testcase.

Comment 4 Stanislav Brabec 2012-02-27 15:13:21 UTC

Created attachment 6252 [details]
glibc-regex-incomplete-char.patch

Updated patch.

Changes since comment 3:
- Testcase uses test-skeleton.c.
- Uses SBC_MAX and includes "regex_internal.h".
- Setup fastmap before call to re_compile_pattern.
- bug-regex33.6 comment updated: There is one true and one false match.

Comment 5 Stanislav Brabec 2012-02-28 16:42:34 UTC

Fixed by commit 71b5d1c (patch from the comment 4).