Bug 25934 - re_token_t.mb_partial used before initialization
Summary: re_token_t.mb_partial used before initialization
Status: UNCONFIRMED
Alias: None
Product: glibc
Classification: Unclassified
Component: regex (show other bugs)
Version: 2.27
: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-05-07 00:49 UTC by Steven Li
Modified: 2020-05-07 06:37 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Steven Li 2020-05-07 00:49:10 UTC
I was debugging my own program the the Valgrind/Memcheck tool, and discovered a case of regex using a variable without first initializing it.

The offending code is in regcomp.c near line 328:

*p++ = dfa->nodes[node].opr.c;
	      while (++node < dfa->nodes_len
		     &&	dfa->nodes[node].type == CHARACTER
		     && dfa->nodes[node].mb_partial)
		*p++ = dfa->nodes[node].opr.c;

The Valgrind/Memcheck reported the problem as:

==31536== Conditional jump or move depends on uninitialised value(s)
==31536==    at 0x56F213D: re_compile_fastmap_iter.isra.26 (regcomp.c:328)
==31536==    by 0x57023F0: __re_compile_fastmap (regcomp.c:282)
==31536==    by 0x57023F0: regcomp (regcomp.c:509)
==31536==    by 0x126EEF: regex_match (xxx_xxx.c:290)
(The rest, plus the line above, is from our own code)

The tool also reported where/how such a variable was allocated (from the heap, not the stack)

==31536==  Uninitialised value was created by a heap allocation
==31536==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==31536==    by 0x56F2D6A: create_token_tree.isra.14.constprop.39 (regcomp.c:3749)
==31536==    by 0x56FB133: parse_bracket_exp (regcomp.c:3299)
==31536==    by 0x56FB133: parse_expression (regcomp.c:2262)
==31536==    by 0x56FB519: parse_branch (regcomp.c:2190)
==31536==    by 0x56FB68B: parse_reg_exp (regcomp.c:2138)
==31536==    by 0x56FBD7C: parse (regcomp.c:2107)
==31536==    by 0x56FBD7C: re_compile_internal (regcomp.c:788)
==31536==    by 0x5702331: regcomp (regcomp.c:498)
==31536==    by 0x126EEF: regex_match (xxx_xxx.c:290)

I reviewed the related code, and saw that the "mb_partial" field is part of the following structure (in regex_internal.h):

typedef struct
{
  union
  {
    unsigned char c;                /* for CHARACTER */
    re_bitset_ptr_t sbcset;        /* for SIMPLE_BRACKET */
#ifdef RE_ENABLE_I18N
    re_charset_t *mbcset;        /* for COMPLEX_BRACKET */
#endif /* RE_ENABLE_I18N */
    Idx idx;                        /* for BACK_REF */
    re_context_type ctx_type;        /* for ANCHOR */
  } opr;
#if __GNUC__ >= 2 && !defined __STRICT_ANSI__
  re_token_type_t type : 8;
#else
  re_token_type_t type;
#endif
  unsigned int constraint : 10;        /* context constraint */
  unsigned int duplicated : 1;
  unsigned int opt_subexp : 1;
#ifdef RE_ENABLE_I18N
  unsigned int accept_mb : 1;
  /* These 2 bits can be moved into the union if needed (e.g. if running out
     of bits; move opr.c to opr.c.c and move the flags to opr.c.flags).  */
  unsigned int mb_partial : 1;
#endif
  unsigned int word_char : 1;
} re_token_t;

I then compared the write operation of "mb_partial" against others, such as the near-by "accept_mb" member, and saw that indeed there are times that the "accept_mb" field is set, but the "mb_partial" field is not. For example (in regex_internal.c, near line 1450):

  dfa->nodes[dfa->nodes_len].constraint = 0;
#ifdef RE_ENABLE_I18N
  dfa->nodes[dfa->nodes_len].accept_mb =
    ((token.type == OP_PERIOD && dfa->mb_cur_max > 1)
     || token.type == COMPLEX_BRACKET);
#endif

The code above seems to be performing initialization for the newly allocated "node", but clearly missed the mb_partial field.

My problem occurred with 2.27, but I also read through the latest code, and it seems that mb_partial is not being initialized in additional places. So the problem should still exist in the latest version.

If you need me to validate the problem against the latest glibc version, or provide a program to demonstrate this problem with Valgrind/Memcheck, I'll be happy to do so.
Comment 1 Steven Li 2020-05-07 06:07:36 UTC
OK, I managed to create a simple problem to recreate this problem (on Ubuntu 18.04, using 2.27). The code is super simple:

$ cat a.c
#include <stdio.h>
#include <regex.h>
#include <locale.h>

int main() {
  char * pattern = "^[ab]*(c)$"; // any simpler, the problem goes away
  int flags = REG_ICASE; // has to be there for problem to appear

  setlocale(LC_CTYPE, ""); // without this, there is no problem

  regex_t regex;
  regcomp(&regex, pattern, flags);
}

The interesting thing is with the 1st 3 lines of code, each of them is a necessary condition for the problem. Compiling the code is easy enough:

$ rm a.out; gcc a.c

Running the code under Valgrind yields really interesting/disturbing result (with my home directory name masked out in the messages):

$ valgrind --track-origins=yes --leak-check=yes ./a.out
==12322== Memcheck, a memory error detector
==12322== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==12322== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==12322== Command: ./a.out
==12322==
==12322== Conditional jump or move depends on uninitialised value(s)
==12322==    at 0x4F2D13D: re_compile_fastmap_iter.isra.26 (regcomp.c:328)
==12322==    by 0x4F3D3D0: __re_compile_fastmap (regcomp.c:280)
==12322==    by 0x4F3D3D0: regcomp (regcomp.c:509)
==12322==    by 0x108749: main (in [...]/a.out)
==12322==  Uninitialised value was created by a heap allocation
==12322==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12322==    by 0x4F2DD6A: create_token_tree.isra.14.constprop.39 (regcomp.c:3749)
==12322==    by 0x4F35885: parse_expression (regcomp.c:2356)
==12322==    by 0x4F364CB: parse_branch (regcomp.c:2183)
==12322==    by 0x4F3668B: parse_reg_exp (regcomp.c:2138)
==12322==    by 0x4F36D7C: parse (regcomp.c:2107)
==12322==    by 0x4F36D7C: re_compile_internal (regcomp.c:788)
==12322==    by 0x4F3D331: regcomp (regcomp.c:498)
==12322==    by 0x108749: main (in [...]/a.out)
==12322==
==12322== Conditional jump or move depends on uninitialised value(s)
==12322==    at 0x4F2D13D: re_compile_fastmap_iter.isra.26 (regcomp.c:328)
==12322==    by 0x4F3D3F0: __re_compile_fastmap (regcomp.c:282)
==12322==    by 0x4F3D3F0: regcomp (regcomp.c:509)
==12322==    by 0x108749: main (in /home/stevel/workspace/TDengine/a.out)
==12322==  Uninitialised value was created by a heap allocation
==12322==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12322==    by 0x4F2DD6A: create_token_tree.isra.14.constprop.39 (regcomp.c:3749)
==12322==    by 0x4F35885: parse_expression (regcomp.c:2356)
==12322==    by 0x4F364CB: parse_branch (regcomp.c:2183)
==12322==    by 0x4F3668B: parse_reg_exp (regcomp.c:2138)
==12322==    by 0x4F36D7C: parse (regcomp.c:2107)
==12322==    by 0x4F36D7C: re_compile_internal (regcomp.c:788)
==12322==    by 0x4F3D331: regcomp (regcomp.c:498)
==12322==    by 0x108749: main (in [...]/a.out)
==12322==
==12322==
==12322== HEAP SUMMARY:
==12322==     in use at exit: 2,680 bytes in 48 blocks
==12322==   total heap usage: 82 allocs, 34 frees, 9,003 bytes allocated
==12322==
==12322== 256 bytes in 1 blocks are definitely lost in loss record 35 of 39
==12322==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12322==    by 0x4F3D2C9: regcomp (regcomp.c:479)
==12322==    by 0x108749: main (in [...]/a.out)
==12322==
==12322== 2,424 (224 direct, 2,200 indirect) bytes in 1 blocks are definitely lost in loss record 39 of 39
==12322==    at 0x4C2FA3F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12322==    by 0x4C31D84: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12322==    by 0x4F37CDB: re_compile_internal (regcomp.c:749)
==12322==    by 0x4F3D331: regcomp (regcomp.c:498)
==12322==    by 0x108749: main (in [...]/a.out)
==12322==
==12322== LEAK SUMMARY:
==12322==    definitely lost: 480 bytes in 2 blocks
==12322==    indirectly lost: 2,200 bytes in 46 blocks
==12322==      possibly lost: 0 bytes in 0 blocks
==12322==    still reachable: 0 bytes in 0 blocks
==12322==         suppressed: 0 bytes in 0 blocks
==12322==
==12322== For counts of detected and suppressed errors, rerun with: -v
==12322== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 0 from 0)