White Box Testing
1. Using SystemTap to validate manual code inspection
Contents
White box testing verifies the internal implementation details of the software under test. As of 2013-05-06 glibc has very little if any white box testing. The general policy has been that we implement standards conforming interfaces and that as such we need to test those interfaces. Testing interfaces is insufficient to discover all classes of errors.
This article discusses white-box testing in glibc using systemtap to inject failures into core routines and validate manual code inspection. The user in this article is not completely fictitious since this work was prompted by real users, but the discussed scenario is fictitious.
1.1. The Problem
A user has reported that they are seeing intermittent crashes in their applications under high memory load.
The crashes all appear to be in glibc.
After some back and forth with the user you are able to get a core file of the crash.
With the application, core file, and debugging symbols you have an excellent start to solving the problem:
[carlos@koi bug]$ gdb -c core.16292 ./user-application GNU gdb (GDB) Fedora (7.4.50.20120120-54.fc17) Copyright (C) 2012 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /home/carlos/support/bug/user-application...done. [New LWP 16292] Core was generated by `/foo/user-application'. Program terminated with signal 11, Segmentation fault. #0 _nl_find_msg (domain_file=domain_file@entry=0x1c292d0, domainbinding=domainbinding@entry=0x1c28010, msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fffea28aef8) at dcigettext.c:1175 1175 newmem->next = transmem_list; (gdb) bt #0 _nl_find_msg (domain_file=domain_file@entry=0x1c292d0, domainbinding=domainbinding@entry=0x1c28010, msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fffea28aef8) at dcigettext.c:1175 #1 0x00007f380e6391ab in __dcigettext (domainname=0x1c29250 "existing-domain", msgid1=0x400ba0 "", msgid2=0x0, plural=0, n=0, category=5) at dcigettext.c:630 #2 0x0000000000400821 in positive_gettext_test () at nls-test.c:44 #3 0x0000000000400aae in main (argc=2, argv=0x7fffea28b0b8) at nls-test.c:115 (gdb) p newmem $1 = (transmem_block_t *) 0x0 (gdb)
Looking up the source you see:
1167 else 1168 { 1169 malloc_count = 1; 1170 freemem_size = INITIAL_BLOCK_SIZE; 1171 newmem = (transmem_block_t *) malloc (freemem_size); 1172 # ifdef _LIBC 1173 /* Add the block to the list of blocks we have to free 1174 at some point. */ 1175 newmem->next = transmem_list; 1176 transmem_list = newmem; 1177 # endif 1178 } 1179 if (__builtin_expect (newmem == NULL, 0)) 1180 { 1181 freemem = NULL; 1182 freemem_size = 0; 1183 __libc_lock_unlock (lock); 1184 return (char *) -1; 1185 }
In a high memory load situation malloc might fail and newmem->next will segfault if newmem is NULL.
The fix is to add a check for NULL and allow the following check to return -1.
It looks as if the the code had expected malloc to fail, but code had been introduced between the malloc and the use that rendered the check moot.
You apply the following fix:
diff --git a/intl/dcigettext.c b/intl/dcigettext.c index 110307b..18b49b3 100644 --- a/intl/dcigettext.c +++ b/intl/dcigettext.c @@ -1170,10 +1170,13 @@ _nl_find_msg (domain_file, domainbinding, msgid, convert, lengthp) freemem_size = INITIAL_BLOCK_SIZE; newmem = (transmem_block_t *) malloc (freemem_size); # ifdef _LIBC - /* Add the block to the list of blocks we have to free - at some point. */ - newmem->next = transmem_list; - transmem_list = newmem; + if (newmem != NULL) + { + /* Add the block to the list of blocks we have to free + at some point. */ + newmem->next = transmem_list; + transmem_list = newmem; + } # endif } if (__builtin_expect (newmem == NULL, 0))
You build a new glibc, distribute it to the user, and hope that the bug has been fixed.
You claim that you can't easily test the fix. In a high memory load situation you have no control over where malloc will fail and return NULL.
Time passes and the user returns saying glibc still fails in roughly the same place under high memory load.
The core file reveals the following:
[carlos@koi bug]$ gdb -c core.24446 ./user-application GNU gdb (GDB) Fedora (7.4.50.20120120-54.fc17) Copyright (C) 2012 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /home/carlos/support/bug/foo/user-application...done. [New LWP 24446] Core was generated by `/foo/user-application'. Program terminated with signal 11, Segmentation fault. #0 __strstr_sse2 (haystack_start=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>, needle_start=needle_start@entry=0x7fab8721a907 "charset=") at ../string/strstr.c:63 63 while (*haystack && *needle) (gdb) bt #0 __strstr_sse2 (haystack_start=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>, needle_start=needle_start@entry=0x7fab8721a907 "charset=") at ../string/strstr.c:63 #1 0x00007fab870e2a9c in _nl_find_msg (domain_file=domain_file@entry=0x22732e0, domainbinding=domainbinding@entry=0x2272010, msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fff780e8c78) at dcigettext.c:948 #2 0x00007fab870e31ab in __dcigettext (domainname=0x2273260 "existing-domain", msgid1=0x400ba0 "", msgid2=0x0, plural=0, n=0, category=5) at dcigettext.c:630 #3 0x0000000000400821 in positive_gettext_test () at nls-test.c:44 #4 0x0000000000400aae in main (argc=2, argv=0x7fff780e8e38) at nls-test.c:115 (gdb)
Looking at the code in question you see the following:
/* Get the header entry. This is a recursion, but it doesn't reallocate domain->conversions because we pass convert = 0. */ nullentry = _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen); if (nullentry != NULL) { const char *charsetstr; charsetstr = strstr (nullentry, "charset=");
When _nl_find_msg returns an error of -1, that error is incorrectly interpreted as a valid string and passed to strstr which crashes. The code in question can't handle _nl_find_msg returning a value of -1, and this is disconcerting since one expects callers, particularly recursive ones, to handle error conditions correctly.
You fix this problem with the following patch:
diff --git a/intl/dcigettext.c b/intl/dcigettext.c index 110307b..0daa508 100644 --- a/intl/dcigettext.c +++ b/intl/dcigettext.c @@ -941,6 +941,11 @@ _nl_find_msg (domain_file, domainbinding, msgid, convert, lengthp) nullentry = _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen); + /* Resource problems are fatal. If we continue onwards we will + only attempt to calloc a new conv_tab and fail later. */ + if (__builtin_expect (nullentry == (char *) -1, 0)) + return (char *) -1; + if (nullentry != NULL) { const char *charsetstr;
However, now you're wary of the the fact that the error return value of -1 might not be expected by all callers.
You ask yourself if there is any way to simulate a failure at this particular point?
Can you find all the callers of _nl_find_msg and force the result of the call to be -1 and observe how the system fails, fixing any further failures.
There is is a way to inject failures quickly.
1.2. Validation through failure injection
You can inject failures in several ways, including manually under a debugger like gdb. The way which we are going to show here is using systemtap. We demonstrate systemtap because it's easy and because eventually with the dyninst backend you should be able to do this as a fully unprivileged user. (Non-dyninst systemtap has some limited-privilege capabilities too, but require some sysadmin setup.)
You start with the original unpatched build to get comfortable with the process.
To simulate the original failure we need to have malloc return NULL at line 1171.
We use systemtap to see if line 1171 has debug information and which variables are accessible at that point:
stap -L 'process("/home/carlos/build/glibc-fedora/libc.so").statement("_nl_find_msg@*dcigettext.c:*")' | grep 117 process("/home/carlos/build/glibc-fedora/libc.so").statement("_nl_find_msg@/home/carlos/src/glibc-fedora/intl/dcigettext.c:1170") $non_reversible:size_t $freemem:unsigned char* $freemem_size:size_t $inbuf:unsigned char const* $outbuf:unsigned char* $malloc_count:int $lock:__libc_lock_t $convd:struct converted_domain* $domain_file:struct loaded_l10nfile* $domainbinding:struct binding* $msgid:char const* $convert:int $lengthp:size_t* $nstrings:nls_uint32 $resultlen:size_t process("/home/carlos/build/glibc-fedora/libc.so").statement("_nl_find_msg@/home/carlos/src/glibc-fedora/intl/dcigettext.c:1176") $newmem:transmem_block_t* $non_reversible:size_t $freemem:unsigned char* $freemem_size:size_t $inbuf:unsigned char const* $outbuf:unsigned char* $malloc_count:int $lock:__libc_lock_t $convd:struct converted_domain* $domain_file:struct loaded_l10nfile* $domainbinding:struct binding* $msgid:char const* $convert:int $lengthp:size_t* $nstrings:nls_uint32 $resultlen:size_t
We have two lines we could put probe statements on and we decide to use line 1176.
We create the following systemtap script:
probe process("/home/carlos/build/glibc-fedora/libc.so").statement("_nl_find_msg@*dcigettext.c:1176") { printf("_nl_find_msg: newmem is %d.\n", $newmem); printf("_nl_find_msg: Forcing newmem to zero to simulate malloc always failing at this point.\n"); $newmem = 0; }
Next we need an application to test it with, and we use the following somewhat non-trivial program and script to set everything up:
#include <libintl.h> #include <locale.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <error.h> #include <errno.h> const struct { const char *msgid; const char *msgstr; } msgs[] = { #define INPUT(Str) { Str, #define OUTPUT(Str) Str }, #include TESTSTRS_H }; const char *catname[] = { [LC_MESSAGES] = "LC_MESSAGES", [LC_TIME] = "LC_TIME", [LC_NUMERIC] = "LC_NUMERIC" }; #define check_setlocale(cat, name) do { \ if (setlocale (cat, name) == NULL) \ { \ printf ("%s:%u: setlocale (%s, \"%s\"): %m\n", \ __FILE__, __LINE__, #cat, name); \ result = 1; \ } \ } while (0) static int positive_gettext_test (void) { size_t cnt; int result = 0; for (cnt = 0; cnt < sizeof (msgs) / sizeof (msgs[0]); ++cnt) { const char *found = gettext (msgs[cnt].msgid); printf ("found \"%s\"\n", found); if (found == NULL || (msgs[cnt].msgstr[0] != '\0' && strcmp (found, msgs[cnt].msgstr) != 0)) { /* Oops, shouldn't happen. */ printf ("\ gettext (\"%s\") failed, returned \"%s\", expected \"%s\"\n", msgs[cnt].msgid, found, msgs[cnt].msgstr); result = 1; } } return result; } int main (int argc, char *argv[]) { int result = 0; char * dom; /* This is the place where the .mo files are placed. */ if (argc > 1) { bindtextdomain ("existing-domain", argv[1]); bindtextdomain ("existing-time-domain", argv[1]); bindtextdomain ("non-existing-domain", argv[1]); } else { printf ("No arg!\n"); exit (1); } /* The locale the catalog is created for is "existing-category". Now set the various variables in question to this value and run the test. */ setenv ("LANGUAGE", "existing-locale", 1); setenv ("LC_ALL", "non-existing-locale", 1); setenv ("LC_MESSAGES", "non-existing-locale", 1); setenv ("LC_CTYPE", "non-existing-locale", 1); setenv ("LANG", "non-existing-locale", 1); check_setlocale (LC_CTYPE, "en_GB.UTF-8"); check_setlocale (LC_MESSAGES, "en_GB.UTF-8"); unsetenv ("OUTPUT_CHARSET"); /* This is the name of the existing domain with a catalog for the LC_MESSAGES category. */ textdomain ("existing-domain"); puts ("test `gettext' with LANGUAGE set"); if (positive_gettext_test () != 0) { puts ("FAILED"); result = 1; } return result; }
The program loads the glibc messages, and translates from the default to en_GB.UTF-8, checking to make sure the translation matches what's expected.
The script expects a glibc build and source directory. The setup script looks like this:
set -e # common_objpfx is a glibc build directory. common_objpfx=/home/carlos/build/glibc-fedora/ test_program_prefix=/home/carlos/support/bug/ # objpfx is the path to the current directory you're building in objpfx=/home/carlos/support/bug/ # srcpfx is a glibc source directory. srcpfx=/home/carlos/src/glibc-fedora/ LC_ALL=C gawk -f ${srcpfx}intl/po2test.awk ${srcpfx}po/en_GB.po > ${objpfx}msgs.h gcc -Wl,-rpath=${common_objpfx} -Wl,--dynamic-linker=${common_objpfx}elf/ld.so -g3 -O0 -DTESTSTRS_H=\"msgs.h\" -o nls-test nls-test.c # Create the locale directories. rm -rf ${objpfx}localedir rm -rf ${objpfx}domaindir mkdir -p ${objpfx}localedir/existing-locale/LC_MESSAGES for f in ADDRESS COLLATE CTYPE IDENTIFICATION MEASUREMENT MONETARY NAME NUMERIC PAPER TELEPHONE TIME; do cp -f ${common_objpfx}localedata/en_GB.UTF-8/LC_$f \ ${objpfx}localedir/existing-locale done cp -f ${common_objpfx}localedata/en_GB.UTF-8/LC_MESSAGES/SYS_LC_MESSAGES \ ${objpfx}localedir/existing-locale/LC_MESSAGES # Create the domain directories. mkdir -p ${objpfx}domaindir/existing-locale/LC_MESSAGES mkdir -p ${objpfx}domaindir/existing-locale/LC_TIME # Populate them. msgfmt -o ${objpfx}domaindir/existing-locale/LC_MESSAGES/existing-domain.mo \ -f ${srcpfx}po/en_GB.po msgfmt -o ${objpfx}domaindir/existing-locale/LC_TIME/existing-time-domain.mo \ -f ${srcpfx}po/en_GB.po echo "GCONV_PATH=${common_objpfx}iconvdata \ LOCPATH=${objpfx}localedir:${common_objpfx}localedata \ ${test_program_prefix}nls-test ${objpfx}domaindir > nls-test.out" echo sudo stap -g nls-orig.stp --sysroot=/ \ --sysenv=GCONV_PATH=${common_objpfx}iconvdata \ --sysenv=LOCPATH=${objpfx}localedir:${common_objpfx}localedata \ -c \"${test_program_prefix}nls-test ${objpfx}domaindir\"
And we run it like this:
sudo stap -g nls-step0.stp --sysroot=/ --sysenv=GCONV_PATH=/home/carlos/build/glibc-fedora/iconvdata --sysenv=LOCPATH=/home/carlos/support/bug/localedir:/home/carlos/build/glibc-fedora/localedata -c "/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir" > out.txt [sudo] password for carlos: Warning: child process exited with signal 11 (Segmentation fault) WARNING: /usr/bin/staprun exited with status: 1 Pass 5: run failed. [man error::pass5]
To our excitement it fails. This is good since it means we might have reproduced the original problem the customer was seeing under heavy load.
The output text files shows we failed almost immediately with no additional output:
cat out.txt _nl_find_msg: newmem is 8312192. _nl_find_msg: Forcing newmem to zero to simulate malloc always failing at this point.
We look at the generated core file and we see:
[carlos@koi bug]$ gdb -c core.12100 ./nls-test GNU gdb (GDB) Fedora (7.4.50.20120120-54.fc17) Copyright (C) 2012 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /home/carlos/support/bugs/nls-test...done. [New LWP 12100] Core was generated by `/home/carlos/support/bugs/nls-test /home/carlos/support/bugs/domaindir'. Program terminated with signal 11, Segmentation fault. #0 _nl_find_msg (domain_file=domain_file@entry=0x7ea2e0, domainbinding=domainbinding@entry=0x7e9010, msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fff4dceb098) at dcigettext.c:1175 1175 newmem->next = transmem_list; (gdb) bt #0 _nl_find_msg (domain_file=domain_file@entry=0x7ea2e0, domainbinding=domainbinding@entry=0x7e9010, msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fff4dceb098) at dcigettext.c:1175 #1 0x00007f633a1901ab in __dcigettext (domainname=0x7ea260 "existing-domain", msgid1=0x400ba0 "", msgid2=0x0, plural=0, n=0, category=5) at dcigettext.c:630 #2 0x0000000000400821 in positive_gettext_test () at nls-test.c:44 #3 0x0000000000400aae in main (argc=2, argv=0x7fff4dceb258) at nls-test.c:115 (gdb)
Success! This is exactly the same failure as the original failure provided by the user.
What next? We know that fixing this bug causes _nl_find_msg to return -1 which is the cause of the next bug. We need to examine all the callers to see if they take into account a return of -1.
1.3. Injecting failures at all call sites
The next step is to find all the call sites of _nl_find_msg, trigger them, and inject failures at the call site, simulating _nl_find_msg returning -1.
grep -rn '_nl_find_msg' * ... intl/dcigettext.c:630: retval = _nl_find_msg (domain, binding, msgid1, 1, &retlen); intl/dcigettext.c:638: retval = _nl_find_msg (domain->successor[cnt], binding, intl/dcigettext.c:740:_nl_find_msg (domain_file, domainbinding, msgid, convert, lengthp) intl/dcigettext.c:942: _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen); intl/loadmsgcat.c:784: initialization via _nl_find_msg. We have initialized intl/loadmsgcat.c:1259: nullentry = _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen); intl/gettextP.h:184:char *_nl_find_msg PARAMS ((struct loaded_l10nfile *domain_file,
There are only four locations of interest (the rest are headers, comments or the function istself):
intl/dcigettext.c:630: retval = _nl_find_msg (domain, binding, msgid1, 1, &retlen); intl/dcigettext.c:638: retval = _nl_find_msg (domain->successor[cnt], binding, intl/dcigettext.c:942: _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen); intl/loadmsgcat.c:1259: nullentry = _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen);
We use the following systemtap command to list a wider set of line numbers we can use for statement events:
stap -L 'process("/home/carlos/build/glibc-fedora/libc.so").statement("*@*dcigettext.c:*")'
We can trigger on line 630, and 632:
process("/home/carlos/build/glibc-fedora/libc.so").statement("__dcigettext@/home/carlos/src/glibc-fedora/intl/dcigettext.c:630") $domainname:char const* $msgid1:char const* $msgid2:char const* $plural:int $n:long unsigned int $category:int $domain:struct loaded_l10nfile* $categoryvalue:char const* $dirname:char* $xdomainname:char* $saved_errno:int $search:struct known_translation_t $foundp:struct known_translation_t** $localename:char const* $domainname_len:size_t $tree_lock:__libc_rwlock_t process("/home/carlos/build/glibc-fedora/libc.so").statement("__dcigettext@/home/carlos/src/glibc-fedora/intl/dcigettext.c:632") $domainname:char const* $msgid1:char const* $msgid2:char const* $plural:int $n:long unsigned int $category:int $domain:struct loaded_l10nfile* $categoryvalue:char const* $dirname:char* $xdomainname:char* $retval:char* $saved_errno:int $search:struct known_translation_t $foundp:struct known_translation_t** $localename:char const* $domainname_len:size_t $tree_lock:__libc_rwlock_t
We choose line 632 and set retval to -1 to simulate _nl_find_msg failing in __dcigettext.
We use the following script:
probe process("/home/carlos/build/glibc-fedora/libc.so").statement("__dcigettext@*dcigettext.c:632") { printf("_nl_find_msg: retval is %d.\n", $retval); printf("_nl_find_msg: Forcing retval to -1 to simulate _nl_find_msg failing.\n"); $retval = -1; }
Run it like this:
sudo stap -g nls-step1a.stp --sysroot=/ --sysenv=GCONV_PATH=/home/carlos/build/glibc-fedora/iconvdata --sysenv=LOCPATH=/home/carlos/support/bug/localedir:/home/carlos/build/glibc-fedora/localedata -c "/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir" > out.txt Warning: child process exited with status 1 WARNING: /usr/bin/staprun exited with status: 1 Pass 5: run failed. [man error::pass5]
Looking at out.txt we see that there have been no crashes or segfaults but the translations are not happening and the default locale messages are being returned.
A quick look at the source shows that this is expected:
630 retval = _nl_find_msg (domain, binding, msgid1, 1, &retlen); 631 632 if (retval == NULL) 633 { 634 int cnt; 635 636 for (cnt = 0; domain->successor[cnt] != NULL; ++cnt) 637 { 638 retval = _nl_find_msg (domain->successor[cnt], binding, 639 msgid1, 1, &retlen); 640 641 if (retval != NULL) 642 { 643 domain = domain->successor[cnt]; 644 break; 645 } 646 } 647 } 648 649 /* Returning -1 means that some resource problem exists 650 (likely memory) and that the strings could not be 651 converted. Return the original strings. */ 652 if (__builtin_expect (retval == (char *) -1, 0)) 653 goto no_translation;
Injecting a failure for the first _nl_find_msg call here results in a goto no_translation which is a non-fatal problem.
In fact we can see our next _nl_find_msg call site at line 638.
Unfortunately triggering this next call to _nl_find_msg requires a considerably more complex set of locales with succesors. For the purposes of this article triggering this call to _nl_find_msg was sufficiently difficult that we had to give up. This process is a heuristical approach at covering all of the failure cases and sometimes it costs too much to get 100% coverage. For the sake of long-term maintenance we fix the code by having it eagerly skip translating if _nl_find_msg returns -1 e.g.:
@@ -638,6 +638,11 @@ DCIGETTEXT (domainname, msgid1, msgid2, plural, n, category) retval = _nl_find_msg (domain->successor[cnt], binding, msgid1, 1, &retlen); + /* Resource problems are not fatal, instead we return no + translation. */ + if (__builtin_expect (retval == (char *) -1, 0)) + goto no_translation; + if (retval != NULL) { domain = domain->successor[cnt];
This is the same as the code on line 652 but this is better contained and less likely to be moved around leaving a call site of _nl_find_msg without a check for -1 errors.
Moving on to the third call site at line 942 we check for a line to use as a statement:
stap -L 'process("/home/carlos/build/glibc-fedora/libc.so").statement("*@*dcigettext.c:*")' | grep 944 process("/home/carlos/build/glibc-fedora/libc.so").statement("_nl_find_msg@/home/carlos/src/glibc-fedora/intl/dcigettext.c:944") $nullentry:char* $nullentrylen:size_t $encoding:char const* $convd:struct converted_domain* $domain_file:struct loaded_l10nfile* $domainbinding:struct binding* $msgid:char const* $convert:int $lengthp:size_t* $nstrings:nls_uint32 $resultlen:size_t
We mock up a quick failure injection:
probe process("/home/carlos/build/glibc-fedora/libc.so").statement("_nl_find_msg@*dcigettext.c:944") { printf("_nl_find_msg: nullentry is %d.\n", $nullentry); printf("_nl_find_msg: Forcing nullentry to -1 to simulate _nl_find_msg failing.\n"); $nullentry = -1; }
This failure represents the second failure seen by the user.
Run it like this:
sudo stap -g nls-step2a.stp --sysroot=/ --sysenv=GCONV_PATH=/home/carlos/build/glibc-fedora/iconvdata --sysenv=LOCPATH=/home/carlos/support/bug/localedir:/home/carlos/build/glibc-fedora/localedata -c "/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir" > out.txt Warning: child process exited with signal 11 (Segmentation fault) WARNING: /usr/bin/staprun exited with status: 1 Pass 5: run failed. [man error::pass5]
Load the core file:
[carlos@koi bug]$ gdb -c core.18236 ./nls-test GNU gdb (GDB) Fedora (7.4.50.20120120-54.fc17) Copyright (C) 2012 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /home/carlos/support/bug/nls-test...done. [New LWP 18236] Core was generated by `/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir'. Program terminated with signal 11, Segmentation fault. #0 __strstr_sse2 (haystack_start=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>, needle_start=needle_start@entry=0x7f42a2ce5907 "charset=") at ../string/strstr.c:63 63 while (*haystack && *needle) (gdb) bt #0 __strstr_sse2 (haystack_start=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>, needle_start=needle_start@entry=0x7f42a2ce5907 "charset=") at ../string/strstr.c:63 #1 0x00007f42a2bada93 in _nl_find_msg (domain_file=domain_file@entry=0x65f2e0, domainbinding=domainbinding@entry=0x65e010, msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fff8b9c8388) at dcigettext.c:948 #2 0x00007f42a2bae1ab in __dcigettext (domainname=0x65f260 "existing-domain", msgid1=0x400ba0 "", msgid2=0x0, plural=0, n=0, category=5) at dcigettext.c:630 #3 0x0000000000400821 in positive_gettext_test () at nls-test.c:44 #4 0x0000000000400aae in main (argc=2, argv=0x7fff8b9c8548) at nls-test.c:115 (gdb)
Success! This looks exactly like the second failure reported by the user.
We move on to the last call site line 1259 in intl/loadmsgcat.c.
We choose to set a statement on line 1260, the only available line with debug information nearest to 1259.
We use the following script:
probe process("/home/carlos/build/glibc-fedora/libc.so").statement("_nl_load_domain@*loadmsgcat.c:1260") { printf ("_nl_load_domain: nullentry is '%x'\n", $nullentry); printf ("_nl_load_domain: Forcing nullentry to -1 to simulate _nl_find_msg failure.\n"); $nullentry = -1; }
Run it like this:
sudo stap -g nls-step3a.stp --sysroot=/ --sysenv=GCONV_PATH=/home/carlos/build/glibc-fedora/iconvdata --sysenv=LOCPATH=/home/carlos/support/bug/localedir:/home/carlos/build/glibc-fedora/localedata -c "/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir" > out.txt [sudo] password for carlos: Warning: child process exited with signal 11 (Segmentation fault) WARNING: /usr/bin/staprun exited with status: 1 Pass 5: run failed. [man error::pass5]
This is an unexpected and new failure. Something the user hasn't reported yet!
Loading the core file in gdb:
[carlos@koi bug]$ gdb -c core.19031 ./nls-test GNU gdb (GDB) Fedora (7.4.50.20120120-54.fc17) Copyright (C) 2012 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /home/carlos/support/bug/nls-test...done. [New LWP 19031] Core was generated by `/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir'. Program terminated with signal 11, Segmentation fault. #0 __strstr_sse2 (haystack_start=haystack_start@entry=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>, needle_start=needle_start@entry=0x7fae5c5e194e "plural=") at ../string/strstr.c:63 63 while (*haystack && *needle) (gdb) bt #0 __strstr_sse2 (haystack_start=haystack_start@entry=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>, needle_start=needle_start@entry=0x7fae5c5e194e "plural=") at ../string/strstr.c:63 #1 0x00007fae5c4ada96 in __gettext_extract_plural (nullentry=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>, pluralp=0x79f4b8, npluralsp=0x79f4c0) at plural-exp.c:109 #2 0x00007fae5c4ab009 in _nl_load_domain (domain_file=domain_file@entry=0x79f2e0, domainbinding=domainbinding@entry=0x79e010) at loadmsgcat.c:1260 #3 0x00007fae5c4aaafd in _nl_find_domain (dirname=dirname@entry=0x79e040 "/home/carlos/support/bug/domaindir", locale=<optimized out>, locale@entry=0x7fffc8e666f0 "existing-locale", domainname=domainname@entry=0x7fffc8e66710 "LC_MESSAGES/existing-domain.mo", domainbinding=domainbinding@entry=0x79e010) at finddomain.c:147 #4 0x00007fae5c4aa18c in __dcigettext (domainname=0x79f260 "existing-domain", msgid1=0x400ba0 "", msgid2=0x0, plural=0, n=0, category=5) at dcigettext.c:626 #5 0x0000000000400821 in positive_gettext_test () at nls-test.c:44 #6 0x0000000000400aae in main (argc=2, argv=0x7fffc8e66988) at nls-test.c:115 (gdb)
The backtrace shows that if _nl_find_msg fails during _nl_load_domain that the -1 error will be treated as a pluralization string and get passed to strstr which will crash.
The fix for this is to abort the loading of the domain if you can't read the pluralization information:
diff --git a/intl/loadmsgcat.c b/intl/loadmsgcat.c index e4b7b38..ac90ed1 100644 --- a/intl/loadmsgcat.c +++ b/intl/loadmsgcat.c @@ -1237,7 +1237,7 @@ _nl_load_domain (domain_file, domainbinding) default: /* This is an invalid revision. */ invalid: - /* This is an invalid .mo file. */ + /* This is an invalid .mo file or we ran out of resources. */ free (domain->malloced); #ifdef HAVE_MMAP if (use_mmap) @@ -1257,6 +1257,11 @@ _nl_load_domain (domain_file, domainbinding) /* Get the header entry and look for a plural specification. */ nullentry = _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen); + if (__builtin_expect (nullentry == (char *) -1, 0)) + { + __libc_rwlock_fini (domain->conversions_lock); + goto invalid; + } EXTRACT_PLURAL_EXPRESSION (nullentry, &domain->plural, &domain->nplurals); out:
1.4. Conclusion
The final step in the process is to patch the original malloc problem and at a minimum the two call sites (intl/dcigettext.c:942 and intl/loadmsgcat.c:1259) with missing checks for -1 returns from _nl_find_msg. Patching the call site at intl/dcigettext.c:638 is a maintenance issue and should be done to prevent future problems.
At this point you have a patch and you have 4 regression tests that cover all 4 callers of _nl_find_msg.
The regression tests can and should be put back into the glibc testsuite to prevent future changes from regressing the build.
At present running systemtap requires root privileges but with the upcoming dyninst (http://www.dyninst.org/dyninst) backend it should be possible to run all of the aboves tests as a normal user with minimal impact on the running code.
Why didn't we use the systemtap .return probe to capture function returns and adjust the return value of _nl_find_msg? The .return probe isn't supported in the setup I had, otherwise, yes, you can, in a setup that supports it, use the .return probe. I found instrumenting the call sites to be revealing of how the function was used, but in a mass test it would be easier to require .return probe support.