White Box Testing

White box testing verifies the internal implementation details of the software under test. As of 2013-05-06, glibc has very little, if any, white box testing. The general policy has been that we implement standards-conforming interfaces and, as such, we need to test those interfaces. Testing interfaces alone, however, is insufficient to discover all classes of errors.

This article discusses white box testing in glibc using SystemTap to inject failures into core routines.

1. The Problem

A user has reported that they are seeing intermittent crashes in their applications under high memory load.

The crashes all appear to be in glibc.

After some back and forth with the user you are able to get a core file of the crash.

With the application, core file, and debugging symbols you have an excellent start to solving the problem:

[carlos@koi bug]$ gdb -c core.16292 ./nls-test
GNU gdb (GDB) Fedora (7.4.50.20120120-54.fc17)
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/carlos/support/bug/nls-test...done.
[New LWP 16292]
Core was generated by `/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir'.
Program terminated with signal 11, Segmentation fault.
#0  _nl_find_msg (domain_file=domain_file@entry=0x1c292d0, domainbinding=domainbinding@entry=0x1c28010, 
    msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fffea28aef8)
    at dcigettext.c:1175

warning: Source file is more recent than executable.
1175                          newmem->next = transmem_list;
(gdb) bt
#0  _nl_find_msg (domain_file=domain_file@entry=0x1c292d0, domainbinding=domainbinding@entry=0x1c28010, 
    msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fffea28aef8)
    at dcigettext.c:1175
#1  0x00007f380e6391ab in __dcigettext (domainname=0x1c29250 "existing-domain", msgid1=0x400ba0 "", msgid2=0x0, plural=0, n=0, category=5)
    at dcigettext.c:630
#2  0x0000000000400821 in positive_gettext_test () at nls-test.c:44
#3  0x0000000000400aae in main (argc=2, argv=0x7fffea28b0b8) at nls-test.c:115
(gdb) p newmem
$1 = (transmem_block_t *) 0x0
(gdb)

Looking up the source you see:

                      malloc_count = 1;
                      freemem_size = INITIAL_BLOCK_SIZE;
                      newmem = (transmem_block_t *) malloc (freemem_size);
# ifdef _LIBC
                      /* Add the block to the list of blocks we have to free
                         at some point.  */
                      newmem->next = transmem_list;
                      transmem_list = newmem;
# endif
                    }
                  if (__builtin_expect (newmem == NULL, 0))
                    {
                      freemem = NULL;
                      freemem_size = 0;
                      __libc_lock_unlock (lock);
                      return (char *) -1;
                    }

Under high memory load malloc can fail and return NULL, in which case the assignment to newmem->next dereferences a null pointer and segfaults.

The fix is to guard the list insertion with a check for NULL, so that the existing check that follows can return -1.

It looks as if the code had originally expected malloc to fail, but code was later introduced between the malloc call and the NULL check, and that code uses newmem before the check can run, rendering the check moot.

You apply the following fix:

diff --git a/intl/dcigettext.c b/intl/dcigettext.c
index 110307b..18b49b3 100644
--- a/intl/dcigettext.c
+++ b/intl/dcigettext.c
@@ -1170,10 +1170,13 @@ _nl_find_msg (domain_file, domainbinding, msgid, convert, lengthp)
                      freemem_size = INITIAL_BLOCK_SIZE;
                      newmem = (transmem_block_t *) malloc (freemem_size);
 # ifdef _LIBC
-                     /* Add the block to the list of blocks we have to free
-                        at some point.  */
-                     newmem->next = transmem_list;
-                     transmem_list = newmem;
+                     if (newmem != NULL)
+                       {
+                         /* Add the block to the list of blocks we have to free
+                            at some point.  */
+                         newmem->next = transmem_list;
+                         transmem_list = newmem;
+                       }
 # endif
                    }
                  if (__builtin_expect (newmem == NULL, 0))
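
The ordering problem the patch addresses can be reproduced in isolation. The following is a minimal standalone sketch, not glibc code: struct block, block_list, fail_alloc, test_alloc, add_block and check_add_block are all hypothetical names invented for illustration, and the allocator wrapper simply simulates an out-of-memory condition on request.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for glibc's transmem_block_t.  */
struct block
{
  struct block *next;
};

/* Hypothetical stand-in for transmem_list.  */
static struct block *block_list;

/* Hypothetical switch: when nonzero, test_alloc simulates an
   out-of-memory condition by returning NULL.  */
static int fail_alloc;

static void *
test_alloc (size_t size)
{
  return fail_alloc ? NULL : malloc (size);
}

/* Allocate a block and link it into block_list.  Returns 0 on
   success and -1 on allocation failure.  The NULL check comes
   before any dereference of newmem, mirroring the patched code.  */
static int
add_block (void)
{
  struct block *newmem = test_alloc (sizeof *newmem);

  if (newmem == NULL)
    return -1;

  newmem->next = block_list;
  block_list = newmem;
  return 0;
}

/* Self-check: exercise both the failure path and the success path.  */
static void
check_add_block (void)
{
  fail_alloc = 1;
  assert (add_block () == -1);
  assert (block_list == NULL);

  fail_alloc = 0;
  assert (add_block () == 0);
  assert (block_list != NULL && block_list->next == NULL);

  free (block_list);
  block_list = NULL;
}
```

Calling check_add_block () from a small test program exercises the failure path deterministically. Moving the two list-insertion statements above the NULL check in add_block reintroduces the original crash.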

You build a new glibc, distribute it to the user, and hope that the bug has been fixed.

You claim that you can't easily test the fix: under high memory load you have no control over which call to malloc will fail and return NULL.

Time passes and the user returns saying glibc still fails in roughly the same place under high memory load.

The core file reveals the following:

[carlos@koi bug]$ gdb -c core.24446 ./nls-test
GNU gdb (GDB) Fedora (7.4.50.20120120-54.fc17)
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/carlos/support/bug/nls-test...done.
[New LWP 24446]
Core was generated by `/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir'.
Program terminated with signal 11, Segmentation fault.
#0  __strstr_sse2 (haystack_start=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>, 
    needle_start=needle_start@entry=0x7fab8721a907 "charset=") at ../string/strstr.c:63
63        while (*haystack && *needle)
(gdb) bt
#0  __strstr_sse2 (haystack_start=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>, 
    needle_start=needle_start@entry=0x7fab8721a907 "charset=") at ../string/strstr.c:63
#1  0x00007fab870e2a9c in _nl_find_msg (domain_file=domain_file@entry=0x22732e0, domainbinding=domainbinding@entry=0x2272010, 
    msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fff780e8c78)
    at dcigettext.c:948
#2  0x00007fab870e31ab in __dcigettext (domainname=0x2273260 "existing-domain", msgid1=0x400ba0 "", msgid2=0x0, plural=0, n=0, category=5)
    at dcigettext.c:630
#3  0x0000000000400821 in positive_gettext_test () at nls-test.c:44
#4  0x0000000000400aae in main (argc=2, argv=0x7fff780e8e38) at nls-test.c:115
(gdb) 

Looking at the code in question you see the following:

            /* Get the header entry.  This is a recursion, but it doesn't
               reallocate domain->conversions because we pass convert = 0.  */
            nullentry =
              _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen);

            if (nullentry != NULL)
              {
                const char *charsetstr;

                charsetstr = strstr (nullentry, "charset=");

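This caller checks only for NULL and can't handle _nl_find_msg returning -1, so the error value (char *) -1 is passed straight to strstr, which matches the haystack_start=0xffffffffffffffff seen in the backtrace. This is disconcerting, since one expects callers to handle error conditions correctly.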
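
A caller that copes with both return conventions has to test for the -1 sentinel as well as NULL before touching the string. The following is a minimal standalone sketch, not a proposed glibc patch: LOOKUP_ERROR, enum lookup_mode, lookup_header, find_charset and check_find_charset are hypothetical names, the header string is made up, and treating NULL as "no header entry" is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The error sentinel returned on allocation failure in the excerpt
   above; the macro name is hypothetical.  */
#define LOOKUP_ERROR ((char *) -1)

/* Hypothetical knob selecting what the fake lookup returns.  */
enum lookup_mode { LOOKUP_OK, LOOKUP_MISSING, LOOKUP_FAIL };

/* Hypothetical stand-in for the recursive _nl_find_msg call.  */
static const char *
lookup_header (enum lookup_mode mode)
{
  switch (mode)
    {
    case LOOKUP_OK:
      return "Content-Type: text/plain; charset=UTF-8\n";
    case LOOKUP_MISSING:
      return NULL;
    default:
      return LOOKUP_ERROR;
    }
}

/* Return a pointer to the charset name inside the header, or NULL
   if the header is missing, the lookup failed, or no "charset="
   field is present.  Both NULL and the -1 sentinel are rejected
   before the pointer is handed to strstr.  */
static const char *
find_charset (enum lookup_mode mode)
{
  const char *nullentry = lookup_header (mode);
  const char *charsetstr;

  if (nullentry == NULL || nullentry == LOOKUP_ERROR)
    return NULL;

  charsetstr = strstr (nullentry, "charset=");
  if (charsetstr == NULL)
    return NULL;

  return charsetstr + strlen ("charset=");
}

/* Self-check covering all three lookup outcomes.  */
static void
check_find_charset (void)
{
  assert (find_charset (LOOKUP_MISSING) == NULL);
  assert (find_charset (LOOKUP_FAIL) == NULL);
  assert (strncmp (find_charset (LOOKUP_OK), "UTF-8", 5) == 0);
}
```

The sketch collapses both failure cases into NULL for brevity. A real caller may instead need to propagate the -1 so that an out-of-memory condition is not mistaken for a missing header.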