Differences between revisions 3 and 4
Revision 3 as of 2013-05-06 04:15:32
Size: 7721
Comment:
Revision 4 as of 2013-05-06 05:27:28
Size: 11955
Comment:
Deletions are marked like this. Additions are marked like this.
Line 156: Line 156:
The code in question can't handle `_nl_find_msg` returning a value of `-1`, and this is disconcerting since one expects callers to handle error conditions correctly. When `_nl_find_msg` returns an error of `-1`, that error is incorrectly interpreted as a valid string and passed to `strstr` which crashes. The code in question can't handle `_nl_find_msg` returning a value of `-1`, and this is disconcerting since one expects callers, particularly recursive ones, to handle error conditions correctly.

You fix this problem with the following patch:
{{{
diff --git a/intl/dcigettext.c b/intl/dcigettext.c
index 110307b..0daa508 100644
--- a/intl/dcigettext.c
+++ b/intl/dcigettext.c
@@ -941,6 +941,11 @@ _nl_find_msg (domain_file, domainbinding, msgid, convert, lengthp)
            nullentry =
              _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen);
 
+ /* Resource problems are fatal. If we continue onwards we will
+ only attempt to calloc a new conv_tab and fail later. */
+ if (__builtin_expect (nullentry == (char *) -1, 0))
+ return (char *) -1;
+
            if (nullentry != NULL)
              {
                const char *charsetstr;
}}}

However, now you're wary of the the fact that the error return value of `-1` might not be expected by all callers.

You ask yourself if there is any way to simulate a failure at this particular point?

Can you find all the callers of `_nl_find_msg` and force the result of the call to be `-1` and observe how the system fails, fixing any further failures.

There is is a way to inject failures quickly.

== Injecting failures ==

You can inject failures in several ways, including manually under a debugger like gdb. The way which we are going to show here is using systemtap. We demonstrate systemtap because it's easy and because eventually with the dyninst backend you should be able to do this as a non-root user.

You start with the original unpatched build to get comfortable with the process.

To simulate the original failure we need to have `malloc` return `NULL` at line `1192`.

We use systemtap to see if line `1192` has debug information and which variables are accessible at that point:
{{{

}}}


TODO:
- Finish write up :-)

{{{
[carlos@koi bug]$ gdb -c core.30459 ./nls-test
GNU gdb (GDB) Fedora (7.4.50.20120120-54.fc17)
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/carlos/support/bug/nls-test...done.
[New LWP 30459]
Core was generated by `/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir'.
Program terminated with signal 11, Segmentation fault.
#0 __strstr_sse2 (haystack_start=haystack_start@entry=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>,
    needle_start=needle_start@entry=0x7fac3c77b96e "plural=") at ../string/strstr.c:63
63 while (*haystack && *needle)
(gdb) bt
#0 __strstr_sse2 (haystack_start=haystack_start@entry=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>,
    needle_start=needle_start@entry=0x7fac3c77b96e "plural=") at ../string/strstr.c:63
#1 0x00007fac3c647aa6 in __gettext_extract_plural (nullentry=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>,
    pluralp=0xc524b8, npluralsp=0xc524c0) at plural-exp.c:109
#2 0x00007fac3c645019 in _nl_load_domain (domain_file=domain_file@entry=0xc522e0, domainbinding=domainbinding@entry=0xc51010)
    at loadmsgcat.c:1260
#3 0x00007fac3c644b0d in _nl_find_domain (dirname=dirname@entry=0xc51040 "/home/carlos/support/BZ957089/domaindir", locale=<optimized out>,
    locale@entry=0x7fff80ed5d30 "existing-locale", domainname=domainname@entry=0x7fff80ed5d50 "LC_MESSAGES/existing-domain.mo",
    domainbinding=domainbinding@entry=0xc51010) at finddomain.c:147
#4 0x00007fac3c64419c in __dcigettext (domainname=0xc52260 "existing-domain", msgid1=0x400ba0 "", msgid2=0x0, plural=0, n=0, category=5)
    at dcigettext.c:626
#5 0x0000000000400821 in positive_gettext_test () at nls-test.c:44
#6 0x0000000000400aae in main (argc=2, argv=0x7fff80ed5fc8) at nls-test.c:115
(gdb)
}}}

White Box Testing

White box testing verifies the internal implementation details of the software under test. As of 2013-05-06 glibc has very little if any white box testing. The general policy has been that we implement standards conforming interfaces and that as such we need to test those interfaces. Testing interfaces is insufficient to discover all classes of errors.

This article discusses white-box testing in glibc using systemtap to inject failures into core routines.

1. The Problem

A user has reported that they are seeing intermittent crashes in their applications under high memory load.

The crashes all appear to be in glibc.

After some back and forth with the user you are able to get a core file of the crash.

With the application, core file, and debugging symbols you have an excellent start to solving the problem:

[carlos@koi bug]$ gdb -c core.16292 ./nls-test
GNU gdb (GDB) Fedora (7.4.50.20120120-54.fc17)
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/carlos/support/bug/nls-test...done.
[New LWP 16292]
Core was generated by `/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir'.
Program terminated with signal 11, Segmentation fault.
#0  _nl_find_msg (domain_file=domain_file@entry=0x1c292d0, domainbinding=domainbinding@entry=0x1c28010, 
    msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fffea28aef8)
    at dcigettext.c:1175

warning: Source file is more recent than executable.
1175                          newmem->next = transmem_list;
(gdb) bt
#0  _nl_find_msg (domain_file=domain_file@entry=0x1c292d0, domainbinding=domainbinding@entry=0x1c28010, 
    msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fffea28aef8)
    at dcigettext.c:1175
#1  0x00007f380e6391ab in __dcigettext (domainname=0x1c29250 "existing-domain", msgid1=0x400ba0 "", msgid2=0x0, plural=0, n=0, category=5)
    at dcigettext.c:630
#2  0x0000000000400821 in positive_gettext_test () at nls-test.c:44
#3  0x0000000000400aae in main (argc=2, argv=0x7fffea28b0b8) at nls-test.c:115
(gdb) p newmem
$1 = (transmem_block_t *) 0x0
(gdb)

Looking up the source you see:

                      malloc_count = 1;
                      freemem_size = INITIAL_BLOCK_SIZE;
                      newmem = (transmem_block_t *) malloc (freemem_size);
# ifdef _LIBC
                      /* Add the block to the list of blocks we have to free
                         at some point.  */
                      newmem->next = transmem_list;
                      transmem_list = newmem;
# endif
                    }
                  if (__builtin_expect (newmem == NULL, 0))
                    {
                      freemem = NULL;
                      freemem_size = 0;
                      __libc_lock_unlock (lock);
                      return (char *) -1;
                    }

In a high memory load situation malloc might fail and newmem->next will segfault if newmem is NULL.

The fix is to add a check for NULL and allow the following check to return -1.

It looks as if the the code had expected malloc to fail, but code had been introduced between the malloc and the use that rendered the check moot.

You apply the following fix:

diff --git a/intl/dcigettext.c b/intl/dcigettext.c
index 110307b..18b49b3 100644
--- a/intl/dcigettext.c
+++ b/intl/dcigettext.c
@@ -1170,10 +1170,13 @@ _nl_find_msg (domain_file, domainbinding, msgid, convert, lengthp)
                      freemem_size = INITIAL_BLOCK_SIZE;
                      newmem = (transmem_block_t *) malloc (freemem_size);
 # ifdef _LIBC
-                     /* Add the block to the list of blocks we have to free
-                        at some point.  */
-                     newmem->next = transmem_list;
-                     transmem_list = newmem;
+                     if (newmem != NULL)
+                       {
+                         /* Add the block to the list of blocks we have to free
+                            at some point.  */
+                         newmem->next = transmem_list;
+                         transmem_list = newmem;
+                       }
 # endif
                    }
                  if (__builtin_expect (newmem == NULL, 0))

You build a new glibc, distribute it to the user, and hope that the bug has been fixed.

You claim that you can't easily test the fix. In a high memory load situation you have no control over where malloc will fail and return NULL.

Time passes and the user returns saying glibc still fails in roughly the same place under high memory load.

The core file reveals the following:

[carlos@koi bug]$ gdb -c core.24446 ./nls-test
GNU gdb (GDB) Fedora (7.4.50.20120120-54.fc17)
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/carlos/support/bug/nls-test...done.
[New LWP 24446]
Core was generated by `/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir'.
Program terminated with signal 11, Segmentation fault.
#0  __strstr_sse2 (haystack_start=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>, 
    needle_start=needle_start@entry=0x7fab8721a907 "charset=") at ../string/strstr.c:63
63        while (*haystack && *needle)
(gdb) bt
#0  __strstr_sse2 (haystack_start=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>, 
    needle_start=needle_start@entry=0x7fab8721a907 "charset=") at ../string/strstr.c:63
#1  0x00007fab870e2a9c in _nl_find_msg (domain_file=domain_file@entry=0x22732e0, domainbinding=domainbinding@entry=0x2272010, 
    msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fff780e8c78)
    at dcigettext.c:948
#2  0x00007fab870e31ab in __dcigettext (domainname=0x2273260 "existing-domain", msgid1=0x400ba0 "", msgid2=0x0, plural=0, n=0, category=5)
    at dcigettext.c:630
#3  0x0000000000400821 in positive_gettext_test () at nls-test.c:44
#4  0x0000000000400aae in main (argc=2, argv=0x7fff780e8e38) at nls-test.c:115
(gdb) 

Looking at the code in question you see the following:

            /* Get the header entry.  This is a recursion, but it doesn't
               reallocate domain->conversions because we pass convert = 0.  */
            nullentry =
              _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen);

            if (nullentry != NULL)
              {
                const char *charsetstr;

                charsetstr = strstr (nullentry, "charset=");

When _nl_find_msg returns an error of -1, that error is incorrectly interpreted as a valid string and passed to strstr which crashes. The code in question can't handle _nl_find_msg returning a value of -1, and this is disconcerting since one expects callers, particularly recursive ones, to handle error conditions correctly.

You fix this problem with the following patch:

diff --git a/intl/dcigettext.c b/intl/dcigettext.c
index 110307b..0daa508 100644
--- a/intl/dcigettext.c
+++ b/intl/dcigettext.c
@@ -941,6 +941,11 @@ _nl_find_msg (domain_file, domainbinding, msgid, convert, lengthp)
            nullentry =
              _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen);
 
+           /* Resource problems are fatal.  If we continue onwards we will
+              only attempt to calloc a new conv_tab and fail later.  */
+           if (__builtin_expect (nullentry == (char *) -1, 0))
+             return (char *) -1;
+
            if (nullentry != NULL)
              {
                const char *charsetstr;

However, now you're wary of the the fact that the error return value of -1 might not be expected by all callers.

You ask yourself if there is any way to simulate a failure at this particular point?

Can you find all the callers of _nl_find_msg and force the result of the call to be -1 and observe how the system fails, fixing any further failures.

There is is a way to inject failures quickly.

2. Injecting failures

You can inject failures in several ways, including manually under a debugger like gdb. The way which we are going to show here is using systemtap. We demonstrate systemtap because it's easy and because eventually with the dyninst backend you should be able to do this as a non-root user.

You start with the original unpatched build to get comfortable with the process.

To simulate the original failure we need to have malloc return NULL at line 1192.

We use systemtap to see if line 1192 has debug information and which variables are accessible at that point:

TODO: - Finish write up :-)

[carlos@koi bug]$ gdb -c core.30459 ./nls-test
GNU gdb (GDB) Fedora (7.4.50.20120120-54.fc17)
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/carlos/support/bug/nls-test...done.
[New LWP 30459]
Core was generated by `/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir'.
Program terminated with signal 11, Segmentation fault.
#0  __strstr_sse2 (haystack_start=haystack_start@entry=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>, 
    needle_start=needle_start@entry=0x7fac3c77b96e "plural=") at ../string/strstr.c:63
63        while (*haystack && *needle)
(gdb) bt
#0  __strstr_sse2 (haystack_start=haystack_start@entry=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>, 
    needle_start=needle_start@entry=0x7fac3c77b96e "plural=") at ../string/strstr.c:63
#1  0x00007fac3c647aa6 in __gettext_extract_plural (nullentry=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>, 
    pluralp=0xc524b8, npluralsp=0xc524c0) at plural-exp.c:109
#2  0x00007fac3c645019 in _nl_load_domain (domain_file=domain_file@entry=0xc522e0, domainbinding=domainbinding@entry=0xc51010)
    at loadmsgcat.c:1260
#3  0x00007fac3c644b0d in _nl_find_domain (dirname=dirname@entry=0xc51040 "/home/carlos/support/BZ957089/domaindir", locale=<optimized out>, 
    locale@entry=0x7fff80ed5d30 "existing-locale", domainname=domainname@entry=0x7fff80ed5d50 "LC_MESSAGES/existing-domain.mo", 
    domainbinding=domainbinding@entry=0xc51010) at finddomain.c:147
#4  0x00007fac3c64419c in __dcigettext (domainname=0xc52260 "existing-domain", msgid1=0x400ba0 "", msgid2=0x0, plural=0, n=0, category=5)
    at dcigettext.c:626
#5  0x0000000000400821 in positive_gettext_test () at nls-test.c:44
#6  0x0000000000400aae in main (argc=2, argv=0x7fff80ed5fc8) at nls-test.c:115
(gdb) 

None: Testing/WhiteBox (last edited 2013-05-06 15:26:36 by CarlosODonell)