White Box Testing

1. Using SystemTap to validate manual code inspection

White box testing verifies the internal implementation details of the software under test. As of 2013-05-06 glibc has very little if any white box testing. The general policy has been that we implement standards conforming interfaces and that as such we need to test those interfaces. Testing interfaces is insufficient to discover all classes of errors.

This article discusses white-box testing in glibc using systemtap to inject failures into core routines and validate manual code inspection. The user in this article is not completely fictitious since this work was prompted by real users, but the discussed scenario is fictitious.

1.1. The Problem

A user has reported that they are seeing intermittent crashes in their applications under high memory load.

The crashes all appear to be in glibc.

After some back and forth with the user you are able to get a core file of the crash.

With the application, core file, and debugging symbols you have an excellent start to solving the problem:

[carlos@koi bug]$ gdb -c core.16292 ./user-application
GNU gdb (GDB) Fedora (7.4.50.20120120-54.fc17)
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/carlos/support/bug/user-application...done.
[New LWP 16292]
Core was generated by `/foo/user-application'.
Program terminated with signal 11, Segmentation fault.
#0  _nl_find_msg (domain_file=domain_file@entry=0x1c292d0, domainbinding=domainbinding@entry=0x1c28010, 
    msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fffea28aef8)
    at dcigettext.c:1175

1175                          newmem->next = transmem_list;
(gdb) bt
#0  _nl_find_msg (domain_file=domain_file@entry=0x1c292d0, domainbinding=domainbinding@entry=0x1c28010, 
    msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fffea28aef8)
    at dcigettext.c:1175
#1  0x00007f380e6391ab in __dcigettext (domainname=0x1c29250 "existing-domain", msgid1=0x400ba0 "", msgid2=0x0, plural=0, n=0, category=5)
    at dcigettext.c:630
#2  0x0000000000400821 in positive_gettext_test () at nls-test.c:44
#3  0x0000000000400aae in main (argc=2, argv=0x7fffea28b0b8) at nls-test.c:115
(gdb) p newmem
$1 = (transmem_block_t *) 0x0
(gdb)

Looking up the source you see:

1167                   else
1168                     {
1169                       malloc_count = 1;
1170                       freemem_size = INITIAL_BLOCK_SIZE;
1171                       newmem = (transmem_block_t *) malloc (freemem_size);
1172 # ifdef _LIBC
1173                       /* Add the block to the list of blocks we have to free
1174                          at some point.  */
1175                       newmem->next = transmem_list;
1176                       transmem_list = newmem;
1177 # endif
1178                     }
1179                   if (__builtin_expect (newmem == NULL, 0))
1180                     {
1181                       freemem = NULL;
1182                       freemem_size = 0;
1183                       __libc_lock_unlock (lock);
1184                       return (char *) -1;
1185                     }

In a high memory load situation malloc might fail and newmem->next will segfault if newmem is NULL.

The fix is to add a check for NULL and allow the following check to return -1.

It looks as if the the code had expected malloc to fail, but code had been introduced between the malloc and the use that rendered the check moot.

You apply the following fix:

diff --git a/intl/dcigettext.c b/intl/dcigettext.c
index 110307b..18b49b3 100644
--- a/intl/dcigettext.c
+++ b/intl/dcigettext.c
@@ -1170,10 +1170,13 @@ _nl_find_msg (domain_file, domainbinding, msgid, convert, lengthp)
                      freemem_size = INITIAL_BLOCK_SIZE;
                      newmem = (transmem_block_t *) malloc (freemem_size);
 # ifdef _LIBC
-                     /* Add the block to the list of blocks we have to free
-                        at some point.  */
-                     newmem->next = transmem_list;
-                     transmem_list = newmem;
+                     if (newmem != NULL)
+                       {
+                         /* Add the block to the list of blocks we have to free
+                            at some point.  */
+                         newmem->next = transmem_list;
+                         transmem_list = newmem;
+                       }
 # endif
                    }
                  if (__builtin_expect (newmem == NULL, 0))

You build a new glibc, distribute it to the user, and hope that the bug has been fixed.

You claim that you can't easily test the fix. In a high memory load situation you have no control over where malloc will fail and return NULL.

Time passes and the user returns saying glibc still fails in roughly the same place under high memory load.

The core file reveals the following:

[carlos@koi bug]$ gdb -c core.24446 ./user-application
GNU gdb (GDB) Fedora (7.4.50.20120120-54.fc17)
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/carlos/support/bug/foo/user-application...done.
[New LWP 24446]
Core was generated by `/foo/user-application'.
Program terminated with signal 11, Segmentation fault.
#0  __strstr_sse2 (haystack_start=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>, 
    needle_start=needle_start@entry=0x7fab8721a907 "charset=") at ../string/strstr.c:63
63        while (*haystack && *needle)
(gdb) bt
#0  __strstr_sse2 (haystack_start=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>, 
    needle_start=needle_start@entry=0x7fab8721a907 "charset=") at ../string/strstr.c:63
#1  0x00007fab870e2a9c in _nl_find_msg (domain_file=domain_file@entry=0x22732e0, domainbinding=domainbinding@entry=0x2272010, 
    msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fff780e8c78)
    at dcigettext.c:948
#2  0x00007fab870e31ab in __dcigettext (domainname=0x2273260 "existing-domain", msgid1=0x400ba0 "", msgid2=0x0, plural=0, n=0, category=5)
    at dcigettext.c:630
#3  0x0000000000400821 in positive_gettext_test () at nls-test.c:44
#4  0x0000000000400aae in main (argc=2, argv=0x7fff780e8e38) at nls-test.c:115
(gdb) 

Looking at the code in question you see the following:

            /* Get the header entry.  This is a recursion, but it doesn't
               reallocate domain->conversions because we pass convert = 0.  */
            nullentry =
              _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen);

            if (nullentry != NULL)
              {
                const char *charsetstr;

                charsetstr = strstr (nullentry, "charset=");

When _nl_find_msg returns an error of -1, that error is incorrectly interpreted as a valid string and passed to strstr which crashes. The code in question can't handle _nl_find_msg returning a value of -1, and this is disconcerting since one expects callers, particularly recursive ones, to handle error conditions correctly.

You fix this problem with the following patch:

diff --git a/intl/dcigettext.c b/intl/dcigettext.c
index 110307b..0daa508 100644
--- a/intl/dcigettext.c
+++ b/intl/dcigettext.c
@@ -941,6 +941,11 @@ _nl_find_msg (domain_file, domainbinding, msgid, convert, lengthp)
            nullentry =
              _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen);
 
+           /* Resource problems are fatal.  If we continue onwards we will
+              only attempt to calloc a new conv_tab and fail later.  */
+           if (__builtin_expect (nullentry == (char *) -1, 0))
+             return (char *) -1;
+
            if (nullentry != NULL)
              {
                const char *charsetstr;

However, now you're wary of the the fact that the error return value of -1 might not be expected by all callers.

You ask yourself if there is any way to simulate a failure at this particular point?

Can you find all the callers of _nl_find_msg and force the result of the call to be -1 and observe how the system fails, fixing any further failures.

There is is a way to inject failures quickly.

1.2. Validation through failure injection

You can inject failures in several ways, including manually under a debugger like gdb. The way which we are going to show here is using systemtap. We demonstrate systemtap because it's easy and because eventually with the dyninst backend you should be able to do this as a fully unprivileged user. (Non-dyninst systemtap has some limited-privilege capabilities too, but require some sysadmin setup.)

You start with the original unpatched build to get comfortable with the process.

To simulate the original failure we need to have malloc return NULL at line 1171.

We use systemtap to see if line 1171 has debug information and which variables are accessible at that point:

stap -L 'process("/home/carlos/build/glibc-fedora/libc.so").statement("_nl_find_msg@*dcigettext.c:*")' | grep 117
process("/home/carlos/build/glibc-fedora/libc.so").statement("_nl_find_msg@/home/carlos/src/glibc-fedora/intl/dcigettext.c:1170") $non_reversible:size_t $freemem:unsigned char* $freemem_size:size_t $inbuf:unsigned char const* $outbuf:unsigned char* $malloc_count:int $lock:__libc_lock_t $convd:struct converted_domain* $domain_file:struct loaded_l10nfile* $domainbinding:struct binding* $msgid:char const* $convert:int $lengthp:size_t* $nstrings:nls_uint32 $resultlen:size_t
process("/home/carlos/build/glibc-fedora/libc.so").statement("_nl_find_msg@/home/carlos/src/glibc-fedora/intl/dcigettext.c:1176") $newmem:transmem_block_t* $non_reversible:size_t $freemem:unsigned char* $freemem_size:size_t $inbuf:unsigned char const* $outbuf:unsigned char* $malloc_count:int $lock:__libc_lock_t $convd:struct converted_domain* $domain_file:struct loaded_l10nfile* $domainbinding:struct binding* $msgid:char const* $convert:int $lengthp:size_t* $nstrings:nls_uint32 $resultlen:size_t

We have two lines we could put probe statements on and we decide to use line 1176.

We create the following systemtap script:

probe process("/home/carlos/build/glibc-fedora/libc.so").statement("_nl_find_msg@*dcigettext.c:1176")
{
        printf("_nl_find_msg: newmem is %d.\n", $newmem);
        printf("_nl_find_msg: Forcing newmem to zero to simulate malloc always failing at this point.\n");
        $newmem = 0;
}

Next we need an application to test it with, and we use the following somewhat non-trivial program and script to set everything up:

#include <libintl.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <error.h>
#include <errno.h>

const struct
{
  const char *msgid;
  const char *msgstr;
} msgs[] =
{
#define INPUT(Str) { Str,
#define OUTPUT(Str) Str },
#include TESTSTRS_H
};

const char *catname[] =
{
  [LC_MESSAGES] = "LC_MESSAGES",
  [LC_TIME] = "LC_TIME",
  [LC_NUMERIC] = "LC_NUMERIC"
};

#define check_setlocale(cat, name) do {                                       \
    if (setlocale (cat, name) == NULL)                                        \
      {                                                                       \
        printf ("%s:%u: setlocale (%s, \"%s\"): %m\n",                        \
                __FILE__, __LINE__, #cat, name);                              \
        result = 1;                                                           \
      }                                                                       \
  } while (0)

static int
positive_gettext_test (void)
{
  size_t cnt;
  int result = 0;

  for (cnt = 0; cnt < sizeof (msgs) / sizeof (msgs[0]); ++cnt)
    {
      const char *found = gettext (msgs[cnt].msgid);
      printf ("found \"%s\"\n", found);

      if (found == NULL
          || (msgs[cnt].msgstr[0] != '\0'
              && strcmp (found, msgs[cnt].msgstr) != 0))
        {
          /* Oops, shouldn't happen.  */
          printf ("\
  gettext (\"%s\") failed, returned \"%s\", expected \"%s\"\n",
                  msgs[cnt].msgid, found, msgs[cnt].msgstr);
          result = 1;
        }
    }

  return result;
}

int
main (int argc, char *argv[])
{
  int result = 0;
  char * dom;

  /* This is the place where the .mo files are placed.  */
  if (argc > 1)
    {
      bindtextdomain ("existing-domain", argv[1]);
      bindtextdomain ("existing-time-domain", argv[1]);
      bindtextdomain ("non-existing-domain", argv[1]);
    }
  else
    {
      printf ("No arg!\n");
      exit (1);
    }
  /* The locale the catalog is created for is "existing-category".  Now
     set the various variables in question to this value and run the
     test.  */
  setenv ("LANGUAGE", "existing-locale", 1);
  setenv ("LC_ALL", "non-existing-locale", 1);
  setenv ("LC_MESSAGES", "non-existing-locale", 1);
  setenv ("LC_CTYPE", "non-existing-locale", 1);
  setenv ("LANG", "non-existing-locale", 1);
  check_setlocale (LC_CTYPE, "en_GB.UTF-8");
  check_setlocale (LC_MESSAGES, "en_GB.UTF-8");
  unsetenv ("OUTPUT_CHARSET");

  /* This is the name of the existing domain with a catalog for the
     LC_MESSAGES category.  */
  textdomain ("existing-domain");
  puts ("test `gettext' with LANGUAGE set");
  if (positive_gettext_test () != 0)
    {
      puts ("FAILED");
      result = 1;
    }

  return result;
}

The program loads the glibc messages, and translates from the default to en_GB.UTF-8, checking to make sure the translation matches what's expected.

The script expects a glibc build and source directory. The setup script looks like this:

set -e

# common_objpfx is a glibc build directory.
common_objpfx=/home/carlos/build/glibc-fedora/
test_program_prefix=/home/carlos/support/bug/
# objpfx is the path to the current directory you're building in
objpfx=/home/carlos/support/bug/
# srcpfx is a glibc source directory.
srcpfx=/home/carlos/src/glibc-fedora/

LC_ALL=C gawk -f ${srcpfx}intl/po2test.awk ${srcpfx}po/en_GB.po > ${objpfx}msgs.h

gcc -Wl,-rpath=${common_objpfx} -Wl,--dynamic-linker=${common_objpfx}elf/ld.so -g3 -O0 -DTESTSTRS_H=\"msgs.h\" -o nls-test nls-test.c

# Create the locale directories.
rm -rf ${objpfx}localedir
rm -rf ${objpfx}domaindir

mkdir -p ${objpfx}localedir/existing-locale/LC_MESSAGES
for f in ADDRESS COLLATE CTYPE IDENTIFICATION MEASUREMENT MONETARY NAME NUMERIC PAPER TELEPHONE TIME; do
  cp -f ${common_objpfx}localedata/en_GB.UTF-8/LC_$f \
        ${objpfx}localedir/existing-locale
done
cp -f ${common_objpfx}localedata/en_GB.UTF-8/LC_MESSAGES/SYS_LC_MESSAGES \
      ${objpfx}localedir/existing-locale/LC_MESSAGES

# Create the domain directories.
mkdir -p ${objpfx}domaindir/existing-locale/LC_MESSAGES
mkdir -p ${objpfx}domaindir/existing-locale/LC_TIME
# Populate them.
msgfmt -o ${objpfx}domaindir/existing-locale/LC_MESSAGES/existing-domain.mo \
       -f ${srcpfx}po/en_GB.po
msgfmt -o ${objpfx}domaindir/existing-locale/LC_TIME/existing-time-domain.mo \
       -f ${srcpfx}po/en_GB.po

echo "GCONV_PATH=${common_objpfx}iconvdata \
LOCPATH=${objpfx}localedir:${common_objpfx}localedata \
${test_program_prefix}nls-test ${objpfx}domaindir > nls-test.out"

echo sudo stap -g nls-orig.stp --sysroot=/ \
--sysenv=GCONV_PATH=${common_objpfx}iconvdata \
--sysenv=LOCPATH=${objpfx}localedir:${common_objpfx}localedata \
-c \"${test_program_prefix}nls-test ${objpfx}domaindir\"

And we run it like this:

sudo stap -g nls-step0.stp --sysroot=/ --sysenv=GCONV_PATH=/home/carlos/build/glibc-fedora/iconvdata --sysenv=LOCPATH=/home/carlos/support/bug/localedir:/home/carlos/build/glibc-fedora/localedata -c "/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir" > out.txt
[sudo] password for carlos: 
Warning: child process exited with signal 11 (Segmentation fault)
WARNING: /usr/bin/staprun exited with status: 1
Pass 5: run failed.  [man error::pass5]

To our excitement it fails. This is good since it means we might have reproduced the original problem the customer was seeing under heavy load.

The output text files shows we failed almost immediately with no additional output:

cat out.txt 
_nl_find_msg: newmem is 8312192.
_nl_find_msg: Forcing newmem to zero to simulate malloc always failing at this point.

We look at the generated core file and we see:

[carlos@koi bug]$ gdb -c core.12100 ./nls-test
GNU gdb (GDB) Fedora (7.4.50.20120120-54.fc17)
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/carlos/support/bugs/nls-test...done.
[New LWP 12100]
Core was generated by `/home/carlos/support/bugs/nls-test /home/carlos/support/bugs/domaindir'.
Program terminated with signal 11, Segmentation fault.
#0  _nl_find_msg (domain_file=domain_file@entry=0x7ea2e0, domainbinding=domainbinding@entry=0x7e9010, 
    msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fff4dceb098)
    at dcigettext.c:1175
1175                          newmem->next = transmem_list;
(gdb) bt
#0  _nl_find_msg (domain_file=domain_file@entry=0x7ea2e0, domainbinding=domainbinding@entry=0x7e9010, 
    msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fff4dceb098)
    at dcigettext.c:1175
#1  0x00007f633a1901ab in __dcigettext (domainname=0x7ea260 "existing-domain", msgid1=0x400ba0 "", msgid2=0x0, plural=0, n=0, category=5)
    at dcigettext.c:630
#2  0x0000000000400821 in positive_gettext_test () at nls-test.c:44
#3  0x0000000000400aae in main (argc=2, argv=0x7fff4dceb258) at nls-test.c:115
(gdb) 

Success! This is exactly the same failure as the original failure provided by the user.

What next? We know that fixing this bug causes _nl_find_msg to return -1 which is the cause of the next bug. We need to examine all the callers to see if they take into account a return of -1.

1.3. Injecting failures at all call sites

The next step is to find all the call sites of _nl_find_msg, trigger them, and inject failures at the call site, simulating _nl_find_msg returning -1.

grep -rn '_nl_find_msg' *
...
intl/dcigettext.c:630:    retval = _nl_find_msg (domain, binding, msgid1, 1, &retlen);
intl/dcigettext.c:638:            retval = _nl_find_msg (domain->successor[cnt], binding,
intl/dcigettext.c:740:_nl_find_msg (domain_file, domainbinding, msgid, convert, lengthp)
intl/dcigettext.c:942:        _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen);
intl/loadmsgcat.c:784:     initialization via _nl_find_msg.  We have initialized
intl/loadmsgcat.c:1259:  nullentry = _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen);
intl/gettextP.h:184:char *_nl_find_msg PARAMS ((struct loaded_l10nfile *domain_file,

There are only four locations of interest (the rest are headers, comments or the function istself):

intl/dcigettext.c:630:    retval = _nl_find_msg (domain, binding, msgid1, 1, &retlen);
intl/dcigettext.c:638:            retval = _nl_find_msg (domain->successor[cnt], binding,
intl/dcigettext.c:942:        _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen);
intl/loadmsgcat.c:1259:  nullentry = _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen);

We use the following systemtap command to list a wider set of line numbers we can use for statement events:

stap -L 'process("/home/carlos/build/glibc-fedora/libc.so").statement("*@*dcigettext.c:*")'

We can trigger on line 630, and 632:

process("/home/carlos/build/glibc-fedora/libc.so").statement("__dcigettext@/home/carlos/src/glibc-fedora/intl/dcigettext.c:630") $domainname:char const* $msgid1:char const* $msgid2:char const* $plural:int $n:long unsigned int $category:int $domain:struct loaded_l10nfile* $categoryvalue:char const* $dirname:char* $xdomainname:char* $saved_errno:int $search:struct known_translation_t $foundp:struct known_translation_t** $localename:char const* $domainname_len:size_t $tree_lock:__libc_rwlock_t
process("/home/carlos/build/glibc-fedora/libc.so").statement("__dcigettext@/home/carlos/src/glibc-fedora/intl/dcigettext.c:632") $domainname:char const* $msgid1:char const* $msgid2:char const* $plural:int $n:long unsigned int $category:int $domain:struct loaded_l10nfile* $categoryvalue:char const* $dirname:char* $xdomainname:char* $retval:char* $saved_errno:int $search:struct known_translation_t $foundp:struct known_translation_t** $localename:char const* $domainname_len:size_t $tree_lock:__libc_rwlock_t

We choose line 632 and set retval to -1 to simulate _nl_find_msg failing in __dcigettext.

We use the following script:

probe process("/home/carlos/build/glibc-fedora/libc.so").statement("__dcigettext@*dcigettext.c:632")
{
        printf("_nl_find_msg: retval is %d.\n", $retval);
        printf("_nl_find_msg: Forcing retval to -1 to simulate _nl_find_msg failing.\n");
        $retval = -1;
}

Run it like this:

sudo stap -g nls-step1a.stp --sysroot=/ --sysenv=GCONV_PATH=/home/carlos/build/glibc-fedora/iconvdata --sysenv=LOCPATH=/home/carlos/support/bug/localedir:/home/carlos/build/glibc-fedora/localedata -c "/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir" > out.txt
Warning: child process exited with status 1
WARNING: /usr/bin/staprun exited with status: 1
Pass 5: run failed.  [man error::pass5]

Looking at out.txt we see that there have been no crashes or segfaults but the translations are not happening and the default locale messages are being returned.

A quick look at the source shows that this is expected:

 630           retval = _nl_find_msg (domain, binding, msgid1, 1, &retlen);
 631 
 632           if (retval == NULL)
 633             {
 634               int cnt;
 635 
 636               for (cnt = 0; domain->successor[cnt] != NULL; ++cnt)
 637                 {
 638                   retval = _nl_find_msg (domain->successor[cnt], binding,
 639                                          msgid1, 1, &retlen);
 640 
 641                   if (retval != NULL)
 642                     {
 643                       domain = domain->successor[cnt];
 644                       break;
 645                     }
 646                 }
 647             }
 648 
 649           /* Returning -1 means that some resource problem exists
 650              (likely memory) and that the strings could not be
 651              converted.  Return the original strings.  */
 652           if (__builtin_expect (retval == (char *) -1, 0))
 653             goto no_translation;

Injecting a failure for the first _nl_find_msg call here results in a goto no_translation which is a non-fatal problem.

In fact we can see our next _nl_find_msg call site at line 638.

Unfortunately triggering this next call to _nl_find_msg requires a considerably more complex set of locales with succesors. For the purposes of this article triggering this call to _nl_find_msg was sufficiently difficult that we had to give up. This process is a heuristical approach at covering all of the failure cases and sometimes it costs too much to get 100% coverage. For the sake of long-term maintenance we fix the code by having it eagerly skip translating if _nl_find_msg returns -1 e.g.:

@@ -638,6 +638,11 @@ DCIGETTEXT (domainname, msgid1, msgid2, plural, n, category)
                  retval = _nl_find_msg (domain->successor[cnt], binding,
                                         msgid1, 1, &retlen);
 
+                 /* Resource problems are not fatal, instead we return no
+                    translation.  */
+                 if (__builtin_expect (retval == (char *) -1, 0))
+                   goto no_translation;
+
                  if (retval != NULL)
                    {
                      domain = domain->successor[cnt];

This is the same as the code on line 652 but this is better contained and less likely to be moved around leaving a call site of _nl_find_msg without a check for -1 errors.

Moving on to the third call site at line 942 we check for a line to use as a statement:

stap -L 'process("/home/carlos/build/glibc-fedora/libc.so").statement("*@*dcigettext.c:*")' | grep 944
process("/home/carlos/build/glibc-fedora/libc.so").statement("_nl_find_msg@/home/carlos/src/glibc-fedora/intl/dcigettext.c:944") $nullentry:char* $nullentrylen:size_t $encoding:char const* $convd:struct converted_domain* $domain_file:struct loaded_l10nfile* $domainbinding:struct binding* $msgid:char const* $convert:int $lengthp:size_t* $nstrings:nls_uint32 $resultlen:size_t

We mock up a quick failure injection:

probe process("/home/carlos/build/glibc-fedora/libc.so").statement("_nl_find_msg@*dcigettext.c:944")
{
        printf("_nl_find_msg: nullentry is %d.\n", $nullentry);
        printf("_nl_find_msg: Forcing nullentry to -1 to simulate _nl_find_msg failing.\n");
        $nullentry = -1;
}

This failure represents the second failure seen by the user.

Run it like this:

sudo stap -g nls-step2a.stp --sysroot=/ --sysenv=GCONV_PATH=/home/carlos/build/glibc-fedora/iconvdata --sysenv=LOCPATH=/home/carlos/support/bug/localedir:/home/carlos/build/glibc-fedora/localedata -c "/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir" > out.txt
Warning: child process exited with signal 11 (Segmentation fault)
WARNING: /usr/bin/staprun exited with status: 1
Pass 5: run failed.  [man error::pass5]

Load the core file:

[carlos@koi bug]$ gdb -c core.18236 ./nls-test
GNU gdb (GDB) Fedora (7.4.50.20120120-54.fc17)
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/carlos/support/bug/nls-test...done.
[New LWP 18236]
Core was generated by `/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir'.
Program terminated with signal 11, Segmentation fault.
#0  __strstr_sse2 (haystack_start=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>, 
    needle_start=needle_start@entry=0x7f42a2ce5907 "charset=") at ../string/strstr.c:63
63        while (*haystack && *needle)
(gdb) bt
#0  __strstr_sse2 (haystack_start=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>, 
    needle_start=needle_start@entry=0x7f42a2ce5907 "charset=") at ../string/strstr.c:63
#1  0x00007f42a2bada93 in _nl_find_msg (domain_file=domain_file@entry=0x65f2e0, domainbinding=domainbinding@entry=0x65e010, 
    msgid=0x112 <Address 0x112 out of bounds>, msgid@entry=0x400ba0 "", convert=convert@entry=1, lengthp=lengthp@entry=0x7fff8b9c8388)
    at dcigettext.c:948
#2  0x00007f42a2bae1ab in __dcigettext (domainname=0x65f260 "existing-domain", msgid1=0x400ba0 "", msgid2=0x0, plural=0, n=0, category=5)
    at dcigettext.c:630
#3  0x0000000000400821 in positive_gettext_test () at nls-test.c:44
#4  0x0000000000400aae in main (argc=2, argv=0x7fff8b9c8548) at nls-test.c:115
(gdb) 

Success! This looks exactly like the second failure reported by the user.

We move on to the last call site line 1259 in intl/loadmsgcat.c.

We choose to set a statement on line 1260, the only available line with debug information nearest to 1259.

We use the following script:

probe process("/home/carlos/build/glibc-fedora/libc.so").statement("_nl_load_domain@*loadmsgcat.c:1260")
{
        printf ("_nl_load_domain: nullentry is '%x'\n", $nullentry);
        printf ("_nl_load_domain: Forcing nullentry to -1 to simulate _nl_find_msg failure.\n");
        $nullentry = -1;
}

Run it like this:

sudo stap -g nls-step3a.stp --sysroot=/ --sysenv=GCONV_PATH=/home/carlos/build/glibc-fedora/iconvdata --sysenv=LOCPATH=/home/carlos/support/bug/localedir:/home/carlos/build/glibc-fedora/localedata -c "/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir" > out.txt
[sudo] password for carlos: 
Warning: child process exited with signal 11 (Segmentation fault)
WARNING: /usr/bin/staprun exited with status: 1
Pass 5: run failed.  [man error::pass5]

This is an unexpected and new failure. Something the user hasn't reported yet!

Loading the core file in gdb:

[carlos@koi bug]$ gdb -c core.19031 ./nls-test
GNU gdb (GDB) Fedora (7.4.50.20120120-54.fc17)
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/carlos/support/bug/nls-test...done.
[New LWP 19031]
Core was generated by `/home/carlos/support/bug/nls-test /home/carlos/support/bug/domaindir'.
Program terminated with signal 11, Segmentation fault.
#0  __strstr_sse2 (haystack_start=haystack_start@entry=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>, 
    needle_start=needle_start@entry=0x7fae5c5e194e "plural=") at ../string/strstr.c:63
63        while (*haystack && *needle)
(gdb) bt
#0  __strstr_sse2 (haystack_start=haystack_start@entry=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>, 
    needle_start=needle_start@entry=0x7fae5c5e194e "plural=") at ../string/strstr.c:63
#1  0x00007fae5c4ada96 in __gettext_extract_plural (nullentry=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>, 
    pluralp=0x79f4b8, npluralsp=0x79f4c0) at plural-exp.c:109
#2  0x00007fae5c4ab009 in _nl_load_domain (domain_file=domain_file@entry=0x79f2e0, domainbinding=domainbinding@entry=0x79e010)
    at loadmsgcat.c:1260
#3  0x00007fae5c4aaafd in _nl_find_domain (dirname=dirname@entry=0x79e040 "/home/carlos/support/bug/domaindir", locale=<optimized out>, 
    locale@entry=0x7fffc8e666f0 "existing-locale", domainname=domainname@entry=0x7fffc8e66710 "LC_MESSAGES/existing-domain.mo", 
    domainbinding=domainbinding@entry=0x79e010) at finddomain.c:147
#4  0x00007fae5c4aa18c in __dcigettext (domainname=0x79f260 "existing-domain", msgid1=0x400ba0 "", msgid2=0x0, plural=0, n=0, category=5)
    at dcigettext.c:626
#5  0x0000000000400821 in positive_gettext_test () at nls-test.c:44
#6  0x0000000000400aae in main (argc=2, argv=0x7fffc8e66988) at nls-test.c:115
(gdb) 

The backtrace shows that if _nl_find_msg fails during _nl_load_domain that the -1 error will be treated as a pluralization string and get passed to strstr which will crash.

The fix for this is to abort the loading of the domain if you can't read the pluralization information:

diff --git a/intl/loadmsgcat.c b/intl/loadmsgcat.c
index e4b7b38..ac90ed1 100644
--- a/intl/loadmsgcat.c
+++ b/intl/loadmsgcat.c
@@ -1237,7 +1237,7 @@ _nl_load_domain (domain_file, domainbinding)
     default:
       /* This is an invalid revision.  */
     invalid:
-      /* This is an invalid .mo file.  */
+      /* This is an invalid .mo file or we ran out of resources.  */
       free (domain->malloced);
 #ifdef HAVE_MMAP
       if (use_mmap)
@@ -1257,6 +1257,11 @@ _nl_load_domain (domain_file, domainbinding)
 
   /* Get the header entry and look for a plural specification.  */
   nullentry = _nl_find_msg (domain_file, domainbinding, "", 0, &nullentrylen);
+  if (__builtin_expect (nullentry == (char *) -1, 0))
+    {
+      __libc_rwlock_fini (domain->conversions_lock);
+      goto invalid;
+    }
   EXTRACT_PLURAL_EXPRESSION (nullentry, &domain->plural, &domain->nplurals);
 
  out:

1.4. Conclusion

The final step in the process is to patch the original malloc problem and at a minimum the two call sites (intl/dcigettext.c:942 and intl/loadmsgcat.c:1259) with missing checks for -1 returns from _nl_find_msg. Patching the call site at intl/dcigettext.c:638 is a maintenance issue and should be done to prevent future problems.

At this point you have a patch and you have 4 regression tests that cover all 4 callers of _nl_find_msg.

The regression tests can and should be put back into the glibc testsuite to prevent future changes from regressing the build.

At present running systemtap requires root privileges but with the upcoming dyninst (http://www.dyninst.org/dyninst) backend it should be possible to run all of the aboves tests as a normal user with minimal impact on the running code.

Why didn't we use the systemtap .return probe to capture function returns and adjust the return value of _nl_find_msg? The .return probe isn't supported in the setup I had, otherwise, yes, you can, in a setup that supports it, use the .return probe. I found instrumenting the call sites to be revealing of how the function was used, but in a mass test it would be easier to require .return probe support.