Possible malloc/mmap bug?

Chuck Hines hines@cert.org
Fri Nov 18 00:17:00 GMT 2011


Howdy all,

I've run across an interesting situation that I now think may be a bug
in the current glibc malloc implementation, but I still need some
additional info/verification to really be sure.

Some back-story might be helpful to frame the problem, what I did to
investigate, and how I reached the conclusions that led to this email.

[Please forgive the length and the occasional 'stream of consciousness'
style, but I definitely wanted to make clear that I tried many different
things, and did try to track the problem down, before jumping to the
conclusion that there might be a bug in glibc malloc.]

Someone else on my team wrote some code that processes significant
amounts of data in RAM (on systems w/ 48GB of RAM) and that appears to
occasionally "run out of memory" (malloc returns NULL and errno ==
ENOMEM), even though external evidence seems to indicate otherwise (for
instance: the amount of data actually processed before the failure was
not enough to exhaust memory, and resource monitors watched during the
run don't show memory filling up).  He couldn't figure out what was
going on, so I offered to try and figure it out.

BTW, the code (in a nutshell) reads data from files (32-bit integers)
into a malloc'ed array, sorts each one, eliminates duplicate values,
realloc's the array down to the new (usually significantly smaller)
size, and then collects info about which files each unique integer came
from into a combined list.

Visual inspection of the (rather simple) code showed that it looked
correct (doesn't it always?), and I was able to reproduce the failure
easily enough using the same input dataset he saw it on, so I next tried
dmalloc and valgrind (memcheck, massif, and ptrcheck) to see if anything
fishy but non-obvious might be happening.  Not only did these report no
errors, they made the problem go away entirely.  Oh what fun!  I added
lots of debugging statements and checks throughout the code, which
didn't help either...and setting MALLOC_CHECK_ to 3 didn't reveal
anything.

One particularly interesting detail was that it was dying at a number of
files processed suspiciously close to 65535, and numbers like that
always stand out, so I started looking at process limit values, but
those didn't seem to be the issue.

Then, while reading (and rereading) the various malloc man pages and the
results Google kindly provided, I noticed that malloc was likely using
mmap under the covers for these allocations (which were all around 256KB
or so, well above the stated 128KB threshold at which mmap usage kicks
in) and that by default it would do that at most 65536 times
(M_MMAP_MAX).  Ah ha!  So I tried calling mallopt(M_MMAP_MAX, 0) to make
it not do that...and sure enough the problem went away!

At this point I began to suspect a bug in malloc where it fails to fall
back to using sbrk() after that count is exceeded, so I crafted a simple
test case (attached), grabbed the glibc source so I could build and test
various versions myself, and tried it against the stock glibc on various
Linux variants with varying values for M_MMAP_MAX.  The results I found
were...interesting.

First, the OSes (all 64-bit) I tried with the stock versions, and the results:

- Ubuntu 10.04, system glibc 2.11.1: error (first system we saw it on)
- Ubuntu 11.04, system glibc 2.13: error (my development box)
  - same box, but hand built static glibc 2.14.1: no error
  - same box, but hand built static glibc 2.13: no error
- RHEL 5.7, system glibc 2.5: no error

Based on that I began to suspect a patch that Ubuntu applied and/or the
compile options they chose.  To test that angle further, I decided to
try some newer RedHat derivatives, on the assumption that they would be
using glibc versions closer to the Ubuntu ones; I set these up in
virtual machines:

- RHEL 6.0, system glibc 2.12: error
- Fedora 16, system glibc 2.14.90: error

So, based on these tests, I now believe the bug might be in the stock
glibc code but only manifesting under whatever common build options both
Ubuntu and RedHat use.

At this point I figured I had done enough legwork and it was time to ask
the experts: could you try out my test case in your various environments
and see whether you think there is a bug in the current glibc malloc
implementation under certain compilation circumstances?  Perhaps one of
the developers reading this knows the vagaries of the Ubuntu and/or
RedHat distro build setups well enough to know where to look.

And I figured that this list might be a better starting point than the
libc-alpha list, since it's bordering on a bug report. :)

A note about the test code: it takes an optional parameter to set
M_MMAP_MAX, so you can see how various values affect the problem.  A
value of 0 makes the problem go away (and, interestingly, allocations
seem much faster that way, while the mmap method showed a
greater-than-linear increase in the time taken for each set of 10000
allocations that I report on).  Some values have a working fall-back
behaviour (everything I tried from 1000-65514 worked fine), while others
don't (every value of 65515 and up that I tested died at num+1).

Also, a note about how I built the static glibc.  Based on the various
errors I hit and the info I found while searching, I ended up with:

  env CFLAGS='-O2 -U_FORTIFY_SOURCE -fno-stack-protector' ../configure \
    --prefix=`/bin/pwd`/inst --without-selinux --without-gd \
    --enable-add-ons --disable-profile --enable-kernel=2.6.0 \
    --disable-shared

which worked well enough for me to test with (there were still more
build errors, but I told make to ignore them, and the resulting libc.a
had enough in it to run the test code).

Hm.  Looking at those parameters just made me decide to go back and
power through again, this time with selinux and gd enabled...and the
test code still doesn't exhibit the error that way (2.14.1 & 2.13), so I
guess those weren't the difference. :)

Uh...okay, on a whim I just tried linking my test code statically on my
system (Ubuntu 11.04) against the static version of the system glibc
2.13 (not my hand-compiled one)...and the test code (with no parameter
given) passed?!?!

Wow, now I really don't know what to think.  My hand-build issues are
preventing me from making a shared lib of the stock glibc code to try
that route right now, and at the end of a really long day I'm no longer
capable of running more tests or reasoning through why there might be a
difference between the static and dynamic versions.

Hopefully I'm not crazy, and one of you (developers) will be able to use
my test code to recreate the problem I observed and will know how to
deal with it.  Or maybe I'll have more brainpower available in the
morning to continue investigating and report back here with more
info...

Thanks,
Chuck




-------------- next part --------------
A non-text attachment was scrubbed...
Name: malloc_mmap_bug_test.c
Type: text/x-csrc
Size: 4025 bytes
Desc: not available
URL: <http://sourceware.org/pipermail/libc-help/attachments/20111118/fe33e8b3/attachment.bin>

