Bug 1541

Summary: Poor threaded application performance when using malloc
Product: glibc            Reporter: Steven Munroe <sjmunroe>
Component: libc           Assignee: Ulrich Drepper <drepper.fsp>
Status: RESOLVED FIXED
Severity: normal          CC: fweimer, glibc-bugs, roland
Priority: P2              Flags: fweimer: security-
Version: unspecified
Target Milestone: ---
Attachments:
   Threaded malloc test with MMAP_THRESHOLD options
   Oprofile of malloc-test 128000 1000 8 on Dual PPC64 G5
   profile from similar run but with MMAP_THRESHOLD increased to 16M

Description Steven Munroe 2005-10-25 14:11:50 UTC
Threaded applications that use malloc to allocate large buffers/work areas will
suffer significant performance degradation whenever the allocation size exceeds
the MMAP_THRESHOLD.

When a malloc allocation size exceeds the MMAP_THRESHOLD, the storage is
allocated via anonymous mmap instead of from brk storage. The mmap syscall only
allocates the region; no pages are allocated until first touch, so there is a
page fault for each page as it is touched for the first time. The kernel has a
semaphore around the "allocate zeroed page" operation, which serializes this
operation for threaded applications. These anonymous mmap regions are not
reused by malloc, so the "fault/zero page" bottleneck occurs for every large
allocation.
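
A quick way to see the non-reuse effect described above is to count minor page
faults across repeated allocate/touch/free cycles. The following is a minimal
sketch of my own (not the attached test); the 100K and 1M sizes are arbitrary
examples on either side of the default 128K threshold.

/* Sketch (my own illustration): count minor page faults across repeated
   malloc/touch/free cycles.  Below the MMAP_THRESHOLD the arena pages are
   reused, so only the early cycles fault; above it every cycle maps fresh
   anonymous pages and faults them all in again. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

static long minflt(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

static long fault_cycles(size_t size, int n)
{
    long start = minflt();
    int i;
    for (i = 0; i < n; i++) {
        char *p = malloc(size);
        memset(p, 0xA5, size);    /* touch every page */
        free(p);
    }
    return minflt() - start;
}

int main(void)
{
    printf("100K x 100 cycles: %ld minor faults\n",
           fault_cycles(100 * 1024, 100));
    printf("  1M x 100 cycles: %ld minor faults\n",
           fault_cycles(1024 * 1024, 100));
    return 0;
}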

This can be seen as a kernel problem, but it is also a glibc problem because
for some applications the default MMAP_THRESHOLD (normally 128K) is simply too
small. Changing the MMAP_THRESHOLD to a value large enough to handle most
allocations gives a significant speedup.

For 64-bit platforms it could be wise to bump the default threshold up to a
more reasonable value (say 16M). Or we need a simple and effective way to
change the threshold from outside the application. The mallopt API can be used
to change the default MMAP_THRESHOLD, but many customers are reluctant to
change their source "just for Linux". An environment-variable-based mechanism
may be more acceptable.
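
For reference, the mallopt route mentioned above is a one-line change inside
the application. A minimal sketch, using the documented M_MMAP_THRESHOLD
parameter and 16M only because it is the example value in this report:

#include <malloc.h>

int main(void)
{
    /* Raise the mmap threshold so large buffers come from the heap and
       their pages can be reused across malloc/free cycles. */
    mallopt(M_MMAP_THRESHOLD, 16 * 1024 * 1024);

    /* ... rest of the application ... */
    return 0;
}
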
Comment 1 Steven Munroe 2005-10-25 15:32:15 UTC
Created attachment 724 [details]
Threaded malloc test with MMAP_THRESHOLD options

To build use:
   gcc -g -O2 malloc-test.c -lpthread -o malloc-test
or 
   gcc -g -O2 -DMAP_THRESHOLD=16777216 malloc-test.c -lpthread -o malloc-test_16M
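
The attachment itself is not inlined here. For orientation, a hypothetical
sketch of the kind of test described (each thread keeps a ring of live buffers
so the concurrent storage stays roughly constant, and touches every page of
each new allocation) might look like the following; the ring size and the
MAP_THRESHOLD handling are my assumptions, not necessarily what the attached
malloc-test.c actually does.

/* Hypothetical sketch of a threaded malloc stress test along the lines of
   the attachment; the real malloc-test.c may differ in detail. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>

#define RING 512                  /* live buffers per thread (assumed) */

static size_t alloc_size;
static long   requests;

static void *run_test(void *arg)
{
    void *ring[RING];
    long i;

    memset(ring, 0, sizeof ring);
    for (i = 0; i < requests; i++) {
        int slot = i % RING;
        free(ring[slot]);                     /* munmap if above threshold */
        ring[slot] = malloc(alloc_size);
        memset(ring[slot], 0xA5, alloc_size); /* touch every page */
    }
    for (i = 0; i < RING; i++)
        free(ring[i]);
    return NULL;
}

int main(int argc, char **argv)
{
    int nthreads, t;
    pthread_t *tid;

    alloc_size = atol(argv[1]);
    requests   = atol(argv[2]);
    nthreads   = argc > 3 ? atoi(argv[3]) : 1;

#ifdef MAP_THRESHOLD
    /* Built with -DMAP_THRESHOLD=16777216: raise the mmap threshold. */
    mallopt(M_MMAP_THRESHOLD, MAP_THRESHOLD);
#endif

    tid = malloc(nthreads * sizeof *tid);
    for (t = 0; t < nthreads; t++)
        pthread_create(&tid[t], NULL, run_test, NULL);
    for (t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
    free(tid);
    return 0;
}
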
Comment 2 Steven Munroe 2005-10-25 15:53:38 UTC
To run the testcase single-threaded:
   ./malloc-test 128000 10000
   ...
   Average : 0.718383 seconds for 10000 requests of 128000 bytes, 491MB concurrent.

To run with 16 threads:
   ./malloc-test 128000 10000 16
   ...
   Average : 1.280583 seconds for 10000 requests of 128000 bytes, 490MB concurrent.


These run quickly because 128000 is less than the mmap threshold. Now try with
a malloc size larger than the MMAP_THRESHOLD:
   ./malloc-test 1280000 10000 16
   ... 
   Average : 227.594933 seconds for 10000 requests of 421006 bytes, 488MB concurrent.

Notice the huge jump from 1.28 to 227 seconds while the total concurrent
storage remained constant at around 490MB!

Now try a version of malloc-test that changes the MMAP_THRESHOLD to 16M:

   ./malloc-test_16M 1280000 10000 16
   ...
   Average : 7.473701 seconds for 10000 requests of 421006 bytes, 488MB concurrent.

The time comes down to a more reasonable 7.47 seconds. Finally, to verify that
the larger MMAP_THRESHOLD does not negatively impact smaller allocations, try:

   ./malloc-test_16M 128000 10000 16
   ...
   1.066022 seconds for 10000 requests of 128000 bytes, 490MB concurrent.

In this case that is even faster than with the smaller default MMAP_THRESHOLD.

All runs were on my dual 2GHz G5 (PPC64/970) system, but I see similar results
on my dual Athlon system. So I suspect this is a common problem across SMP
platforms.
Comment 3 Roland McGrath 2005-11-01 08:05:10 UTC
Have you done any profiling to substantiate your analysis of why it is slower?
I see nothing in the kernel to suggest that brk preallocates zero-fill pages.
Your test program preallocates them in its early iterations and then reuses
those pages by freeing and allocating repeatedly, I would suspect.  Profiling
would show the time spent in mmap/munmap syscalls versus the time spent
faulting in pages, for example.
Comment 4 Steven Munroe 2005-11-01 17:12:18 UTC
Created attachment 733 [details]
Oprofile of malloc-test 128000 1000 8 on Dual PPC64 G5

This profile shows that when the MMAP_THRESHOLD is exceeded we see a big
increase in kernel time. The kernel time is associated with locking,
scheduling, and page faults.

I don't have access to an i386 SMP box at the moment, but I suspect the
profile there would be similar.
Comment 5 Steven Munroe 2005-11-01 17:19:23 UTC
Created attachment 734 [details]
profile from similar run but with MMAP_THRESHOLD increased to 16M

Increasing the MMAP_THRESHOLD improved performance, so I had to increase the
number of iterations to get the test to run long enough to profile. The
profile shows most of the time (92%) in the test application (run_test) and a
few percent in the malloc runtime. The first kernel contribution starts at
0.2% for schedule.
Comment 6 Steven Munroe 2005-11-01 17:27:00 UTC
Yes, arenas allocated from brk storage page fault once but are efficiently
reused. The problem with large allocations is that the storage allocated with
mmap is unmapped by free(). So each new allocation that exceeds the
MMAP_THRESHOLD has to be faulted in again.

The mmap syscall does not do much work. Most of the effort of allocating the
pages and zeroing them out is deferred until a page is actually touched for
the first time. This is reflected in the profiles attached above.
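
To make the "deferred until first touch" point concrete, here is a small
sketch of my own (not from the attachments) that times the allocation call
separately from the first touch of the pages, for a size well above the
default threshold:

/* Sketch: for an allocation well above MMAP_THRESHOLD, the malloc() call
   itself (an anonymous mmap) returns almost immediately; the zero-fill work
   shows up only when the pages are first written. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

static double seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    size_t sz = 16UL * 1024 * 1024;   /* well above the 128K default */
    double t0, t1, t2;
    char *p;

    t0 = seconds();
    p  = malloc(sz);                  /* mmap: reserves address space only */
    t1 = seconds();
    memset(p, 0xA5, sz);              /* first touch: page faults + zeroing */
    t2 = seconds();
    free(p);                          /* munmap: the pages are not reused */

    printf("malloc: %.6f s   first touch: %.6f s\n", t1 - t0, t2 - t1);
    return 0;
}
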
Comment 7 Ulrich Drepper 2007-02-18 04:45:32 UTC
This should have been dealt with in a malloc patch which went in some time
ago. Verify and close or elaborate.
Comment 8 Petr Baudis 2010-06-01 03:30:37 UTC
The adaptive mmap threshold should have fixed this; no response, so it's
probably safe to assume so.