Sources Bugzilla – Bug 1541
Poor threaded application performance when using malloc
Last modified: 2010-06-01 03:30:37 UTC
Threaded applications that use malloc to allocate large buffers or work areas suffer significant performance degradation whenever the allocation size exceeds the MMAP_THRESHOLD. When a malloc request exceeds the MMAP_THRESHOLD, the storage is allocated via anonymous mmap instead of from brk storage. The mmap syscall only reserves the region; no pages are allocated until first touch, so there is a page fault for each page as it is touched for the first time. The kernel has a semaphore around the "allocate zeroed page" operation, which serializes this operation for threaded applications. These anonymous mmap regions are not reused by malloc, so the "fault/zero page" bottleneck occurs for every large allocation. This can be seen as a kernel problem, but it is also a glibc problem because for some applications the default MMAP_THRESHOLD (normally 128K) is simply too small. Changing the MMAP_THRESHOLD to a value large enough to handle most allocations gives a significant speedup. For 64-bit platforms it could be wise to bump the default thresholds up to a more reasonable value (say 16M). Otherwise we need a simple and effective way to change the thresholds from outside the application. The mallopt API can be used to change the default MMAP_THRESHOLD, but many customers are reluctant to change their source "just for Linux". An environment-variable-based mechanism may be more acceptable.
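For reference, the mallopt route mentioned above might look roughly like the following. This is a minimal sketch assuming glibc's <malloc.h>; the helper name raise_mmap_threshold is made up for illustration and is not part of the attached test case. Note that glibc caps the value it accepts, so a large setting can be rejected on some platforms.

```c
#include <malloc.h>   /* mallopt, M_MMAP_THRESHOLD (glibc-specific) */
#include <stddef.h>

/* Hypothetical helper: ask glibc to serve requests smaller than `bytes`
 * from the heap rather than from a fresh anonymous mmap.
 * Returns 1 on success and 0 on failure, which is mallopt's own convention. */
int raise_mmap_threshold(size_t bytes)
{
    return mallopt(M_MMAP_THRESHOLD, (int)bytes);
}
```

An application would call something like raise_mmap_threshold(16 * 1024 * 1024) once at startup, before any large allocation is made. The mallopt(3) man page for current glibc also documents a MALLOC_MMAP_THRESHOLD_ environment variable that sets the same parameter without a source change.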
Created attachment 724: Threaded malloc test with MMAP_THRESHOLD options
To build use:
    gcc -g -O2 malloc-test.c -lpthread -o malloc-test
or
    gcc -g -O2 -DMAP_THRESHOLD=16777216 malloc-test.c -lpthread -o malloc-test_16M
To run the testcase single threaded:
    ./malloc-test 128000 10000
    ...
    Average : 0.718383 seconds for 10000 requests of 128000 bytes, 491MB concurrent.
To run with 16 threads:
    ./malloc-test 128000 10000 16
    ...
    Average : 1.280583 seconds for 10000 requests of 128000 bytes, 490MB concurrent.
These run quickly because 128000 is less than the default MMAP_THRESHOLD. Now try with a malloc size larger than the MMAP_THRESHOLD:
    ./malloc-test 1280000 10000 16
    ...
    Average : 227.594933 seconds for 10000 requests of 421006 bytes, 488MB concurrent.
Notice the huge jump from 1.28 to 227 seconds while the total concurrent storage remained constant at around 490MB! Now try a version of malloc-test that changes the MMAP_THRESHOLD to 16M:
    ./malloc-test_16M 1280000 10000 16
    ...
    Average : 7.473701 seconds for 10000 requests of 421006 bytes, 488MB concurrent.
The time comes down to a more reasonable 7.47 seconds. Finally, to verify that a larger MMAP_THRESHOLD does not negatively impact smaller allocations, try:
    ./malloc-test_16M 128000 10000 16
    ...
    1.066022 seconds for 10000 requests of 128000 bytes, 490MB concurrent.
Which in this case is faster than with the smaller default MMAP_THRESHOLD. All runs were on my dual 2GHz G5 (PPC64/970) system, but I see similar results on my dual Athlon system. So I suspect this is a common problem across SMP platforms.
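The attached malloc-test.c is not reproduced in this report. The access pattern being measured can be sketched as below, under stated assumptions: each thread repeatedly allocates a buffer, touches one byte per page, and frees it. The names job, worker and run_threads are hypothetical, and unlike the attachment this sketch does not keep roughly 490MB allocated concurrently or report timings.

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical per-thread parameters; not taken from the attachment. */
struct job { size_t size; int iterations; int failed; };

static void *worker(void *arg)
{
    struct job *j = arg;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    for (int i = 0; i < j->iterations; i++) {
        char *raw = malloc(j->size);
        if (raw == NULL) { j->failed = 1; break; }
        /* Touch one byte per page so every page is really faulted in. */
        volatile char *buf = raw;
        for (size_t off = 0; off < j->size; off += page)
            buf[off] = 1;
        free(raw);   /* above the mmap threshold this unmaps the region */
    }
    return NULL;
}

/* Runs `nthreads` workers in parallel; returns 0 on success, -1 on error. */
int run_threads(int nthreads, size_t size, int iterations)
{
    pthread_t *tids = calloc((size_t)nthreads, sizeof *tids);
    struct job *jobs = calloc((size_t)nthreads, sizeof *jobs);
    int started = 0, failed = 0;

    if (tids == NULL || jobs == NULL) { free(tids); free(jobs); return -1; }
    for (int t = 0; t < nthreads; t++) {
        jobs[t].size = size;
        jobs[t].iterations = iterations;
        if (pthread_create(&tids[t], NULL, worker, &jobs[t]) != 0) {
            failed = 1;
            break;
        }
        started++;
    }
    for (int t = 0; t < started; t++) {
        pthread_join(tids[t], NULL);
        failed |= jobs[t].failed;
    }
    free(tids);
    free(jobs);
    return failed ? -1 : 0;
}
```

Calling run_threads(16, 1280000, 10000) from main under a timer should exercise the same slow path as the third run above, while a size of 128000 stays below the default threshold.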
Have you done any profiling to substantiate your analysis of why it is slower? I see nothing in the kernel to suggest that brk preallocates zero-fill pages. Your test program preallocates them in its early iterations and then reuses those pages by freeing and allocating repeatedly, I would suspect. Profiling would show the time spent in mmap/munmap syscalls vs spent faulting in pages, for example.
Created attachment 733: Oprofile of malloc-test 128000 1000 8 on Dual PPC64 G5
This profile shows that when the MMAP_THRESHOLD is exceeded we see a big increase in kernel time. The kernel time is associated with locking, scheduling, and page faults. I don't have access to an i386 SMP box at the moment, but I suspect the profile there will be similar.
Created attachment 734: Profile from similar run but with MMAP_THRESHOLD increased to 16M
Increasing the MMAP_THRESHOLD improved performance, so I had to increase the number of iterations to get the test to run long enough to profile. The profile shows most of the time (92%) in the test application (run_test) and a few percent in the malloc runtime. The first kernel contribution starts at 0.2% for schedule.
Yes, arenas allocated in brk storage page fault once but are efficiently reused. The problem with large allocations is that the storage allocated with mmap is unmapped by free(). So each new allocation that exceeds the MMAP_THRESHOLD has to be faulted in again. The mmap syscall does not do much work; most of the effort of allocating the page and zeroing it out is deferred until the page is actually touched for the first time. This is reflected in the profiles attached above.
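The deferred work can be observed from user space without a profiler. The following is a rough sketch, assuming Linux with getrusage and MAP_ANONYMOUS available; the function names are made up for illustration. It compares the minor page fault count across the mmap call itself and across the first touch of each page.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std modes */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Minor page faults charged to this process so far, or -1 on error. */
static long minor_faults(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt;
}

/* Maps `npages` anonymous pages and touches each one once.
 * Returns the number of minor faults taken while touching, or -1 on error.
 * If `during_map` is non-NULL it receives the faults taken by mmap itself. */
long faults_on_first_touch(size_t npages, long *during_map)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * page;

    long before_map = minor_faults();
    char *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return -1;
    long after_map = minor_faults();

    /* First touch: this is where the kernel allocates and zeroes pages. */
    volatile char *p = raw;
    for (size_t off = 0; off < len; off += page)
        p[off] = 1;
    long after_touch = minor_faults();

    munmap(raw, len);
    if (before_map < 0 || after_map < 0 || after_touch < 0)
        return -1;
    if (during_map != NULL)
        *during_map = after_map - before_map;
    return after_touch - after_map;
}
```

On a typical system the count reported for the touch loop is expected to be close to the number of pages, while the mmap call itself adds few or none, although the exact figures depend on the kernel and its configuration.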
This should have been dealt with in a malloc patch which went in some time ago. Verify and close or elaborate.
The adaptive mmap threshold should have fixed this; no response, so it's probably safe to assume so.