This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [patch] Make the mmap/brk threshold in malloc dynamic to improve performance
- From: Arjan van de Ven <arjan at linux dot intel dot com>
- To: Ulrich Drepper <drepper at redhat dot com>
- Cc: libc-alpha at sources dot redhat dot com, val_henson at linux dot intel dot com
- Date: Thu, 02 Mar 2006 22:55:07 +0100
- Subject: Re: [patch] Make the mmap/brk threshold in malloc dynamic to improve performance
- References: <1141327016.3206.112.camel@laptopd505.fenrus.org> <440754B8.1020102@redhat.com>
Ulrich Drepper wrote:
> Arjan van de Ven wrote:
>> The heuristic isn't perfect but in practice it seems to hold up pretty
>> well, and solves the performance issue we've been investigating nicely.
>
> Where is the data and test specification to back this up?
Here is some supporting performance data; I've not done benchmarks on GUI
apps since those are really unreliable in general.
Before situation (on a 2-way machine):
# time ./benchmark
real 0m30.462s
user 0m39.406s
sys 0m19.613s
notice the high system time
or with the application from bug 1541:
# ./malloc-test32 4096000 2500 4
Average : 29.745164 seconds for 2500 requests of 4096000 bytes,
134218723MB concurrent.
the top of the kernel profile with "benchmark" looks like this (this is
after several hours of running the benchmark in a loop):
130375 free_hot_cold_page 521.5000
132808 __pagevec_lru_add_active 567.5556
133687 thread_return 624.7056
180467 release_pages 483.8257
184343 unmap_vmas 98.1070
193584 find_vma 2081.5484
275611 __down_read_trylock 4053.1029
350792 get_page_from_freelist 363.8921
668687 __handle_mm_fault 290.9865
885636 default_idle 10543.2857
1066396 do_page_fault 568.7445
3478485 clear_page 61026.0526
8608929 total 3.7529
clear_page is by far the biggest CPU user, with do_page_fault second. The
high number of page faults, each of which requires a zeroed page, has been
shown to be the issue (we've tried quite a few things on the kernel side to
see if there was a kernel stupidity, but there wasn't really, only really
small stuff that made an impact in the 5% range). I'll spare everyone here
those details and skip ahead to the final result directly:
the same benchmark run with the patch I posted:
# time ./benchmark
real 0m17.443s
user 0m34.810s
sys 0m0.012s
notice the really low system time. (it's so low that a kernel profile
isn't useful anymore).
and the 1541 program:
# ./malloc-test32 4096000 2500 4
Average : 1.662635 seconds for 2500 requests of 4096000 bytes, 134218721MB
concurrent.
(stunning difference; but repeatable, so not an artifact of a fluke run)
The fundamental issue is not that brk() is that much more efficient than
mmap(). On the kernel side it for sure is not; it's basically the same
(there are some minor efficiencies in brk compared to mmap, but those are
truly minor, again in that 5% range). The real killer difference is that
brk memory gets recycled *by glibc*, so that after the recycle, no new
page faults and zeroing happen.
An alternative to the patch I posted would be to create a "recycling
engine" for mmaps in glibc next to the brk recycling; however, the
question then became "why not just use brk", and the only answer we came
up with was "yes, let's use it". Many of the reasons for choosing a small
threshold like 128 KB in 2001 no longer apply, at least not for "new"
applications. I could have posted a one-liner to just bump the threshold
statically, and that's the obvious 0th-order solution. But the original
idea of using mmap for long-lived allocations is sound in principle, so
the proposed simple heuristic is basically a 1st-order way to achieve that
even for old applications (as long as they don't start freeing big blocks
of memory early).
So as a summary: brk and mmap are basically equally expensive, except for
the recycling that glibc does on brk areas.