This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
[patch] Make the mmap/brk threshold in malloc dynamic to improve performance
- From: Arjan van de Ven <arjan at linux dot intel dot com>
- To: libc-alpha at sources dot redhat dot com
- Cc: val_henson at linux dot intel dot com
- Date: Thu, 02 Mar 2006 20:16:56 +0100
- Subject: [patch] Make the mmap/brk threshold in malloc dynamic to improve performance
Hi,
We've been investigating a performance issue that in the end turns out
to be best fixed in glibc.
The test case can best be described as a threaded application that has a
pool of worker threads that, for a transaction, allocate a piece of
memory (of around 1 to 8Mb, depending on the configuration), use that
for the transaction and then free it again, and return to the pool.
The performance issue arises because the glibc malloc allocator decides
to use mmap() for these requests. In 2001, that might have been a fine
choice.. but today it's not really. The kernel is required to zero the
memory for new mmaps, and for the test application over half the total
cpu time is spent just clearing these pages in the kernel. We've looked
at if we could fix the kernel, but realistically there is no choice:
zeroing memory is expensive. it eats cache and memory bandwidth, two
precious commodities these days. When we force the use of the brk()
allocator, performance is fine (eg over 2x that of the mmap-using
scenario).
This lead us to look at the threshold, which today is 128Kb. 128Kb was
in 2001 empirically determined according to the comments in the source
code. We believe there are some good reasons to reconsider this
threshold:
1) The VA space layout on 32 bit linux has changed
2) Application and memory sizes have grown since 2001
In the old days (of 2.2, 2.4 and early 2.6 kernels), on 32 bit linux the
mmap area started at 1Gb, and brk() was thus limited to some 800Mb in
size. Getting all big allocations out of the way was important back
then. Nowadays, the mmap area grows down from below the stack, so this
artificial limit is gone.
2) speaks for itself basically; not just processor MHz grew,
applications and workloads did too. No doubt a lot of it is programmer
sloppyness, but it's also reality.
Rather than just bumping the 128Kb to something bigger, we decided to
make the value dynamic via a simple algorithm:
within the 128Kb-32Mb range (64Mb on 64 bit), we keep track of the
biggest free() that was done on an mmap'd piece of memory, and use that
size as threshold from then on.
The rationale is that you on the one hand want temporary memory to be
brk() allocated, while for fragmentation reasons you want long term big
allocations to be mmap allocated. The dynamic threshold via this simple
algorithm tends to get all early allocations, and all allocations bigger
than temporary ones as long term and thus mmap. The heuristic isn't
perfect but in practice it seems to hold up pretty well, and solves the
performance issue we've been investigating nicely.
The one gotcha (and the biggest part of the patch) is that the dynamic
behavior has to stop once the user manually sets these thresholds; we
have to honor these obviously.
This work was done by me and Val Henson <val_henson@linux.intel.com>
Please consider applying.
--- glibc-20060217T1609/malloc/malloc.c.org 2006-03-02 10:31:45.000000000 +0100
+++ glibc-20060217T1609/malloc/malloc.c 2006-03-02 10:39:05.000000000 +0100
@@ -1405,6 +1405,19 @@ int __posix_memalign(void **, size_
#endif
/*
+ MMAP_THRESHOLD_MAX and _MIN are the bounds on the dynamically
+ adjusted MMAP_THRESHOLD.
+*/
+
+#ifndef DEFAULT_MMAP_THRESHOLD_MIN
+#define DEFAULT_MMAP_THRESHOLD_MIN (128 * 1024)
+#endif
+
+#ifndef DEFAULT_MMAP_THRESHOLD_MAX
+#define DEFAULT_MMAP_THRESHOLD_MAX (8 * 1024 * 1024 * sizeof(long))
+#endif
+
+/*
M_MMAP_THRESHOLD is the request size threshold for using mmap()
to service a request. Requests of at least this size that cannot
be allocated using already-existing space will be serviced via mmap.
@@ -1443,12 +1456,63 @@ int __posix_memalign(void **, size_
"large" chunks, but the value of "large" varies across systems. The
default is an empirically derived value that works well in most
systems.
+
+
+ Update in 2006:
+ The above was written in 2001. Since then the world has changed a lot.
+ Memory got bigger. Applications got bigger. The virtual address space
+ layout in 32 bit linux changed.
+
+ In the new situation, brk() and mmap space is shared and there are no
+ artificial limits on brk size imposed by the kernel. What is more,
+ applications have started using transient allocations larger than the
+ 128Kb as was imagined in 2001.
+
+ The price for mmap is also high now; each time glibc mmaps from the
+ kernel, the kernel is forced to zero out the memory it gives to the
+ application. Zeroing memory is expensive and eats a lot of cache and
+ memory bandwidth. This has nothing to do with the efficiency of the
+ virtual memory system, by doing mmap the kernel just has no choice but
+ to zero.
+
+ In 2001, the kernel had a maximum size for brk() which was about 800
+ megabytes on 32 bit x86, at that point brk() would hit the first
+ mmaped shared libaries and couldn't expand anymore. With current 2.6
+ kernels, the VA space layout is different and brk() and mmap
+ both can span the entire heap at will.
+
+ Rather than using a static threshold for the brk/mmap tradeoff,
+ we are now using a simple dynamic one. The goal is still to avoid
+ fragmentation. The old goals we kept are
+ 1) try to get the long lived large allocations to use mmap()
+ 2) really large allocations should always use mmap()
+ and we're adding now:
+ 3) transient allocations should use brk() to avoid forcing the kernel
+ having to zero memory over and over again
+
+ The implementation works with a sliding threshold, which is by default
+ limited to go between 128Kb and 32Mb (64Mb for 64 bitmachines) and starts
+ out at 128Kb as per the 2001 default.
+
+ This allows us to satisfy requirement 1) under the assumption that long
+ lived allocations are made early in the process' lifespan, before it has
+ started doing dynamic allocations of the same size (which will
+ increase the threshold).
+
+ The upperbound on the threshold satisfies requirement 2)
+
+ The threshold goes up in value when the application frees memory that was
+ allocated with the mmap allocator. The idea is that once the application
+ starts freeing memory of a certain size, it's highly probable that this is
+ a size the application uses for transient allocations. This estimator
+ is there to satisfy the new third requirement.
+
*/
#define M_MMAP_THRESHOLD -3
#ifndef DEFAULT_MMAP_THRESHOLD
-#define DEFAULT_MMAP_THRESHOLD (128 * 1024)
+#define DEFAULT_MMAP_THRESHOLD DEFAULT_MMAP_THRESHOLD_MIN
#endif
/*
@@ -2237,6 +2301,10 @@ struct malloc_par {
int n_mmaps;
int n_mmaps_max;
int max_n_mmaps;
+ /* the mmap_threshold is dynamic, until the user sets
+ it manually, at which point we need to disable any
+ dynamic behavior. */
+ int no_dyn_threshold;
/* Cache malloc_getpagesize */
unsigned int pagesize;
@@ -3414,6 +3482,12 @@ public_fREe(Void_t* mem)
#if HAVE_MMAP
if (chunk_is_mmapped(p)) /* release mmapped memory. */
{
+ /* see if the dynamic brk/mmap threshold needs adjusting */
+ if (!mp_.no_dyn_threshold && (p->size > mp_.mmap_threshold) &&
+ (p->size <= DEFAULT_MMAP_THRESHOLD_MAX)) {
+ mp_.mmap_threshold = p->size;
+ mp_.trim_threshold = 2 * mp_.mmap_threshold;
+ }
munmap_chunk(p);
return;
}
@@ -5404,10 +5478,12 @@ int mALLOPt(param_number, value) int par
case M_TRIM_THRESHOLD:
mp_.trim_threshold = value;
+ mp_.no_dyn_threshold = 1;
break;
case M_TOP_PAD:
mp_.top_pad = value;
+ mp_.no_dyn_threshold = 1;
break;
case M_MMAP_THRESHOLD:
@@ -5418,6 +5494,7 @@ int mALLOPt(param_number, value) int par
else
#endif
mp_.mmap_threshold = value;
+ mp_.no_dyn_threshold = 1;
break;
case M_MMAP_MAX:
@@ -5427,6 +5504,7 @@ int mALLOPt(param_number, value) int par
else
#endif
mp_.n_mmaps_max = value;
+ mp_.no_dyn_threshold = 1;
break;
case M_CHECK_ACTION: