This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [patch] Make the mmap/brk threshold in malloc dynamic to improve performance
- From: Arjan van de Ven <arjan at linux dot intel dot com>
- To: Ulrich Drepper <drepper at redhat dot com>
- Cc: libc-alpha at sources dot redhat dot com, val_henson at linux dot intel dot com
- Date: Thu, 02 Mar 2006 22:55:07 +0100
- Subject: Re: [patch] Make the mmap/brk threshold in malloc dynamic to improve performance
- References: <1141327016.3206.112.camel@laptopd505.fenrus.org> <440754B8.1020102@redhat.com>
Ulrich Drepper wrote:
> Arjan van de Ven wrote:
>> The heuristic isn't perfect but in practice it seems to hold up pretty
>> well, and solves the performance issue we've been investigating nicely.
>
> Where is the data and test specification to back this up?
Here is some supporting performance data; I've not done benchmarks on GUI
apps since those are really unreliable in general.
Before situation (on a 2-way machine):
# time ./benchmark
real 0m30.462s
user 0m39.406s
sys 0m19.613s
notice the high system time
or with the application from bug 1541:
# ./malloc-test32 4096000 2500 4
Average : 29.745164 seconds for 2500 requests of 4096000 bytes,
134218723MB concurrent.
the top of the kernel profile with "benchmark" looks like this (this is
after several hours of running the benchmark in a loop):
130375 free_hot_cold_page 521.5000
132808 __pagevec_lru_add_active 567.5556
133687 thread_return 624.7056
180467 release_pages 483.8257
184343 unmap_vmas 98.1070
193584 find_vma 2081.5484
275611 __down_read_trylock 4053.1029
350792 get_page_from_freelist 363.8921
668687 __handle_mm_fault 290.9865
885636 default_idle 10543.2857
1066396 do_page_fault 568.7445
3478485 clear_page 61026.0526
8608929 total 3.7529
clear_page is by far the biggest CPU user, with do_page_fault second. The
high number of page faults, each of which requires a zeroed page, has been
shown to be the issue (we've tried quite a few things on the kernel side to
see if there was a kernel stupidity, but there wasn't really, only really
small stuff that made an impact in the 5% range). I'll spare everyone here
those details and skip ahead to the final result directly:
the same benchmark run with the patch I posted:
# time ./benchmark
real 0m17.443s
user 0m34.810s
sys 0m0.012s
notice the really low system time. (it's so low that a kernel profile
isn't useful anymore).
and the 1541 program:
# ./malloc-test32 4096000 2500 4
Average : 1.662635 seconds for 2500 requests of 4096000 bytes, 134218721MB
concurrent.
(stunning difference; but repeatable, so not an artifact of a fluke run)
The fundamental issue is not that brk() is that much more efficient than
mmap(). On the kernel side it for sure is not; it's basically the same
(there are some minor efficiencies in brk compared to mmap, but those are
truly minor, again in that 5% range). The real killer difference is that
brk memory gets recycled *by glibc*, so that after the recycle, no new
page faults and zeroing happen.
An alternative to the patch I posted would be to create a "recycling
engine" for mmaps in glibc next to the brk recycling; however, the
question then became "why not just use brk", and the only answer we came
up with was "yes, let's use it". Many of the reasons for choosing a small
threshold like 128 KB in 2001 no longer apply, at least not for "new"
applications. I could have posted a one-liner to just bump the threshold
statically, and that's the obvious 0th-order solution. But the original
idea of using mmap for long-lived allocations is sound in principle, so
the proposed simple heuristic is basically a 1st-order way to achieve that
even for old applications (as long as they don't start freeing big blocks
of memory early).
So as a summary: brk and mmap are basically equally expensive, except for
the recycling that glibc does on brk areas.