malloc uses excessive memory for multi-threaded applications

The following program demonstrates malloc(3) using in excess of 600 megabytes of system memory while the program has never allocated more than 100 megabytes at any given time. This results from the use of thread-specific "preferred arenas" for memory allocations.

The program first starts a number of threads that contend doing simple malloc/free calls, with no net memory allocation. This establishes a preferred arena for each thread as a result of USE_ARENAS and PER_THREAD. Once the preferred arenas are established, the program has each thread, in turn, allocate 100 megabytes and then free all but 20 kilobytes, for a net memory allocation of 200 kilobytes. The resulting malloc_stats() show 600 megabytes of allocated memory that cannot be returned to the system. Over time, fragmentation of the heap can cause excessive paging even though actual memory allocation never exceeded system capacity. With preferred arenas used this way, multi-threaded program memory usage is essentially unbounded (or rather, bounded by the number of threads times the actual memory usage).

The program run and source code are below, along with the glibc version from my RHEL 5 system. Thank you for your consideration.

[root@lab2-160 test_heap]# ./memx
creating 10 threads
allowing threads to contend to create preferred arenas
display preferred arenas
Arena 0:
system bytes = 135168
in use bytes = 2880
Arena 1:
system bytes = 135168
in use bytes = 2224
Arena 2:
system bytes = 135168
in use bytes = 2224
Arena 3:
system bytes = 135168
in use bytes = 2224
Arena 4:
system bytes = 135168
in use bytes = 2224
Arena 5:
system bytes = 135168
in use bytes = 2224
Total (incl. mmap):
system bytes = 811008
in use bytes = 14000
max mmap regions = 0
max mmap bytes = 0
allowing threads to allocate 100MB each, sequentially in turn
thread 3 alloc 100MB
thread 3 free 100MB-20kB
thread 5 alloc 100MB
thread 5 free 100MB-20kB
thread 7 alloc 100MB
thread 7 free 100MB-20kB
thread 2 alloc 100MB
thread 2 free 100MB-20kB
thread 0 alloc 100MB
thread 0 free 100MB-20kB
thread 8 alloc 100MB
thread 8 free 100MB-20kB
thread 4 alloc 100MB
thread 4 free 100MB-20kB
thread 6 alloc 100MB
thread 6 free 100MB-20kB
thread 9 alloc 100MB
thread 9 free 100MB-20kB
thread 1 alloc 100MB
thread 1 free 100MB-20kB
Arena 0:
system bytes = 100253696
in use bytes = 40928
Arena 1:
system bytes = 100184064
in use bytes = 42352
Arena 2:
system bytes = 100163584
in use bytes = 22320
Arena 3:
system bytes = 100163584
in use bytes = 22320
Arena 4:
system bytes = 100163584
in use bytes = 22320
Arena 5:
system bytes = 100204544
in use bytes = 62384
Total (incl. mmap):
system bytes = 601133056
in use bytes = 212624
max mmap regions = 0
max mmap bytes = 0
[root@lab2-160 test_heap]# rpm -q glibc
glibc-2.5-42.el5_4.2
glibc-2.5-42.el5_4.2
[root@lab2-160 test_heap]#
====================================================================
[root@lab2-160 test_heap]# cat memx.c
// ****************************************************************************
#include <stdio.h>
#include <errno.h>
#include <assert.h>
#include <stdlib.h>
#include <time.h>     // nanosleep
#include <malloc.h>   // malloc_stats
#include <pthread.h>
#include <inttypes.h>

#define NTHREADS 10
#define NALLOCS 10000
#define ALLOCSIZE 10000

static volatile int go;
static volatile int die;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static void *ps[NALLOCS];    // allocations that are freed in turn by each thread
static void *pps1[NTHREADS]; // straggling allocations to prevent arena free
static void *pps2[NTHREADS]; // straggling allocations to prevent arena free

void
my_sleep( int ms )
{
    int rv;
    struct timespec ts;
    struct timespec rem;

    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000;
    for (;;) {
        rv = nanosleep(&ts, &rem);
        if (! rv) {
            break;
        }
        assert(errno == EINTR);
        ts = rem;
    }
}

void *
my_thread( void *context )
{
    int i;
    int rv;
    void *p;

    // first we spin to get our own arena
    while (go == 0) {
        p = malloc(ALLOCSIZE);
        assert(p);
        if (rand()%20000 == 0) {
            my_sleep(10);
        }
        free(p);
    }

    // then we give main a chance to print stats
    while (go == 1) {
        my_sleep(100);
    }
    assert(go == 2);

    // then one thread at a time, do our big allocs
    rv = pthread_mutex_lock(&mutex);
    assert(! rv);
    printf("thread %d alloc 100MB\n", (int)(intptr_t)context);
    for (i = 0; i < NALLOCS; i++) {
        ps[i] = malloc(ALLOCSIZE);
        assert(ps[i]);
    }
    printf("thread %d free 100MB-20kB\n", (int)(intptr_t)context);
    // N.B. we leave two allocations straggling
    pps1[(int)(intptr_t)context] = ps[0];
    for (i = 1; i < NALLOCS-1; i++) {
        free(ps[i]);
    }
    pps2[(int)(intptr_t)context] = ps[i];
    rv = pthread_mutex_unlock(&mutex);
    assert(! rv);
    return NULL;
}

int
main()
{
    int i;
    int rv;
    pthread_t thread;

    printf("creating %d threads\n", NTHREADS);
    for (i = 0; i < NTHREADS; i++) {
        rv = pthread_create(&thread, NULL, my_thread, (void *)(intptr_t)i);
        assert(! rv);
        rv = pthread_detach(thread);
        assert(! rv);
    }

    printf("allowing threads to contend to create preferred arenas\n");
    my_sleep(20000);

    printf("display preferred arenas\n");
    go = 1;
    my_sleep(1000);
    malloc_stats();

    printf("allowing threads to allocate 100MB each, sequentially in turn\n");
    go = 2;
    my_sleep(5000);
    malloc_stats();

    // free the stragglers
    for (i = 0; i < NTHREADS; i++) {
        free(pps1[i]);
        free(pps2[i]);
    }
    return 0;
}
[root@lab2-160 test_heap]#
You don't understand the difference between address space and allocated memory. The cost of large amounts of allocated address space is insignificant. If you don't want it, control it using the MALLOC_ARENA_MAX and MALLOC_ARENA_TEST environment variables.
Actually, I totally understand the difference, and that is why I mentioned the fragmentation of memory... When each arena has just a few straggling allocations, the maximum *committed* RAM required for the program's *working set* under the thread-preferred arena model is, in fact, N times that required for a traditional model, where N is the number of threads. This shows up as real-world thrashing that could actually be avoided. Basically, if the program is doing small allocations, a small percentage of stragglers can pin the entire allocated space -- and the allocated space is, in fact, much larger than it needs to be (and larger than it is on other OS's). But thank you for your time -- we all want the same thing here, an ever better Linux that is more suited to heavily threaded applications. :-)
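To put numbers on that, using the memx run above: six arenas each retain roughly 100 MB of "system bytes" pinned by the straggling chunks, so the process holds about 600 MB for roughly 212 kB of live data, versus the roughly 100 MB a single shared arena would retain for the same sequence of allocations (each thread's 100 MB peak fits in space the previous thread already freed).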
Hi Ulrich,

I apologize in advance and want you to know I will not reopen this bug again, but I felt I had to show you a new test program that clearly shows "The cost of large amounts of allocated address space is insignificant" can be exceedingly untrue for heavily threaded systems using large amounts of memory. In our product, we require 2x the RAM on Linux vs other OS's because of this. :-(

I've reduced the problem to a program that you can invoke with no options and it runs fine, but with the "-x" option it thrashes wildly. The only difference is that in the "-x" case we allow the threads to do some dummy malloc/frees up front to create thread-preferred arenas. The program simply has a bunch of threads that, in turn (i.e., not concurrently), allocate a bunch of memory, and then free most (but not all!) of it. The resulting allocations easily fit in RAM, even when fragmented. It then attempts to memset the unfreed memory to 0.

The problem is that in the thread-preferred arena case, the fragmented allocations are now spread over 10x the virtual space, and when accessed, result in actual commitment of at least 2x the physical space -- enough to push us over the top of RAM and into thrashing. So as a result, without the -x option, the program's memset runs in two seconds or so on my system (8-way, 2GHz, 12GB RAM); with the -x option, the program's memset can take hundreds to thousands of seconds.

I know this sounds contrived, but it was in fact *derived* from a real-life problem. All I am hoping to convey is that there are memory-intensive applications for which thread-preferred arenas actually hurt performance significantly. Furthermore, turning on MALLOC_PER_THREAD can actually have an even more devastating effect on these applications than the default behavior. And unfortunately, neither MALLOC_ARENA_MAX nor MALLOC_ARENA_TEST can prevent the thread-preferred arena proliferation.

The test run output without and with the "-x" option is below; the source code is below that. Thank you for your time. Like I said, I won't reopen this again, but I hope you'll consider giving applications like ours a "way out" of the thread-preferred arenas in the future -- especially since it seems our future is even more bleak with MALLOC_PER_THREAD, and that's the way you are moving (and for certain applications, MALLOC_PER_THREAD makes sense!). Anyway, I've already written a small block binned allocator that will live on top of mmap'd pages for us for Linux, so we're OK. But I'd rather just use malloc(3).

-- Rich

[root@lab2-160 test_heap]# ./memx2
cpus = 8; pages = 3072694; pagesize = 4096
nallocs = 307200
--- creating 100 threads ---
--- waiting for threads to allocate memory ---
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
--- malloc_stats() ---
Arena 0:
system bytes = 1557606400
in use bytes = 743366944
Total (incl. mmap):
system bytes = 1562529792
in use bytes = 748290336
max mmap regions = 2
max mmap bytes = 4923392
--- cat /proc/29565/status | grep -i vm ---
VmPeak: 9961304 kB
VmSize: 9951060 kB
VmLck: 0 kB
VmHWM: 2517656 kB
VmRSS: 2517656 kB
VmData: 9945304 kB
VmStk: 84 kB
VmExe: 8 kB
VmLib: 1532 kB
VmPTE: 19432 kB
--- accessing memory ---
--- done in 3 seconds ---
[root@lab2-160 test_heap]# ./memx2 -x
cpus = 8; pages = 3072694; pagesize = 4096
nallocs = 307200
--- creating 100 threads ---
--- allowing threads to create preferred arenas ---
--- waiting for threads to allocate memory ---
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
--- malloc_stats() ---
Arena 0:
system bytes = 1264455680
in use bytes = 505209392
Arena 1:
system bytes = 1344937984
in use bytes = 653695200
Arena 2:
system bytes = 1396580352
in use bytes = 705338800
Arena 3:
system bytes = 1195057152
in use bytes = 503815408
Arena 4:
system bytes = 1295818752
in use bytes = 604577136
Arena 5:
system bytes = 1094295552
in use bytes = 403053744
Arena 6:
system bytes = 1245437952
in use bytes = 554196272
Arena 7:
system bytes = 1144676352
in use bytes = 453434608
Arena 8:
system bytes = 1346199552
in use bytes = 654958000
Total (incl. mmap):
system bytes = 2742448128
in use bytes = 748234656
max mmap regions = 2
max mmap bytes = 4923392
--- cat /proc/29669/status | grep -i vm ---
VmPeak: 49213720 kB
VmSize: 49182988 kB
VmLck: 0 kB
VmHWM: 12052384 kB
VmRSS: 11861284 kB
VmData: 49177232 kB
VmStk: 84 kB
VmExe: 8 kB
VmLib: 1532 kB
VmPTE: 95452 kB
--- accessing memory ---
60 secs... 120 secs... 180 secs... 240 secs... 300 secs... 360 secs... 420 secs... 480 secs... 540 secs... 600 secs... 660 secs... 720 secs... 780 secs...
--- done in 818 seconds ---
[root@lab2-160 test_heap]#
[root@lab2-160 test_heap]# cat memx2.c
// ****************************************************************************
#include <stdio.h>
#include <errno.h>
#include <assert.h>
#include <limits.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>       // nanosleep, time
#include <malloc.h>     // malloc_stats
#include <sys/types.h>  // uint
#include <pthread.h>
#include <inttypes.h>

#define NTHREADS 100
#define ALLOCSIZE 16384
#define STRAGGLERS 100

static uint cpus;
static uint pages;
static uint pagesize;
static uint nallocs;
static volatile int go;
static volatile int done;
static volatile int spin;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static void **ps;  // allocations that are freed in turn by each thread
static int nps;
static void **ss;  // straggling allocations to prevent arena free
static int nss;

void
my_sleep( int ms )
{
    int rv;
    struct timespec ts;
    struct timespec rem;

    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000;
    for (;;) {
        rv = nanosleep(&ts, &rem);
        if (! rv) {
            break;
        }
        assert(errno == EINTR);
        ts = rem;
    }
}

void *
my_thread( void *context )
{
    int i;
    int n;
    int si;
    int rv;
    void *p;

    n = (int)(intptr_t)context;

    while (! go) {
        my_sleep(100);
    }

    // first we spin to get our own arena
    while (spin) {
        p = malloc(ALLOCSIZE);
        assert(p);
        if (rand()%20000 == 0) {
            my_sleep(10);
        }
        free(p);
    }
    my_sleep(1000);

    // then one thread at a time, do our big allocs
    rv = pthread_mutex_lock(&mutex);
    assert(! rv);
    for (i = 0; i < nallocs; i++) {
        assert(i < nps);
        ps[i] = malloc(ALLOCSIZE);
        assert(ps[i]);
    }
    // N.B. we leave 1 of every STRAGGLERS allocations straggling
    for (i = 0; i < nallocs; i++) {
        assert(i < nps);
        if (i%STRAGGLERS == 0) {
            si = nallocs/STRAGGLERS*n + i/STRAGGLERS;
            assert(si < nss);
            ss[si] = ps[i];
        } else {
            free(ps[i]);
        }
    }
    done++;
    printf("%d ", done);
    fflush(stdout);
    rv = pthread_mutex_unlock(&mutex);
    assert(! rv);
    return NULL;
}

int
main(int argc, char **argv)
{
    int i;
    int rv;
    time_t n;
    time_t t;
    time_t lt;
    pthread_t thread;
    char command[128];

    if (argc > 1) {
        if (! strcmp(argv[1], "-x")) {
            spin = 1;
            argc--;
            argv++;
        }
    }
    if (argc > 1) {
        printf("usage: memx2 [-x]\n");
        return 1;
    }

    cpus = sysconf(_SC_NPROCESSORS_CONF);
    pages = sysconf(_SC_PHYS_PAGES);
    pagesize = sysconf(_SC_PAGESIZE);
    printf("cpus = %d; pages = %d; pagesize = %d\n", cpus, pages, pagesize);

    nallocs = pages/10/STRAGGLERS*STRAGGLERS;
    assert(! (nallocs%STRAGGLERS));
    printf("nallocs = %d\n", nallocs);

    nps = nallocs;
    ps = malloc(nps*sizeof(*ps));
    assert(ps);
    nss = NTHREADS*nallocs/STRAGGLERS;
    ss = malloc(nss*sizeof(*ss));
    assert(ss);

    if (pagesize != 4096) {
        printf("WARNING -- this program expects 4096 byte pagesize!\n");
    }

    printf("--- creating %d threads ---\n", NTHREADS);
    for (i = 0; i < NTHREADS; i++) {
        rv = pthread_create(&thread, NULL, my_thread, (void *)(intptr_t)i);
        assert(! rv);
        rv = pthread_detach(thread);
        assert(! rv);
    }

    go = 1;
    if (spin) {
        printf("--- allowing threads to create preferred arenas ---\n");
        my_sleep(5000);
        spin = 0;
    }

    printf("--- waiting for threads to allocate memory ---\n");
    while (done != NTHREADS) {
        my_sleep(1000);
    }
    printf("\n");

    printf("--- malloc_stats() ---\n");
    malloc_stats();
    sprintf(command, "cat /proc/%d/status | grep -i vm", (int)getpid());
    printf("--- %s ---\n", command);
    (void)system(command);

    // access the stragglers
    printf("--- accessing memory ---\n");
    t = time(NULL);
    lt = t;
    for (i = 0; i < nss; i++) {
        memset(ss[i], 0, ALLOCSIZE);
        n = time(NULL);
        if (n-lt >= 60) {
            printf("%d secs... ", (int)(n-t));
            fflush(stdout);
            lt = n;
        }
    }
    if (lt != t) {
        printf("\n");
    }
    printf("--- done in %d seconds ---\n", (int)(time(NULL)-t));
    return 0;
}
[root@lab2-160 test_heap]#
I already described what you can do to limit the number of memory pools. Just use it. If you don't like envvars, use the appropriate mallopt() calls (using M_ARENA_MAX and M_ARENA_TEST). No malloc implementation is optimal for all situations. This is why there are customization knobs.
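For anyone landing here later, a minimal sketch of what such a call looks like. The -7/-8 values match the mallopt() calls and the M_ARENA_TEST/M_ARENA_MAX definitions quoted further down in this thread; older malloc.h headers may not export the symbolic names, and as the following comments show, these knobs did not behave as documented on 2.5-era glibc.

#include <malloc.h>

#ifndef M_ARENA_TEST            /* not exported by older malloc.h headers */
# define M_ARENA_TEST -7
#endif
#ifndef M_ARENA_MAX
# define M_ARENA_MAX  -8
#endif

int main(void)
{
    /* must run before threads start allocating, or arenas already exist */
    mallopt(M_ARENA_TEST, 1);   /* arena count at which the limit kicks in */
    mallopt(M_ARENA_MAX, 1);    /* cap on the number of arenas */

    /* ... start threads and allocate as usual ... */
    return 0;
}

The environment-variable equivalents are MALLOC_ARENA_TEST and MALLOC_ARENA_MAX (plus MALLOC_PER_THREAD to enable the per-thread arena code at all on this glibc).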
Hi Ulrich,

Agreed 100% that no one size fits all... Unfortunately, neither of the "tuning" settings, MALLOC_ARENA_MAX nor MALLOC_ARENA_TEST, seems to work. Neither do the mallopt() M_ARENA_MAX nor M_ARENA_TEST options. :-(

Part of the problem seems to stem from the fact that the global "narenas" is only incremented if MALLOC_PER_THREAD/use_per_thread is true...

#ifdef PER_THREAD
  if (__builtin_expect (use_per_thread, 0))
    {
      ++narenas;
      (void)mutex_unlock(&list_lock);
    }
#endif

So the tests of those other variables in reused_arena() never limit anything. And setting MALLOC_PER_THREAD makes our problem much worse.

static mstate
reused_arena (void)
{
  if (narenas <= mp_.arena_test)
    return NULL;
  ...
  if (narenas < narenas_limit)
    return NULL;

I also tried all combinations I could imagine of MALLOC_PER_THREAD and the other variables, to no avail. I also did the same with mallopt(), verifying at the assembly level that we got all the right values into mp_. :-(

Specifically, I tried things like:

export MALLOC_PER_THREAD=1
export MALLOC_ARENA_MAX=1
export MALLOC_ARENA_TEST=1

and:

rv = mallopt(-7, 1);
printf("%d\n", rv);
rv = mallopt(-8, 1);
printf("%d\n", rv);

Anyway, thank you. You've already pointed me in all of the right directions. If I did something completely brain-dead above, feel free to tell me and save me another few days of work! :-)

-- Rich
And a comment for anyone else who might stumble this way... I *can* reduce the total number of arenas to *2* (not low enough for our purposes) with the following sequence:

export MALLOC_PER_THREAD=1

rv = mallopt(-7, 1); // M_ARENA_TEST
printf("%d\n", rv);
rv = mallopt(-8, 1); // M_ARENA_MAX
printf("%d\n", rv);

*PLUS* I have to have a global pthread mutex around every malloc(3) and free(3) call -- I can't figure out from the code why this is required, but without it the number of arenas seems independent of the mallopt settings.

I cannot get to *1* arena because a) mallopt() won't allow you to set arena_test to 0:

#ifdef PER_THREAD
    case M_ARENA_TEST:
      if (value > 0)
        mp_.arena_test = value;
      break;

    case M_ARENA_MAX:
      if (value > 0)
        mp_.arena_max = value;
      break;
#endif

And b) reused_arena() uses a ">=" here rather than a ">":

static mstate
reused_arena (void)
{
  if (narenas <= mp_.arena_test)
    return NULL;
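A minimal sketch of that global-lock workaround, for anyone who needs to reproduce it (the xmalloc/xfree wrapper names are illustrative; the comment above does not say how the lock was interposed):

#include <stdlib.h>
#include <pthread.h>

static pthread_mutex_t malloc_lock = PTHREAD_MUTEX_INITIALIZER;

/* call these instead of malloc/free so every allocation is fully serialized */
void *xmalloc(size_t size)
{
    pthread_mutex_lock(&malloc_lock);
    void *p = malloc(size);
    pthread_mutex_unlock(&malloc_lock);
    return p;
}

void xfree(void *p)
{
    pthread_mutex_lock(&malloc_lock);
    free(p);
    pthread_mutex_unlock(&malloc_lock);
}

Serializing every call gives up malloc concurrency entirely, so this only makes sense while arena creation cannot be limited any other way.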
Last mail... It turns out the arena_max and arena_test numbers are "fuzzy" (I am sure by design), since no lock is held here:

static mstate
internal_function
arena_get2(mstate a_tsd, size_t size)
{
  mstate a;

#ifdef PER_THREAD
  if (__builtin_expect (use_per_thread, 0))
    {
      if ((a = get_free_list ()) == NULL
          && (a = reused_arena ()) == NULL)
        /* Nothing immediately available, so generate a new arena.  */
        a = _int_new_arena(size);

      return a;
    }
#endif

Therefore, if narenas is less than the limit tested for in reused_arena(), and N threads get into this code at once, narenas can then end up N-1 *above* the limit. The likelihood of this happening is proportional to the malloc arrival rate and the time spent in _int_new_arena(). This is exactly what I am seeing.

So if you can live with 2 arenas, the critical thing to do is to make sure narenas is exactly 2 before going heavily multi-threaded, and then it won't be able to go above 2; otherwise, it can sneak up to 2+N-1, where N is the number of threads contending for allocations.

If the ">=" in reused_arena() was changed to ">", then we could use this mechanism to limit narenas to exactly 1 right from the get-go. That would be ideal for our kind of applications (that can't live with 2 arenas).
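As a concrete illustration of the race: with arena_max set to 2 and narenas currently 1, ten threads calling malloc simultaneously can each pass the reused_arena() check before any of them reaches _int_new_arena(), so each creates an arena and the process ends up with 11 -- nine above the configured limit, matching the 2+N-1 bound above.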
We have exactly the same problem with the current implementation of malloc. The solution suggested by Ulrich, using M_ARENA_MAX, does not work, since the check on the number of arenas is not thread safe. In fact, the limit fails precisely for the heavily threaded applications that need it! Since the number of cores and the use of threads will only keep increasing, there should be a solution for this kind of application. If the arena limit worked as described, we would have no problem.
Hi,

We ended up building our own memory allocator -- it's faster and more efficient than glibc's, and it works equally fast with threads and without. We used the "small block allocator" concept from HP-UX, where we only request huge (32MB) allocations from the system (after setting M_MMAP_THRESHOLD suitably small). We then carve out large *naturally aligned* 1MB blocks from the huge allocation (accepting 3% waste, since the allocation was page aligned to begin with, not naturally aligned). And we carve each one of those large blocks into small fixed-size buckets (which are fractional powers of 2 -- like 16 bytes, 20, 24, 28, 32, 40, 48, 56, 64, 80, etc.). Then we put the aligned addresses into a very fast hash and have a linked list for each bucket size. This means our allocate routine is just a lock, linked list remove, unlock, on average, and our free routine is just a hash lookup, lock, linked list insert, unlock, on average. The trick here is that from any address being freed, you can get back to the naturally aligned 1MB block that contains it with just a pointer mask, and from there you can get the allocation's size as well as the head of the linked list of free entries to which it should be returned...

-- Rich
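A minimal sketch of the free path described above, under assumed names and sizes (block, block_of, sba_free, the 4096-slot hash, and the in-chunk free-list link are illustrative, not our actual code):

#include <stdint.h>
#include <stddef.h>
#include <pthread.h>

#define BLOCK_SIZE  ((uintptr_t)1 << 20)   /* naturally aligned 1MB blocks */
#define BLOCK_MASK  (~(BLOCK_SIZE - 1))
#define HASH_SLOTS  4096

struct block {                     /* metadata kept per 1MB block */
    uintptr_t     base;            /* block base address (1MB aligned) */
    size_t        chunk_size;      /* the fixed bucket size carved from this block */
    void         *free_list;       /* free chunks, linked through their first word */
    struct block *next;            /* hash chain */
};

static struct block    *hash_table[HASH_SLOTS];
static pthread_mutex_t  sba_lock = PTHREAD_MUTEX_INITIALIZER;

/* mask the freed address down to its 1MB block, then find its metadata in the
   hash (the unlocked walk assumes blocks are never removed once created) */
static struct block *block_of(void *p)
{
    uintptr_t base = (uintptr_t)p & BLOCK_MASK;
    struct block *b = hash_table[(base / BLOCK_SIZE) % HASH_SLOTS];
    while (b && b->base != base)
        b = b->next;
    return b;
}

/* free is just: hash lookup, lock, linked-list insert, unlock */
static void sba_free(void *p)
{
    struct block *b = block_of(p);
    pthread_mutex_lock(&sba_lock);
    *(void **)p = b->free_list;    /* reuse the chunk's first word as the link */
    b->free_list = p;
    pthread_mutex_unlock(&sba_lock);
}

The allocation path is the mirror image: pop the head of the free list for the requested bucket size, carving a new 1MB block out of a 32MB region only when that list is empty.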
Created attachment 5917 [details] Memory consumption with glibc malloc and jeMalloc (straight line).
Comment on attachment 5917 [details]
Memory consumption with glibc malloc and jeMalloc (straight line).

We have now changed to another malloc implementation, jeMalloc (http://www.canonware.com/jemalloc/), which is an order of magnitude superior to the glibc malloc. A similar implementation is also used in the *BSD variants! Linux/glibc should really improve its malloc, since the current implementation is not sufficient for large applications. Why can't this implementation be used inside glibc? Is it a GPL <-> BSD license problem?
Stop reopening. There is a solution for people who are stupid enough to create too many threads. No implementation will be perfect for everyone. The glibc implementation is tuned for reasonable programs and will run much faster than any other I tested.
Let's all not take things so personally -- nobody here is stupid (and I'm sure some folks here are a *lot* smarter than other folks give them credit for)... There are lots of reasons to create a half dozen threads, and that's all it takes to make the glibc version perform absolutely horribly. (And there is no objective measurement that won't show my version of malloc is faster than yours -- so this has been a win all around for us, thanks...) If you're not interested in improving glibc, you can just say so. But stop the name calling when you feel threatened -- my 5 year old daughter has already outgrown that.

-- Rich
Ulrich Drepper, this huge virtual memory allocation could be a potential troublemaker on Linux 6 with a 64-bit JVM. There is already a document on Hadoop regarding this issue, but their solution of setting MALLOC_ARENA_MAX=4 has no effect; we still find JVMs reported with 30G of virtual memory. https://issues.apache.org/jira/browse/HADOOP-7154
This should have been fixed by the following commit:

commit 41b81892f11fe1353123e892158b53de73863d62
Author: Ulrich Drepper <drepper@gmail.com>
Date:   Tue Jan 31 14:42:34 2012 -0500

    Handle ARENA_TEST correctly

I have verified that, using `mallopt (M_ARENA_MAX, 1)', memory use is bounded by the single arena:

creating 10 threads
allowing threads to contend to create preferred arenas
display preferred arenas
Arena 0:
system bytes = 135168
in use bytes = 2880
Total (incl. mmap):
system bytes = 135168
in use bytes = 2880
max mmap regions = 0
max mmap bytes = 0
allowing threads to allocate 100MB each, sequentially in turn
thread 0 alloc 100MB
thread 0 free 100MB-20kB
thread 4 alloc 100MB
thread 4 free 100MB-20kB
thread 9 alloc 100MB
thread 9 free 100MB-20kB
thread 5 alloc 100MB
thread 5 free 100MB-20kB
thread 2 alloc 100MB
thread 2 free 100MB-20kB
thread 7 alloc 100MB
thread 7 free 100MB-20kB
thread 1 alloc 100MB
thread 1 free 100MB-20kB
thread 8 alloc 100MB
thread 8 free 100MB-20kB
thread 6 alloc 100MB
thread 6 free 100MB-20kB
thread 3 alloc 100MB
thread 3 free 100MB-20kB
Arena 0:
system bytes = 100392960
in use bytes = 201472
Total (incl. mmap):
system bytes = 100392960
in use bytes = 201472
max mmap regions = 0
max mmap bytes = 0

Therefore the solution for a program with lots of threads is to limit the arenas as a trade-off for memory.
> Therefore the solution to a program with lots of threads is to limit the arenas
> as a trade-off for memory.

That is a band-aid, not a solution. There is still no memory returned to the system when one first does allocations and then allocates auxiliary memory, like:

void *calculate ()
{
  void **ary = malloc (1000000 * sizeof (void *));
  int i;
  for (i = 0; i < 1000000; i++)
    ary[i] = malloc (100);
  for (i = 0; i < 999999; i++)
    free (ary[i]);
  return ary[999999];
}

When one acknowledges a bug, a solution is relatively simple. Add a flag UNMAPPED for chunks, meaning that all pages completely contained in the chunk were zeroed by madvise(s, n, MADV_DONTNEED).

You keep track of memory in use and system memory, and when their ratio is bigger than two, you mark chunks, starting from the largest ones, UNMAPPED to decrease the system charge.

This deals with the RSS problem. Virtual address space usage could still be excessive, but that is the smaller problem.
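For readers unfamiliar with the primitive being proposed, a minimal sketch of what releasing the pages behind a free chunk could look like (an illustration of the idea only, not how glibc malloc actually trims): the physical pages fully inside the chunk are given back while the address range stays mapped, and they are recommitted on the next write.

#include <stdint.h>
#include <unistd.h>
#include <sys/mman.h>

/* release the physical pages completely contained in [p, p+len) while
   keeping the virtual range mapped; the pages read back as zeros */
static int release_pages(void *p, size_t len)
{
    size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
    uintptr_t start = ((uintptr_t)p + pagesize - 1) & ~(pagesize - 1);  /* round up */
    uintptr_t end   = ((uintptr_t)p + len) & ~(pagesize - 1);           /* round down */

    if (end <= start)
        return 0;   /* chunk contains no whole page */
    return madvise((void *)start, end - start, MADV_DONTNEED);
}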
(In reply to Ondrej Bilka from comment #16)
> > Therefore the solution to a program with lots of threads is to limit the arenas
> > as a trade-off for memory.
>
> That is a bandaid not a solution. Still there is no memory returned to
> system when one first does allocations and then allocates auxiliary memory
> like

You have not understood the bug report.

> void *calculate ()
> {
>   void **ary = malloc (1000000 * sizeof (void *));
>   for (i = 0; i < 1000000; i++) ary[i] = malloc (100);
>   for (i = 0; i < 999999; i++) free (ary[i]);
>   return ary[999999];
> }

This is a different problem from the current bug report, which is about too many arenas getting created, resulting in excessive address space usage, and the MALLOC_ARENA_* variables not working to limit them. Memory holes not being freed has nothing to do with it.

> When one acknowledges a bug a solution is relatively simple. Add a flag
> UNMAPPED for chunks which means that all pages completely contained in chunk
> were zeroed by madvise(s, n, MADV_DONTNEED).
>
> You keep track of memory used and system and when their ratio is bigger than
> two you make chunks starting from largest ones UNMAPPED to decrease system
> charge.
>
> This deals with RSS problem. A virtual space usage could still be excesive
> but that is smaller problem.

The problem you've described is different, and I'm sure there's a bug report open for it too. madvise is not sufficient to free up commit charge; there's a mail thread on libc-alpha that discusses this problem that you can search for and read up on. I think vm.overcommit_memory is one of the keywords to look for.
> You have not understood the bug report.

When you read the discussion more carefully, you find the following posts where this problem is mentioned:

Ulrich Drepper:

"You don't understand the difference between address space and allocated memory."

Rich Testardi:

"Actually, I totally understand the difference and that is why I mentioned the fragmentation of memory... When each arena has just a few straggling allocations, the maximum *committed* RAM required for the program's *working set* using the thread-preferred arena model is, in fact, N times that required for a traditional model, where N is the number of threads. This shows up in real-world thrashing that could actually be avoided. Basically, if the program is doing small allocations, a small percentage of stragglers can pin the entire allocated space -- and the allocated space is, in fact, much larger than it needs to be (and larger than it is in other OS's). But thank you for
(In reply to Ondrej Bilka from comment #18)
> When you read discussion more carefully there are following posts where
> this problem is mentioned:
>
> Ulrich Drepper:
>
> You don't understand the difference between address space and allocated
> memory.
>
> Rich Testardi:
>
> Actually, I totally understand the difference and that is why I mentioned
> the fragmentation of memory... When each arena has just a few straggling
> allocations, the maximum *committed* RAM required for the program's *working
> set* using the thread-preferred arena model is, in fact, N times that
> required for a traditional model, where N is the number of threads. This
> shows up in real-world thrashing that could actually be avoided. Basically,
> if the program is doing small allocations, a small percentage of stragglers
> can pin the entire allocated space -- and the allocated space is, in fact,
> much larger than it needs to be (and larger than it is in other OS's). But
> thank you for

Right, but most comments on the bug report (and the resolution) are in the context of malloc creating too many arenas and the switches not working to limit them. A single allocation blocking an entire free region is not a multi-threaded problem -- it occurs with a single thread too, and is only compounded by multiple arenas. I'd suggest working with a fresh bug report, or an open bug report that describes this problem exactly (which I'm pretty sure there should be).
*** Bug 260998 has been marked as a duplicate of this bug. ***
I'm marking this fixed, since the tunables that limit arena creation are fixed. You can limit the number of arenas in your application at the cost of thread contention during allocation (increased malloc latency). This does, however, limit the total VA usage, which matters particularly for 32-bit applications running close to the 32-bit VA limit.