Created attachment 6725 [details]
test-case

free() isn't calling brk() to give memory back to the kernel when M_TRIM_THRESHOLD is passed.

Run the attached test-case. What it does:

1. Calls malloc() 2800000 times
2. Calls free() 2800000 times
3. Pauses, so you can inspect the heap size

You'll see that the heap size is around 250 MB. Manually calling malloc_trim() through gdb decreases the heap size to 4 KB.

----------------------------------------------------
How I measured heap size:

$ cat /proc/12345/maps | grep heap
01bc6000-0f180000 rw-p 00000000 00:00 0          [heap]
$ python
> (0x0f180000-0x01bc6000) / (1024*1024)
> 213                              # 213 megabytes

$ top -p 12345                     # tested with top too
227m 214m for VIRT and RES respectively

$ gdb -pid 12345                   # let's attach gdb and call malloc_trim()
> call malloc_trim(0)

$ top -p 12345
14492 1076 for VIRT and RES respectively

$ cat /proc/12345/maps | grep heap
01bc6000-01bc7000 rw-p 00000000 00:00 0          [heap]
$ python
> (0x01bc7000-0x01bc6000) / (1024*1024)
> 0.00390625                       # 4 KB

------------------------------------------------------------

I'm on Linux 3.6.5 with glibc-2.16.
This seems to be caused by the "fastbins" feature. free() doesn't trim fastbin chunks because the allocations were smaller than M_MXFAST. But there really should be a limit on the number of fastbin chunks that we keep around. In KDE we've seen 600 MB of memory being freed after attaching gdb and calling malloc_trim(0).
I can reproduce this issue and think it's also a problem for KDevelop and similar apps. What else is needed to improve the situation here?
The way we are going to improve this situation is, IMO, by moving from fastbins to per-thread caches, and those caches will have a size limit to bound RSS growth. Fastbins are not as fast as lockless per-thread caches, which are the common implementation in both tcmalloc and jemalloc. We already have an implementation on the dj/malloc branch, which has been proposed and posted.
I looked at the dj/malloc branch and I don't think it can fix this issue. The main problem is in _int_free:

  _int_free (mstate av, mchunkptr p, int have_lock)
    ...
    if ((unsigned long)(size) <= (unsigned long)(get_max_fast ()))  <-------------

If all chunk sizes are smaller than M_MXFAST, this code path *never* releases the consolidated memory. This is a serious issue after all. In my opinion, once the total unused size reaches 2*FASTBIN_CONSOLIDATION_THRESHOLD, we should free FASTBIN_CONSOLIDATION_THRESHOLD of it. I can make a patch once I figure out how to get the total size of the consolidatable chunks.
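The proposal above could be sketched roughly as follows. This is pseudocode against glibc internals, not actual source: fastbin_push() and the fastbin_free_bytes accounting field are hypothetical names, and the exact trim mechanism would need to follow whatever systrim()/malloc_consolidate() require in the real tree.

```c
/* Pseudocode sketch of the proposed check inside _int_free(),
   after the chunk has been placed on its fastbin.  */
if ((unsigned long) size <= (unsigned long) get_max_fast ())
  {
    fastbin_push (av, p);                /* existing behaviour          */
    av->fastbin_free_bytes += size;      /* hypothetical accounting     */

    /* Proposed addition: once the fastbins hold twice the
       consolidation threshold, consolidate them and give roughly
       half of that memory back to the system.  */
    if (av->fastbin_free_bytes >= 2 * FASTBIN_CONSOLIDATION_THRESHOLD)
      {
        malloc_consolidate (av);
        systrim (FASTBIN_CONSOLIDATION_THRESHOLD, av);
        av->fastbin_free_bytes = 0;
      }
  }
```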
Any news on this? This issue affects many applications and libraries. Moreover, Anthony pointed out the location of the bug.
Ok, let's make M_MXFAST tunable and add an M_MXFAST_ environment variable?
(In reply to Aleksandr Kurakin from comment #6)
> Ok, let's make M_MXFAST tunable and add M_MXFAST_ environment variable?

We now have a glibc.malloc.mxfast tunable, so you can test this out without adding code or preloading a library that calls mallopt to set M_MXFAST to 0.
(In reply to Carlos O'Donell from comment #7)
> We now have glibc.malloc.mxfast tunable so you can test this out without
> adding code or preloading a library that calls mallopt with M_MXFAST
> setting to 0.

Thanks very much!
What version of glibc does this require? The new tunable isn't yet documented on https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html