Bug 11261 - malloc uses excessive memory for multi-threaded applications
Summary: malloc uses excessive memory for multi-threaded applications
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: malloc
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-02-08 20:23 UTC by Rich Testardi
Modified: 2016-09-29 20:12 UTC
CC List: 7 users

See Also:
Host:
Target:
Build:
Last reconfirmed: 2013-12-12 00:00:00
fweimer: security-


Attachments
Memory consumption with glibc malloc and jeMalloc (straight line). (171.26 KB, image/png)
2011-09-02 07:38 UTC, Marius Heuler

Description Rich Testardi 2010-02-08 20:23:39 UTC
malloc uses excessive memory for multi-threaded applications

The following program demonstrates malloc(3) using in excess of 600 megabytes 
of system memory while the program has never allocated more than 100 megabytes 
at any given time.  This results from the use of thread-specific "preferred 
arenas" for memory allocations.

The program starts by having a number of threads contend while doing simple 
malloc/frees, with no net memory allocation.  This establishes a preferred 
arena for each thread as a result of USE_ARENAS and PER_THREAD.  Once 
preferred arenas are established, the program then has each thread, in turn, 
allocate 100 megabytes and then free all but 20 kilobytes, for a net memory 
allocation of 200 kilobytes.  The resulting malloc_stats() show 600 megabytes 
of allocated memory that cannot be returned to the system.

Over time, fragmentation of the heap can cause excessive paging even though 
actual memory allocation never exceeded system capacity.  With preferred 
arenas used in this way, multi-threaded program memory usage is essentially 
unbounded (or rather, bounded by the number of threads times the actual 
memory usage).

The program run and source code is below, as well as the glibc version from my 
RHEL5 system.  Thank you for your consideration.

[root@lab2-160 test_heap]# ./memx
creating 10 threads
allowing threads to contend to create preferred arenas
display preferred arenas
Arena 0:
system bytes     =     135168
in use bytes     =       2880
Arena 1:
system bytes     =     135168
in use bytes     =       2224
Arena 2:
system bytes     =     135168
in use bytes     =       2224
Arena 3:
system bytes     =     135168
in use bytes     =       2224
Arena 4:
system bytes     =     135168
in use bytes     =       2224
Arena 5:
system bytes     =     135168
in use bytes     =       2224
Total (incl. mmap):
system bytes     =     811008
in use bytes     =      14000
max mmap regions =          0
max mmap bytes   =          0
allowing threads to allocate 100MB each, sequentially in turn
thread 3 alloc 100MB
thread 3 free 100MB-20kB
thread 5 alloc 100MB
thread 5 free 100MB-20kB
thread 7 alloc 100MB
thread 7 free 100MB-20kB
thread 2 alloc 100MB
thread 2 free 100MB-20kB
thread 0 alloc 100MB
thread 0 free 100MB-20kB
thread 8 alloc 100MB
thread 8 free 100MB-20kB
thread 4 alloc 100MB
thread 4 free 100MB-20kB
thread 6 alloc 100MB
thread 6 free 100MB-20kB
thread 9 alloc 100MB
thread 9 free 100MB-20kB
thread 1 alloc 100MB
thread 1 free 100MB-20kB
Arena 0:
system bytes     =  100253696
in use bytes     =      40928
Arena 1:
system bytes     =  100184064
in use bytes     =      42352
Arena 2:
system bytes     =  100163584
in use bytes     =      22320
Arena 3:
system bytes     =  100163584
in use bytes     =      22320
Arena 4:
system bytes     =  100163584
in use bytes     =      22320
Arena 5:
system bytes     =  100204544
in use bytes     =      62384
Total (incl. mmap):
system bytes     =  601133056
in use bytes     =     212624
max mmap regions =          0
max mmap bytes   =          0
[root@lab2-160 test_heap]# rpm -q glibc
glibc-2.5-42.el5_4.2
glibc-2.5-42.el5_4.2
[root@lab2-160 test_heap]# 

====================================================================

[root@lab2-160 test_heap]# cat memx.c
// ****************************************************************************

#include <stdio.h>
#include <time.h>
#include <errno.h>
#include <assert.h>
#include <stdlib.h>
#include <malloc.h>
#include <pthread.h>
#include <inttypes.h>

#define NTHREADS  10
#define NALLOCS  10000
#define ALLOCSIZE  10000

static volatile int go;
static volatile int die;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static void *ps[NALLOCS];  // allocations that are freed in turn by each thread
static void *pps1[NTHREADS];  // straggling allocations to prevent arena free
static void *pps2[NTHREADS];  // straggling allocations to prevent arena free

void
my_sleep(
    int ms
    )
{
    int rv;
    struct timespec ts;
    struct timespec rem;

    ts.tv_sec  = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000;
    for (;;) {
        rv = nanosleep(&ts, &rem);
        if (! rv) {
            break;
        }
        assert(errno == EINTR);
        ts = rem;
    }
}

void *
my_thread(
    void *context
    )
{
    int i;
    int rv;
    void *p;

    // first we spin to get our own arena
    while (go == 0) {
        p = malloc(ALLOCSIZE);
        assert(p);
        if (rand()%20000 == 0) {
            my_sleep(10);
        }
        free(p);
    }

    // then we give main a chance to print stats
    while (go == 1) {
        my_sleep(100);
    }
    assert(go == 2);

    // then one thread at a time, do our big allocs
    rv = pthread_mutex_lock(&mutex);
    assert(! rv);
    printf("thread %d alloc 100MB\n", (int)(intptr_t)context);
    for (i = 0; i < NALLOCS; i++) {
        ps[i] = malloc(ALLOCSIZE);
        assert(ps[i]);
    }
    printf("thread %d free 100MB-20kB\n", (int)(intptr_t)context);
    // N.B. we leave two allocations straggling
    pps1[(int)(intptr_t)context] = ps[0];
    for (i = 1; i < NALLOCS-1; i++) {
        free(ps[i]);
    }
    pps2[(int)(intptr_t)context] = ps[i];
    rv = pthread_mutex_unlock(&mutex);
    assert(! rv);
    return NULL;
}

int
main()
{
    int i;
    int rv;
    pthread_t thread;

    printf("creating %d threads\n", NTHREADS);
    for (i = 0; i < NTHREADS; i++) {
        rv = pthread_create(&thread, NULL, my_thread, (void *)(intptr_t)i);
        assert(! rv);
        rv = pthread_detach(thread);
        assert(! rv);
    }

    printf("allowing threads to contend to create preferred arenas\n");
    my_sleep(20000);

    printf("display preferred arenas\n");
    go = 1;
    my_sleep(1000);
    malloc_stats();

    printf("allowing threads to allocate 100MB each, sequentially in turn\n");
    go = 2;
    my_sleep(5000);
    malloc_stats();

    // free the stragglers
    for (i = 0; i < NTHREADS; i++) {
        free(pps1[i]);
        free(pps2[i]);
    }

    return 0;
}
[root@lab2-160 test_heap]#
Comment 1 Ulrich Drepper 2010-02-09 15:28:34 UTC
You don't understand the difference between address space and allocated memory.
The cost of large amounts of allocated address space is insignificant.

If you don't want it, control it using the MALLOC_ARENA_MAX and MALLOC_ARENA_TEST
envvars.
Comment 2 Rich Testardi 2010-02-09 16:01:58 UTC
Actually, I totally understand the difference and that is why I mentioned the 
fragmentation of memory...  When each arena has just a few straggling 
allocations, the maximum *committed* RAM required for the program's *working 
set* using the thread-preferred arena model is, in fact, N times that required 
for a traditional model, where N is the number of threads.  This shows up in 
real-world thrashing that could actually be avoided.  Basically, if the 
program is doing small allocations, a small percentage of stragglers can pin 
the entire allocated space -- and the allocated space is, in fact, much larger 
than it needs to be (and larger than it is in other OS's).  But thank you for 
your time -- we all want the same thing here, an ever better Linux that is more 
suited to heavily threaded applications. :-)
Comment 3 Rich Testardi 2010-02-10 13:10:18 UTC
Hi Ulrich,

I apologize in advance and want you to know I will not reopen this bug again, 
but I felt I had to show you a new test program that clearly shows "The cost 
of large amounts of allocated address space is insignificant" can be 
exceedingly untrue for heavily threaded systems using large amounts of 
memory.  In our product, we require 2x the RAM on Linux vs other OS's because 
of this. :-(

I've reduced the problem to a program that you can invoke with no options and 
it runs fine, but with the "-x" option it thrashes wildly.  The only 
difference is that in the "-x" case we allow the threads to do some dummy 
malloc/frees up front to create thread-preferred arenas.

The program simply has a bunch of threads that, in turn (i.e., not 
concurrently), allocate a bunch of memory, and then free most (but not all!) 
of it.  The resulting allocations easily fit in RAM, even when fragmented.  It 
then attempts to memset the unfreed memory to 0.

The problem is that in the thread-preferred arena case, the fragmented 
allocations are now spread over 10x the virtual space, and when accessed, 
result in actual commitment of at least 2x the physical space -- enough to 
push us over the top of RAM and into thrashing.

So as a result, without the -x option, the program memset runs in two seconds 
or so on my system (8-way, 2GHz, 12GB RAM); with the -x option, the program 
memset can take hundreds to thousands of seconds.

I know this sounds contrived, but it was in fact *derived* from a real-life 
problem.

All I am hoping to convey is that there are memory intensive applications for 
which thread-preferred arenas actually hurt performance significantly.  
Furthermore, turning on MALLOC_PER_THREAD can actually have an even more 
devastating effect on these applications than the default behavior.  And 
unfortunately, neither MALLOC_ARENA_MAX nor MALLOC_ARENA_TEST can prevent the 
thread-preferred arena proliferation.

The test run output without and with "-x" option are below; the source code is 
below that.

Thank you for your time.  Like I said, I won't reopen this again, but I hope 
you'll consider giving applications like ours a "way out" of the thread-
preferred arenas in the future -- especially since it seems our future is even 
more bleak with MALLOC_PER_THREAD, and that's the way you are moving (and for 
certain applications, MALLOC_PER_THREAD makes sense!).

Anyway, I've already written a small block binned allocator that will live on 
top of mmap'd pages for us for Linux, so we're OK.  But I'd rather just use 
malloc(3).

-- Rich

[root@lab2-160 test_heap]# ./memx2
cpus = 8; pages = 3072694; pagesize = 4096
nallocs = 307200
--- creating 100 threads ---
--- waiting for threads to allocate memory ---
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 
56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 
82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
--- malloc_stats() ---
Arena 0:
system bytes     = 1557606400
in use bytes     =  743366944
Total (incl. mmap):
system bytes     = 1562529792
in use bytes     =  748290336
max mmap regions =          2
max mmap bytes   =    4923392
--- cat /proc/29565/status | grep -i vm ---
VmPeak:  9961304 kB
VmSize:  9951060 kB
VmLck:         0 kB
VmHWM:   2517656 kB
VmRSS:   2517656 kB
VmData:  9945304 kB
VmStk:        84 kB
VmExe:         8 kB
VmLib:      1532 kB
VmPTE:     19432 kB
--- accessing memory ---
--- done in 3 seconds ---


[root@lab2-160 test_heap]# ./memx2 -x
cpus = 8; pages = 3072694; pagesize = 4096
nallocs = 307200
--- creating 100 threads ---
--- allowing threads to create preferred arenas ---
--- waiting for threads to allocate memory ---
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 
56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 
82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
--- malloc_stats() ---
Arena 0:
system bytes     = 1264455680
in use bytes     =  505209392
Arena 1:
system bytes     = 1344937984
in use bytes     =  653695200
Arena 2:
system bytes     = 1396580352
in use bytes     =  705338800
Arena 3:
system bytes     = 1195057152
in use bytes     =  503815408
Arena 4:
system bytes     = 1295818752
in use bytes     =  604577136
Arena 5:
system bytes     = 1094295552
in use bytes     =  403053744
Arena 6:
system bytes     = 1245437952
in use bytes     =  554196272
Arena 7:
system bytes     = 1144676352
in use bytes     =  453434608
Arena 8:
system bytes     = 1346199552
in use bytes     =  654958000
Total (incl. mmap):
system bytes     = 2742448128
in use bytes     =  748234656
max mmap regions =          2
max mmap bytes   =    4923392
--- cat /proc/29669/status | grep -i vm ---
VmPeak: 49213720 kB
VmSize: 49182988 kB
VmLck:         0 kB
VmHWM:  12052384 kB
VmRSS:  11861284 kB
VmData: 49177232 kB
VmStk:        84 kB
VmExe:         8 kB
VmLib:      1532 kB
VmPTE:     95452 kB
--- accessing memory ---
60 secs... 120 secs... 180 secs... 240 secs... 300 secs... 360 secs... 420 
secs... 480 secs... 540 secs... 600 secs... 660 secs... 720 secs... 780 secs...
--- done in 818 seconds ---
[root@lab2-160 test_heap]#


[root@lab2-160 test_heap]# cat memx2.c
// ****************************************************************************

#include <stdio.h>
#include <time.h>
#include <errno.h>
#include <assert.h>
#include <limits.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <pthread.h>
#include <inttypes.h>
#include <sys/types.h>

#define NTHREADS  100
#define ALLOCSIZE  16384
#define STRAGGLERS  100

static uint cpus;
static uint pages;
static uint pagesize;

static uint nallocs;

static volatile int go;
static volatile int done;
static volatile int spin;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static void **ps;  // allocations that are freed in turn by each thread
static int nps;
static void **ss;  // straggling allocations to prevent arena free
static int nss;

void
my_sleep(
    int ms
    )
{
    int rv;
    struct timespec ts;
    struct timespec rem;

    ts.tv_sec  = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000;
    for (;;) {
        rv = nanosleep(&ts, &rem);
        if (! rv) {
            break;
        }
        assert(errno == EINTR);
        ts = rem;
    }
}

void *
my_thread(
    void *context
    )
{
    int i;
    int n;
    int si;
    int rv;
    void *p;

    n = (int)(intptr_t)context;

    while (! go) {
        my_sleep(100);
    }

    // first we spin to get our own arena
    while (spin) {
        p = malloc(ALLOCSIZE);
        assert(p);
        if (rand()%20000 == 0) {
            my_sleep(10);
        }
        free(p);
    }

    my_sleep(1000);

    // then one thread at a time, do our big allocs
    rv = pthread_mutex_lock(&mutex);
    assert(! rv);
    for (i = 0; i < nallocs; i++) {
        assert(i < nps);
        ps[i] = malloc(ALLOCSIZE);
        assert(ps[i]);
    }
    // N.B. we leave 1 of every STRAGGLERS allocations straggling
    for (i = 0; i < nallocs; i++) {
        assert(i < nps);
        if (i%STRAGGLERS == 0) {
            si = nallocs/STRAGGLERS*n + i/STRAGGLERS;
            assert(si < nss);
            ss[si] = ps[i];
        } else {
            free(ps[i]);
        }
    }
    done++;
    printf("%d ", done);
    fflush(stdout);
    rv = pthread_mutex_unlock(&mutex);
    assert(! rv);
    return NULL;
}

int
main(int argc, char **argv)
{
    int i;
    int rv;
    time_t n;
    time_t t;
    time_t lt;
    pthread_t thread;
    char command[128];


    if (argc > 1) {
        if (! strcmp(argv[1], "-x")) {
            spin = 1;
            argc--;
            argv++;
        }
    }
    if (argc > 1) {
        printf("usage: memx2 [-x]\n");
        return 1;
    }

    cpus = sysconf(_SC_NPROCESSORS_CONF);
    pages = sysconf (_SC_PHYS_PAGES);
    pagesize = sysconf (_SC_PAGESIZE);
    printf("cpus = %d; pages = %d; pagesize = %d\n", cpus, pages, pagesize);

    nallocs = pages/10/STRAGGLERS*STRAGGLERS;
    assert(! (nallocs%STRAGGLERS));
    printf("nallocs = %d\n", nallocs);

    nps = nallocs;
    ps = malloc(nps*sizeof(*ps));
    assert(ps);
    nss = NTHREADS*nallocs/STRAGGLERS;
    ss = malloc(nss*sizeof(*ss));
    assert(ss);

    if (pagesize != 4096) {
        printf("WARNING -- this program expects 4096 byte pagesize!\n");
    }

    printf("--- creating %d threads ---\n", NTHREADS);
    for (i = 0; i < NTHREADS; i++) {
        rv = pthread_create(&thread, NULL, my_thread, (void *)(intptr_t)i);
        assert(! rv);
        rv = pthread_detach(thread);
        assert(! rv);
    }
    go = 1;

    if (spin) {
        printf("--- allowing threads to create preferred arenas ---\n");
        my_sleep(5000);
        spin = 0;
    }

    printf("--- waiting for threads to allocate memory ---\n");
    while (done != NTHREADS) {
        my_sleep(1000);
    }
    printf("\n");

    printf("--- malloc_stats() ---\n");
    malloc_stats();
    sprintf(command, "cat /proc/%d/status | grep -i vm", (int)getpid());
    printf("--- %s ---\n", command);
    (void)system(command);

    // access the stragglers
    printf("--- accessing memory ---\n");
    t = time(NULL);
    lt = t;
    for (i = 0; i < nss; i++) {
        memset(ss[i], 0, ALLOCSIZE);
        n = time(NULL);
        if (n-lt >= 60) {
            printf("%d secs... ", (int)(n-t));
            fflush(stdout);
            lt = n;
        }
    }
    if (lt != t) {
        printf("\n");
    }
    printf("--- done in %d seconds ---\n", (int)(time(NULL)-t));

    return 0;
}
[root@lab2-160 test_heap]#
Comment 4 Ulrich Drepper 2010-02-10 13:21:08 UTC
I already described what you can do to limit the number of memory pools.  Just
use it.  If you don't like envvars, use the appropriate mallopt() calls (using
M_ARENA_MAX and M_ARENA_TEST).

No malloc implementation is optimal for all situations.  This is why there are
customization knobs.
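
(A minimal sketch of the mallopt() route suggested above -- the chosen values
and the fallback defines are assumptions for illustration; as the following
comments show, these knobs did not behave as expected on the reporter's
glibc 2.5:)

#include <stdio.h>
#include <malloc.h>

/* Older glibc headers may not define these names; the raw values are the
   ones the reporter uses below (M_ARENA_TEST = -7, M_ARENA_MAX = -8). */
#ifndef M_ARENA_TEST
# define M_ARENA_TEST -7
#endif
#ifndef M_ARENA_MAX
# define M_ARENA_MAX -8
#endif

int main(void)
{
    /* Call before creating threads or doing significant allocation. */
    int rv_test = mallopt(M_ARENA_TEST, 1);  /* arenas created before the limit is checked */
    int rv_max  = mallopt(M_ARENA_MAX, 2);   /* cap on the total number of arenas */
    printf("mallopt: %d %d\n", rv_test, rv_max);  /* glibc returns 1 on success */
    return 0;
}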
Comment 5 Rich Testardi 2010-02-10 13:41:58 UTC
Hi Ulrich,

Agreed 100% no one size fits all...

Unfortunately, neither of the "tuning" settings, MALLOC_ARENA_MAX nor 
MALLOC_ARENA_TEST, seems to work.  Neither does mallopt() with M_ARENA_MAX or 
M_ARENA_TEST. :-(

Part of the problem seems to stem from the fact that the global "narenas" is 
only incremented if MALLOC_PER_THREAD/use_per_thread is true...

#ifdef PER_THREAD
  if (__builtin_expect (use_per_thread, 0)) {
    ++narenas;

    (void)mutex_unlock(&list_lock);
  }
#endif

So the tests of those other variables in reused_arena() never limit anything.  
And setting MALLOC_PER_THREAD makes our problem much worse.

static mstate
reused_arena (void)
{
  if (narenas <= mp_.arena_test)
    return NULL;

  ...

  if (narenas < narenas_limit)
    return NULL;

I also tried all combinations I could imagine of MALLOC_PER_THREAD and the 
other variables, to no avail.  I also did the same with mallopt(), verifying 
at the assembly level that we got all the right values into mp_. :-(

Specifically, I tried things like:

export MALLOC_PER_THREAD=1
export MALLOC_ARENA_MAX=1
export MALLOC_ARENA_TEST=1

and:

    rv = mallopt(-7, 1);
    printf("%d\n", rv);
    rv = mallopt(-8, 1);
    printf("%d\n", rv);

Anyway, thank you.  You've already pointed me in all of the right directions.  
If I did something completely brain-dead, above, feel free to tell me and save 
me another few days of work! :-)

-- Rich
Comment 6 Rich Testardi 2010-02-10 14:29:06 UTC
And a comment for anyone else who might stumble this way...

I *can* reduce the total number of arenas to *2* (not low enough for our 
purposes) with the following sequence:

export MALLOC_PER_THREAD=1

    rv = mallopt(-7, 1);  // M_ARENA_TEST
    printf("%d\n", rv);
    rv = mallopt(-8, 1);  // M_ARENA_MAX
    printf("%d\n", rv);

*PLUS* I have to have a global pthread mutex around every malloc(3) and
free(3) call -- I can't figure out from the code why this is required, but
without it the number of arenas seems independent of the mallopt settings.
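
(A minimal sketch of the global-lock workaround described above; the wrapper
names are hypothetical, not the reporter's actual code:)

#include <stdlib.h>
#include <pthread.h>

static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

/* Serialize all allocation through one mutex; callers must use these
   wrappers instead of calling malloc()/free() directly. */
void *locked_malloc(size_t size)
{
    pthread_mutex_lock(&alloc_lock);
    void *p = malloc(size);
    pthread_mutex_unlock(&alloc_lock);
    return p;
}

void locked_free(void *p)
{
    pthread_mutex_lock(&alloc_lock);
    free(p);
    pthread_mutex_unlock(&alloc_lock);
}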

I cannot get to *1* arena because a) mallopt() won't allow you to set 
arena_test to 0:

#ifdef PER_THREAD
  case M_ARENA_TEST:
    if (value > 0)
      mp_.arena_test = value;
    break;

  case M_ARENA_MAX:
    if (value > 0)
      mp_.arena_max = value;
    break;
#endif

And b) reused_arena() uses a ">=" here rather than a ">":

static mstate
reused_arena (void)
{
  if (narenas <= mp_.arena_test)
    return NULL;


Comment 7 Rich Testardi 2010-02-10 15:52:34 UTC
Last mail...

It turns out the arena_max and arena_test numbers are "fuzzy" (I am sure by 
design), since no lock is held here:

static mstate
internal_function
arena_get2(mstate a_tsd, size_t size)
{
  mstate a;
#ifdef PER_THREAD
  if (__builtin_expect (use_per_thread, 0)) {
    if ((a = get_free_list ()) == NULL
        && (a = reused_arena ()) == NULL)
      /* Nothing immediately available, so generate a new arena.  */
      a = _int_new_arena(size);
    return a;
  }
#endif

Therefore, if narenas is less than the limit tested for in reused_arena(), and 
N threads get into this code at once, narenas can then end up N-1 *above* the 
limit.  The likelihood of this happening is proportional to the malloc arrival 
rate and the time spent in _int_new_arena().

This is exactly what I am seeing.

So if you can live with 2 arenas, the critical thing to do is to make sure 
narenas is exactly 2 before going heavily multi-threaded, and then it won't be 
able to go above 2; otherwise, it can sneak up to 2+N-1, where N is the number 
of threads contending for allocations.

If the ">=" in reused_arena() was changed to ">", then we could use this 
mechanism to limit narenas to exactly 1 right from the get-go.  That would be 
ideal for our kind of applications (that can't live with 2 arenas).
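
(A self-contained sketch of the unlocked check-then-create pattern described
above -- illustrative only, not the glibc source; with enough threads arriving
at once, the final count can end up well above the limit:)

#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

#define NTHREADS 8
#define LIMIT    2

static volatile int count;

static void *worker(void *arg)
{
    (void)arg;
    if (count < LIMIT) {                  /* unlocked check ...              */
        usleep(10000);                    /* ... simulate _int_new_arena() ...*/
        __sync_fetch_and_add(&count, 1);  /* ... then act                    */
    }
    return NULL;
}

int main(void)
{
    int i;
    pthread_t t[NTHREADS];

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("limit = %d, final count = %d\n", LIMIT, count);
    return 0;
}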
Comment 8 Marius Heuler 2011-08-27 21:45:04 UTC
We have exactly the same problem with the current implementation of malloc.

The solution suggested by Ulrich, using M_ARENA_MAX, does not work, since the check for the number of arenas is not thread safe. In fact the limit does not work for heavily threaded applications, where it would be needed most! 

Since the number of cores and the use of threads will only keep increasing, there should be a solution for this kind of application! If the arena limit worked as described, we would have no problem.
Comment 9 Rich Testardi 2011-08-27 22:02:03 UTC
Hi,

We ended up building our own memory allocator -- it's faster and more efficient than glibc's, and it works equally fast with threads and without.

We used the "small block allocator" concept from HP-UX, where we only request huge (32MB) allocations from the system (after setting M_MMAP_THRESHOLD suitably small).

We then carve out large *naturally aligned* 1MB blocks from the huge allocation (accepting 3% waste, since the allocation was page aligned to begin with, not naturally aligned).

And we carve each one of those large blocks into small fixed size buckets (which are fractional powers of 2 -- like 16 bytes, 20, 24, 28, 32, 40, 48, 56, 64, 80, etc.).

Then we put the aligned addresses into a very fast hash and have a linked list for each bucket size.

This means our allocate routine is just a lock, linked list remove, unlock, on average, and our free routine is just a hash lookup, lock, linked list insert, unlock on average.

The trick here is that from any address being freed, you can get back to the naturally aligned 1MB block that contains it with just a pointer mask, and from there you can get the allocation's size as well as the head of the linked list of free entries to which it should be returned...
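
(A minimal sketch of the pointer-mask trick described above -- the names,
block size, and header layout are assumptions for illustration, not the
actual allocator:)

#include <stdint.h>

#define BLOCK_SIZE (1UL << 20)   /* 1MB blocks, naturally aligned */

struct block_header {
    uint32_t bucket_size;        /* size of every chunk carved from this block */
    void *free_list;             /* head of this block's free list */
};

/* Mask any address inside a block down to the block's header. */
static inline struct block_header *block_of(void *p)
{
    return (struct block_header *)((uintptr_t)p & ~(uintptr_t)(BLOCK_SIZE - 1));
}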

-- Rich
Comment 10 Marius Heuler 2011-09-02 07:38:51 UTC
Created attachment 5917 [details]
Memory consumption with glibc malloc and jeMalloc (straight line).
Comment 11 Marius Heuler 2011-09-02 07:44:31 UTC
Comment on attachment 5917 [details]
Memory consumption with glibc malloc and jeMalloc (straight line).

We now changed to another malloc implementation: jemalloc (http://www.canonware.com/jemalloc/), which is an order of magnitude better than the glibc malloc. A similar implementation is also used in the *BSD variants!
Linux/glibc should really improve its malloc, since the current implementation is not sufficient for large applications. 
Why can't this implementation be used inside glibc? Is it a GPL <-> BSD license problem?
Comment 12 Ulrich Drepper 2011-09-11 15:46:13 UTC
Stop reopening.  There is a solution for people who are stupid enough to create too many threads.  No implementation will be perfect for everyone.  The glibc implementation is tuned for reasonable programs and will run much faster than any other I tested.
Comment 13 Rich Testardi 2011-09-11 21:31:37 UTC
Let's all not take things so personally -- nobody here is stupid (and I'm sure some folks here are a *lot* smarter than other folks 
give them credit for)...

There are lots of reasons to create a half dozen threads and that's all it takes to make the glibc version perform absolutely 
horribly.

(And there can be no objective measurement that won't show my version of malloc is faster than yours -- so this has been a win 
all around for us, thanks...)

If you're not interested in improving glibc, you can just say so.

But stop name calling when you feel threatened -- my 5 year old daughter has already outgrown that.

-- Rich
Comment 14 zhannk 2012-07-29 10:09:47 UTC
Ulrich Drepper, this huge virtual memory allocation could be a potential troublemaker on Linux 6 with a 64-bit JVM. 
There is already a document on Hadoop regarding this issue, but their fix of setting MALLOC_ARENA_MAX=4 has no effect; we still found JVMs reported with 30G of virtual memory. 
https://issues.apache.org/jira/browse/HADOOP-7154
Comment 15 Carlos O'Donell 2013-03-14 19:03:05 UTC
This should have been fixed by the following commit:

commit 41b81892f11fe1353123e892158b53de73863d62
Author: Ulrich Drepper <drepper@gmail.com>
Date:   Tue Jan 31 14:42:34 2012 -0500

    Handle ARENA_TEST correctly

I have verified that, using `mallopt (M_ARENA_MAX, 1)', memory usage is bounded by the single arena.

creating 10 threads
allowing threads to contend to create preferred arenas
display preferred arenas
Arena 0:
system bytes     =     135168
in use bytes     =       2880
Total (incl. mmap):
system bytes     =     135168
in use bytes     =       2880
max mmap regions =          0
max mmap bytes   =          0
allowing threads to allocate 100MB each, sequentially in turn
thread 0 alloc 100MB
thread 0 free 100MB-20kB
thread 4 alloc 100MB
thread 4 free 100MB-20kB
thread 9 alloc 100MB
thread 9 free 100MB-20kB
thread 5 alloc 100MB
thread 5 free 100MB-20kB
thread 2 alloc 100MB
thread 2 free 100MB-20kB
thread 7 alloc 100MB
thread 7 free 100MB-20kB
thread 1 alloc 100MB
thread 1 free 100MB-20kB
thread 8 alloc 100MB
thread 8 free 100MB-20kB
thread 6 alloc 100MB
thread 6 free 100MB-20kB
thread 3 alloc 100MB
thread 3 free 100MB-20kB
Arena 0:
system bytes     =  100392960
in use bytes     =     201472
Total (incl. mmap):
system bytes     =  100392960
in use bytes     =     201472
max mmap regions =          0
max mmap bytes   =          0

Therefore the solution to a program with lots of threads is to limit the arenas as a trade-off for memory.
Comment 16 Ondrej Bilka 2013-12-12 00:22:07 UTC
> Therefore the solution to a program with lots of threads is to limit the
> arenas as a trade-off for memory.

That is a band-aid, not a solution. Still, there is no memory returned to the system when one first does allocations and then allocates auxiliary memory, like:

void *calculate (void)
{
  int i;
  void **ary = malloc (1000000 * sizeof (void *));
  for (i = 0; i < 1000000; i++) ary[i] = malloc (100);
  for (i = 0; i <  999999; i++) free (ary[i]);
  return ary[999999];
}

When one acknowledges the bug, a solution is relatively simple: add an UNMAPPED flag for chunks, meaning that all pages completely contained in the chunk were discarded by madvise(s, n, MADV_DONTNEED).

You keep track of in-use and system memory, and when their ratio is bigger than two you mark chunks UNMAPPED, starting from the largest ones, to decrease the system charge.

This deals with the RSS problem. Virtual address space usage could still be excessive, but that is a smaller problem.
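
(A minimal sketch of the madvise() step proposed above -- the helper name and
bookkeeping are assumptions, not glibc code; only pages wholly contained in the
chunk are discarded, and the range stays mapped, so it no longer counts against
RSS but still occupies address space:)

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

static void discard_chunk_pages(void *chunk, size_t len, size_t pagesize)
{
    /* Round the start up and the end down to page boundaries. */
    uintptr_t start = ((uintptr_t)chunk + pagesize - 1) & ~((uintptr_t)pagesize - 1);
    uintptr_t end   = ((uintptr_t)chunk + len) & ~((uintptr_t)pagesize - 1);

    if (end > start)
        madvise((void *)start, end - start, MADV_DONTNEED);
}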
Comment 17 Siddhesh Poyarekar 2013-12-12 03:31:58 UTC
(In reply to Ondrej Bilka from comment #16)
> > Therefore the solution to a program with lots of threads is to limit the
> > arenas as a trade-off for memory.
> 
> That is a bandaid not a solution. Still there is no memory returned to
> system when one first does allocations and then allocates auxiliary memory
> like

You have not understood the bug report.

> void *calculate ()
> {
>   void **ary = malloc (1000000 * sizeof (void *))
>   for (i = 0; i < 1000000; i++) ary[i] = malloc (100);
>   for (i = 0; i <  999999; i++) free (ary [i]);
>   return ary[999999];
> }

This is a different problem from the current bug report, which is about too many arenas getting created, resulting in excessive address space usage, and the MALLOC_ARENA_* variables not working to limit them.  Memory holes not being freed has nothing to do with it.

> When one acknowledges a bug a solution is relatively simple. Add a flag
> UNMAPPED for chunks which means that all pages completely contained in chunk
> were zeroed by madvise(s, n, MADV_DONTNEED).
> 
> You keep track of memory used and system and when their ratio is bigger than
> two you make chunks starting from largest ones UNMAPPED to decrease system
> charge.
> 
> This deals with RSS problem. A virtual space usage could still be excesive
> but that is smaller problem.

The problem you've described is different and I'm sure there's a bug report open for it too.  madvise is not sufficient to free up commit charge; there's a mail thread on libc-alpha that discusses this problem that you can search for and read up on.  I think vm.overcommit_memory is one of the keywords to look for.
Comment 18 Ondrej Bilka 2013-12-12 08:41:51 UTC
On Thu, Dec 12, 2013 at 03:31:58AM +0000, siddhesh at redhat dot com wrote:
> (In reply to Ondrej Bilka from comment #16)
> > > Therefore the solution to a program with lots of threads is to limit the
> > > arenas as a trade-off for memory.
> > 
> > That is a bandaid not a solution. Still there is no memory returned to
> > system when one first does allocations and then allocates auxiliary memory
> > like
> 
> You have not understood the bug report.
>
If you read the discussion more carefully, there are the following posts where
this problem is mentioned:


Ulrich Drepper:

 You don't understand the difference between address space and allocated
 memory.

Rich Testardi:

Actually, I totally understand the difference and that is why I mentioned the 
fragmentation of memory...  When each arena has just a few straggling 
allocations, the maximum *committed* RAM required for the program's *working 
set* using the thread-preferred arena model is, in fact, N times that required 
for a traditional model, where N is the number of threads.  This shows up in 
real-world thrashing that could actually be avoided.  Basically, if the 
program is doing small allocations, a small percentage of stragglers can pin 
the entire allocated space -- and the allocated space is, in fact, much larger 
than it needs to be (and larger than it is in other OS's).  But thank you for
Comment 19 Siddhesh Poyarekar 2013-12-12 10:48:10 UTC
(In reply to Ondrej Bilka from comment #18)
> When you read discussion more carefully there are following posts where
> this problem is mentioned:
> 
> 
> Ulrich Drepper:
> 
>  You don't understand the difference between address space and allocated
>  memory.
> 
> Rich Testardi:
> 
> Actually, I totally understand the difference and that is why I mentioned
> the 
> fragmentation of memory...  When each arena has just a few straggling 
> allocations, the maximum *committed* RAM required for the program's *working 
> set* using the thread-preferred arena model is, in fact, N times that
> required 
> for a traditional model, where N is the number of threads.  This shows up in 
> real-world thrashing that could actually be avoided.  Basically, if the 
> program is doing small allocations, a small percentage of stragglers can pin 
> the entire allocated space -- and the allocated space is, in fact, much
> larger 
> than it needs to be (and larger than it is in other OS's).  But thank you for

Right, but most comments on the bug report (and the resolution) are in the context of malloc creating too many arenas and the switches not working to limit them.  A single allocation blocking an entire free region is not a multi-threaded problem - it occurs in single-threaded programs too and is only compounded by multiple arenas.  I'd suggest working with a fresh bug report, or an open bug report that describes this problem exactly (and I'm pretty sure there is one).
Comment 20 Jackie Rosen 2014-02-16 19:42:53 UTC Comment hidden (spam)
Comment 21 Carlos O'Donell 2015-02-12 20:04:34 UTC
I'm marking this fixed, since the tunables that limit arena creation are fixed. You can limit the number of arenas in your application at the cost of thread contention during allocation (increased malloc latency). This does, however, limit the total VA usage, which is particularly important for 32-bit applications running close to the 32-bit VA limit.