Excessive memory consumption when using malloc()

Christian Hoff christian_hoff@gmx.net
Mon Nov 29 19:44:03 GMT 2021


Hello all,

In the meantime, I have spent quite some time analyzing the reports
generated by malloc_info() at various stages of my program, and I have
run a few additional experiments. By now, I am quite confident that I
have found the root cause of the problem.

As you might remember, my application spawns two threads that execute
the memory-intensive computations. The lifetime of these threads is
limited to the computations: both threads terminate when a round of
computations ends, and two fresh threads are started when the next
round begins.
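
A simplified sketch of this lifecycle (not our actual code; the
function names are made up):

#include <pthread.h>

/* Each round of computations gets fresh threads, so glibc may
   attach them to different arenas on every run. */
static void *compute(void *arg)
{
    (void) arg;
    /* ... allocates and frees ~5 GiB in ~512 KiB chunks ... */
    return NULL;
}

static void run_computation_round(void)
{
    pthread_t workers[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&workers[i], NULL, compute, NULL);
    for (int i = 0; i < 2; i++)
        pthread_join(workers[i], NULL);  /* threads exit here */
}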

The reason why memory consumption increases so much over time must be
that the newly started computation threads are assigned to different
arenas more or less at random, and memory apparently can never move
from one arena to another. This matches what I found in the tcmalloc
documentation (http://goog-perftools.sourceforge.net/doc/tcmalloc.html):

> ptmalloc2 also reduces lock contention by using per-thread arenas
> but there is a big problem with ptmalloc2's use of per-thread arenas.
> In ptmalloc2 memory can never move from one arena to another. This
> can lead to huge amounts of wasted space. For example, in one Google
> application, the first phase would allocate approximately 300MB of
> memory for its data structures. When the first phase finished, a
> second phase would be started in the same address space. If this
> second phase was assigned a different arena than the one used by the
> first phase, this phase would not reuse any of the memory left after
> the first phase and would add another 300MB to the address space.
> Similar memory blowup problems were also noticed in other
> applications.

That seems to describe exactly the issue I am experiencing. I tried
changing the number of arenas (via mallopt(M_ARENA_MAX, ...)) and found
that the memory usage of my application varies greatly with the number
of arenas. Memory usage remains below 10 GB if I execute the
calculations repeatedly with just one arena, but it increases as soon
as I configure my application to use 2 or more arenas. With 5 arenas,
memory usage gets so high that my processing workstation eventually
runs out of memory.
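
For reference, this is how I limited the arena count in these
experiments (a minimal sketch; the same effect can be had without
recompiling via the MALLOC_ARENA_MAX environment variable or the
glibc.malloc.arena_max tunable):

#include <stdio.h>
#include <malloc.h>

int main(void)
{
    /* Must run early, before any thread triggers the creation of
       additional arenas. mallopt() returns 1 on success. */
    if (mallopt(M_ARENA_MAX, 1) != 1)
        fprintf(stderr, "mallopt(M_ARENA_MAX) failed\n");

    /* ... start the computation threads ... */
    return 0;
}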

In the reports generated by malloc_info(), I can see that the size of
1 or 2 arenas increases from run to run of the computations. Because of
heap fragmentation, spare memory in the currently unused arenas is not
returned to the OS, and the memory footprint of the application grows.
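
In case anyone wants to reproduce this kind of analysis: I captured
the reports roughly like this (a sketch; the file naming is just what
I happened to use):

#include <stdio.h>
#include <malloc.h>

/* Dump the allocator state (an XML report covering every arena)
   after each round of computations. */
static void dump_malloc_info(int round)
{
    char name[64];
    snprintf(name, sizeof(name), "after_run_no_%d.txt", round);
    FILE *fp = fopen(name, "w");
    if (fp != NULL) {
        malloc_info(0, fp);  /* the options argument must be 0 */
        fclose(fp);
    }
}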

Of course, having glibc call malloc_trim() internally (as is currently
being discussed on this mailing list) would solve this problem, but the
way tcmalloc handles large allocations (> 256 KB) looks even more
promising to me. In tcmalloc, large allocations are served by the
backend, and there is only one backend per application. This ensures
that big memory chunks can be reused across thread boundaries, which
avoids the memory blowup in my application. I guess tcmalloc is the
better choice for me: even if glibc malloc called malloc_trim()
internally, I suspect the repeated calls to malloc_trim() would slow my
application down, whereas with tcmalloc it should perform better. Even
though tcmalloc also cannot compact a fragmented heap, that hurts much
less because there is only one "arena" where large memory chunks are
allocated, namely the backend, and that one arena never grows so large
that my workstation runs out of memory.
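
For completeness, the malloc_trim() workaround I experimented with is
essentially just this (a sketch):

#include <malloc.h>

/* Call this after the computation threads have exited and all their
   allocations have been free()'d. Since glibc 2.8, malloc_trim()
   walks all arenas and releases whole free pages back to the kernel
   (via madvise()), not just the top of the main heap. */
static void release_allocator_caches(void)
{
    malloc_trim(0);  /* pad = 0: release as much as possible */
}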


Best regards,

    Christian

On 11/26/21 6:58 PM, Christian Hoff via Libc-help wrote:
> Hello Carlos,
>
> many thanks for your support.
>
> On 11/25/21 7:20 PM, Carlos O'Donell wrote:
>
>> How many cpus does the system have?
> We have 8 CPUs.
>> How many threads do you create?
> Our computations are running in 2 threads. However, the lifetime of both
> threads is limited to the duration of the computation. Both threads exit
> after the calculations are complete. The next round of computations will
> be performed by 2 other, newly started threads.
>> Is this 10GiB of RSS or VSS?
>
> It is 10 GiB of RSS. The calculations allocate 10 GiB of memory each
> time; because that memory is actually used, it becomes resident.
> After the calculations are done, we free() this memory again. At this
> point, the memory is returned to the glibc allocator.
>
>> This coalescing and freeing is prevented if there are in-use chunks
>> in the heap.
>>
>> Consider this scenario:
>> - Make many large allocations that have a short lifetime.
>> - Make one small allocation that has a very long lifetime.
>> - Free all the large allocations.
>>
>> The heap cannot be freed downwards because of the small
>> long-lifetime allocation.
>>
>> The call to malloc_trim() walks the heap chunks and frees page-sized
>> chunks or
>> larger without the requirement that they come from the top of the heap.
>>
>> In glibc's allocator, mixing lifetimes for allocations will cause
>> heap growth.
> I think that is exactly what is happening in our case. Thanks for the
> explanation!
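>
> A contrived sketch of this pinning effect (not from our code base;
> the block size is chosen to stay below the mmap threshold so the
> allocations land on the heap):
>
> #include <stdlib.h>
> #include <malloc.h>
>
> int main(void)
> {
>     enum { N = 1000, SZ = 64 * 1024 };
>     void *big[N];
>
>     for (int i = 0; i < N; i++)  /* short-lived allocations */
>         big[i] = malloc(SZ);
>     void *pin = malloc(32);      /* long-lived, near the heap top */
>     for (int i = 0; i < N; i++)
>         free(big[i]);
>
>     /* sbrk() cannot shrink the heap past 'pin', so the ~60 MiB
>        below it stays cached in the arena. malloc_trim(0) can still
>        release the whole free pages underneath. */
>     malloc_trim(0);
>     free(pin);
>     return 0;
> }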
>> I have an important question to ask now:
>>
>> Do you use aligned allocations?
>>
>> We have right now an outstanding defect where aligned allocations
>> create small
>> residual free chunks, and when free'd back and allocated again as an
>> aligned
>> chunk, we are forced to split chunks again, which can lead to
>> ratcheting effects
>> with certain aligned allocations.
>>
>> We had a prototype patch for this in Fedora in 2019:
>> https://lists.fedoraproject.org/archives/list/glibc@lists.fedoraproject.org/thread/2PCHP5UWONIOAEUG34YBAQQYD7JL5JJ4/
>>
>>
> No, the 512 KiB allocations for the computation are not aligned; we
> just request them with plain malloc(). The application is a Java
> application that runs some native C++ code, and I don't know whether
> Java itself makes aligned allocations. But the vast majority of
> allocations are ~512 KiB, and these are not aligned.
>>> And then we also have one other problem. The first run of the
>>> computations is always fine: we allocate 10 GB of memory and the
>>> application grows to 10 GB. Afterwards, we release those 10 GB of
>>> memory
>>> since the computations are now done and at this point the freed memory
>>> is returned back to the allocator (however, the size of the process
>>> remains 10 GB unless we call malloc_trim()). But if we now re-run the
>>> same computations again a second time (this time using different
>>> threads), a problem occurs. In this case, the size of the application
>>> grows well beyond 10 GB. It can get 20 GB or larger and the process is
>>> eventually killed because the system runs out of memory.
>> You need to determine what is going on under the hood here.
>>
>> You may want to just use malloc_info() to get a routine dump of the
>> heap state.
>>
>> This will give us a starting point to see what is growing.
>
> To make it easier to run multiple rounds of calculations, I have now
> modified the code a bit so that only ~5 GiB of memory is allocated
> each time we perform the computations. The 5 GiB are still allocated
> in chunks of about 512 KiB. After the calculations, all this memory
> is free()'d again.
>
> After running the calculations for the first time, we see that the
> application consumes about 5 GiB of RSS memory. But our problem is
> that if we run the computations a second and a third time, the memory
> usage grows beyond 5 GiB, even though we release all the memory that
> the calculations consumed after each iteration. After running the
> same workload 12 times, our processing workstation runs out of memory
> and becomes very slow. The memory consumption after each round of
> calculations is captured in the table below.
>
> After Iteration    Memory Consumption
>  1                  5.13 GiB
>  2                  8.9 GiB
>  3                 10.48 GiB
>  4                 14.48 GiB
>  5                 18.11 GiB
>  6                 16.03 GiB
>  ...                ...
> 12                 21.79 GiB
>
> As you can see, the RSS memory usage of our application increases
> continuously, especially during the first few rounds of calculations.
> Our expectation would be that the RSS memory usage remains at 5 GiB as
> the computations only allocate about 5 GiB of memory each time. After
> running the computations 12 times, glibc allocator caches have grown to
> over 20 GiB. All this memory can only be reclaimed by calling
> malloc_trim().
>
> I have also attached the traces from malloc_info() after each
> iteration of the computations. The first trace ("after_run_no_1.txt")
> was captured after running the computations once and shows a
> relatively low memory usage of 5 GiB. But in the subsequent traces,
> memory consumption increases. Our application has 64 arenas (the
> glibc default of 8 arenas per core on our 8-CPU machine).
>
> As I mentioned, the lifetime of the 2 computation threads is limited to
> the computation itself. The threads will be restarted with each run of
> the computations. Once the computation starts, the threads are assigned
> to an arena. Could it be that these two threads are always assigned to
> different arenas on each run of the computations and could that explain
> that the glibc allocator caches are growing from run to run?
>
>> We have a malloc allocation tracer that you can use to capture a
>> workload and
>> share a snapshot of the workload with upstream:
>> https://pagure.io/glibc-malloc-trace-utils
>>
>> Sharing the workload might be hard because this is a full API trace
>> and it gets
>> difficult to share.
>
> For now, I haven't done that yet (as it would be difficult to share,
> just as you said). I hope that the malloc_info() traces already give a
> picture of what is happening. But if we need them, I will be happy to
> capture these traces.
>
>
> Best regards,
>
>    Christian
>

