Excessive memory consumption when using malloc()
Adhemerval Zanella
adhemerval.zanella@linaro.org
Thu Nov 25 20:56:11 GMT 2021
On 25/11/2021 15:21, Carlos O'Donell via Libc-help wrote:
> On 11/25/21 13:12, Konstantin Kharlamov via Libc-help wrote:
>> So there you go, your 10G of unreleased memory is a Glibc feature, no complaints
>> ;-P
>
> Freeing memory back to the OS is a form of cache invalidation, and cache
> invalidation is hard and workload dependent.
>
> In this specific case, particularly with 50MiB, you are within the 64MiB
> 64-bit process heap size, and the 1024-byte frees do not trigger the
> performance expensive consolidation and heap reduction (which requires
> a munmap syscall to release the resources).
>
> In the case of 10GiB, and 512KiB allocations, we are talking different
> behaviour. I have responded here with my recommendations:
> https://sourceware.org/pipermail/libc-help/2021-November/006052.html
>
The BZ#27103 issue seems to be memory fragmentation caused by the use of
sbrk() plus deallocation done in reverse order, which prevents free() from
coalescing the previous allocations automatically.
For instance with the testcase below:
$ gcc -Wall test.c -o test -DNTIMES=50000 -DCHUNK=1024
$ ./test
memory usage: 1036 Kb
allocate ...done
memory usage: 52812 Kb
If you force the mmap usage:
$ GLIBC_TUNABLES=glibc.malloc.mmap_threshold=0 ./test
memory usage: 1044 Kb
allocate ...done
memory usage: 2052 Kb
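The same effect can also be requested from within the program instead of
through the environment; a minimal sketch using mallopt(), assuming a
process-wide change at startup is acceptable:

#include <malloc.h>   /* mallopt, M_MMAP_THRESHOLD */
#include <stdlib.h>

int
main (void)
{
  /* Service allocation requests through mmap() instead of sbrk(),
     equivalent to GLIBC_TUNABLES=glibc.malloc.mmap_threshold=0.  Note
     that setting the threshold explicitly also disables the dynamic
     threshold adjustment.  */
  if (mallopt (M_MMAP_THRESHOLD, 0) != 1)
    return EXIT_FAILURE;

  void *p = malloc (1024);   /* should now get its own mapping */
  free (p);                  /* and be munmap()ed right away */
  return 0;
}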
As Carlos put it, it is a tradeoff: sbrk() is usually faster at expanding
the data segment than mmap(), and subsequent allocations will fill the
fragmented heap (so multiple allocations avoid further memory
fragmentation).
Just to give a comparison, always using mmap() incurs more page faults and
considerably more CPU utilization:
$ perf stat ./test
memory usage: 964 Kb
allocate ...done
memory usage: 52796 Kb
memory usage: 52796 Kb
allocate ...done
memory usage: 52796 Kb
 Performance counter stats for './test':

             15.22 msec task-clock                #    0.983 CPUs utilized
                 0      context-switches          #    0.000 /sec
                 0      cpu-migrations            #    0.000 /sec
            12,853      page-faults               #  844.546 K/sec
        68,518,548      cycles                    #    4.502 GHz                      (73.73%)
           480,717      stalled-cycles-frontend   #    0.70% frontend cycles idle     (73.72%)
             2,333      stalled-cycles-backend    #    0.00% backend cycles idle      (73.72%)
       105,356,108      instructions              #    1.54  insn per cycle
                                                  #    0.00  stalled cycles per insn  (91.81%)
        23,787,860      branches                  #    1.563 G/sec
            58,990      branch-misses             #    0.25% of all branches          (87.01%)

       0.015478114 seconds time elapsed

       0.010348000 seconds user
       0.005174000 seconds sys
$ perf stat env GLIBC_TUNABLES=glibc.malloc.mmap_threshold=0 ./test
memory usage: 956 Kb
allocate ...done
memory usage: 2012 Kb
memory usage: 2012 Kb
allocate ...done
memory usage: 2012 Kb
 Performance counter stats for 'env GLIBC_TUNABLES=glibc.malloc.mmap_threshold=0 ./test':

            156.52 msec task-clock                #    0.998 CPUs utilized
                 1      context-switches          #    6.389 /sec
                 0      cpu-migrations            #    0.000 /sec
           100,228      page-faults               #  640.338 K/sec
       738,047,682      cycles                    #    4.715 GHz                      (82.11%)
         8,779,463      stalled-cycles-frontend   #    1.19% frontend cycles idle     (82.11%)
            34,195      stalled-cycles-backend    #    0.00% backend cycles idle      (82.97%)
     1,254,219,911      instructions              #    1.70  insn per cycle
                                                  #    0.01  stalled cycles per insn  (84.68%)
       237,180,662      branches                  #    1.515 G/sec                    (84.67%)
           687,051      branch-misses             #    0.29% of all branches          (83.46%)

       0.156904324 seconds time elapsed

       0.024142000 seconds user
       0.132786000 seconds sys
That's why I think making mmap() the default strategy might not be the best
choice. What we might improve is to add a heuristic that calls malloc_trim
once a certain level of fragmentation in the main_arena is detected. The
question is which metric and threshold to use. The trimming does have a cost,
but I think it is worth it to decrease fragmentation and memory utilization.
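Just to make the idea concrete (this is only an illustration of the concept,
not the heuristic malloc itself would implement), an application can already
approximate it with mallinfo2() (glibc 2.33 or newer) and malloc_trim(); the
metric and the 25% threshold below are arbitrary placeholders:

#include <malloc.h>   /* mallinfo2, malloc_trim */

/* Call after bulk frees: if more than ~25% of the main arena consists of
   free chunks that glibc still holds, ask malloc to give releasable memory
   back to the kernel.  Both the metric and the threshold are examples.  */
static void
maybe_trim (void)
{
  struct mallinfo2 mi = mallinfo2 ();

  /* mi.arena: bytes obtained via sbrk(); mi.fordblks: bytes sitting in
     free chunks; mi.keepcost: releasable bytes at the top of the heap.  */
  if (mi.arena > 0 && mi.fordblks > mi.arena / 4)
    malloc_trim (0);
}

Calling something like this after the bulk free in the testcase below should
bring the RSS back down, at the cost of the extra work malloc_trim does to
walk the arena and release the pages.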
---
$ cat test.c
#include <stdlib.h>
#include <fcntl.h>
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>

static size_t pagesize;

/* Return the current resident set size in bytes, read from
   /proc/self/statm (the second field, counted in pages).  */
static size_t
read_rss (void)
{
  int fd = open ("/proc/self/statm", O_RDONLY);
  assert (fd != -1);

  char line[256];
  ssize_t r = read (fd, line, sizeof (line) - 1);
  assert (r != -1);
  line[r] = '\0';

  size_t rss;
  int n = sscanf (line, "%*u %zu", &rss);
  assert (n == 1);

  close (fd);
  return rss * pagesize;
}

/* Allocate NTIMES chunks of CHUNK bytes, touch them, and then free them
   in reverse order of allocation.  */
static void *
allocate (void *args)
{
  enum { chunk = CHUNK };
  void *chunks[NTIMES];

  for (int i = 0; i < NTIMES; i++)
    {
      chunks[i] = malloc (chunk);
      assert (chunks[i] != NULL);
      memset (chunks[i], 0, chunk);
    }

  for (int i = NTIMES - 1; i >= 0; i--)
    free (chunks[i]);

  return NULL;
}

int main (int argc, char *argv[])
{
  long ps = sysconf (_SC_PAGESIZE);
  assert (ps != -1);
  pagesize = ps;

  printf ("memory usage: %zu Kb\n", read_rss () / 1024);
  printf ("allocate ...");
  allocate (NULL);
  printf ("done\n");
  printf ("memory usage: %zu Kb\n", read_rss () / 1024);

  return 0;
}