Excessive memory consumption when using malloc()

Adhemerval Zanella adhemerval.zanella@linaro.org
Thu Nov 25 20:56:11 GMT 2021



On 25/11/2021 15:21, Carlos O'Donell via Libc-help wrote:
> On 11/25/21 13:12, Konstantin Kharlamov via Libc-help wrote:
>> So there you go, you 10G of unreleased memory is a Glibc feature, no complaints
>> ;-P
> 
> Freeing memory back to the OS is a form of cache invalidation, and cache
> invalidation is hard and workload dependent.
> 
> In this specific case, particularly with 50MiB, you are within the 64MiB
> 64-bit process heap size, and the 1024-byte frees do not trigger the
> performance expensive consolidation and heap reduction (which requires
> a munmap syscall to release the resources).
> 
> In the case of 10GiB, and 512KiB allocations, we are talking different
> behaviour. I have responded here with my recommendations:
> https://sourceware.org/pipermail/libc-help/2021-November/006052.html
> 
The BZ#27103 issue seems to be memory fragmentation due to the use of
sbrk() plus the deallocations being done in reverse order, which prevents
free() from automatically coalescing the previous allocations.
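
One way to see how much freed memory the main arena is still holding is
to query the allocator state after the frees, for instance with a sketch
like the one below (mallinfo2() requires glibc 2.33 or later; fordblks is
the total free space in the arena and keepcost the amount releasable at
the top of the heap):

#include <malloc.h>
#include <stdio.h>

/* Sketch: print how much free space the main arena still holds.  */
static void
print_arena_free_space (void)
{
  struct mallinfo2 mi = mallinfo2 ();
  printf ("free bytes in arena:    %zu\n", mi.fordblks);
  printf ("releasable at heap top: %zu\n", mi.keepcost);
}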

For instance with the testcase below:

$ gcc -Wall test.c -o test -DNTIMES=50000 -DCHUNK=1024
$ ./test
memory usage: 1036 Kb
allocate ...done
memory usage: 52812 Kb

If you force mmap() usage:

$ GLIBC_TUNABLES=glibc.malloc.mmap_threshold=0 ./test
memory usage: 1044 Kb
allocate ...done
memory usage: 2052 Kb
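
The same behaviour can also be selected from inside the program, for
instance with a mallopt() call before the first allocation (a sketch
equivalent to the tunable above):

#include <malloc.h>

int
main (void)
{
  /* Serve every request with mmap() instead of the sbrk()-grown heap,
     equivalent to GLIBC_TUNABLES=glibc.malloc.mmap_threshold=0.  */
  mallopt (M_MMAP_THRESHOLD, 0);

  /* ... allocation workload ... */
  return 0;
}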

As Carlos has put it, this is a tradeoff: sbrk() is usually faster at
expanding the data segment than mmap(), and subsequent allocations will
fill the fragmented heap (so repeated allocations avoid further memory
fragmentation).
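
The perf runs below exercise the allocation round twice to show this;
presumably the measurement block in main() was simply repeated, along the
lines of the following sketch (the testcase at the end runs it only once):

  /* Sketch: a second round of the same allocations reuses the
     already-fragmented heap, so RSS does not grow any further.  */
  printf ("memory usage: %zu Kb\n", read_rss () / 1024);
  printf ("allocate ...");
  allocate (NULL);
  printf ("done\n");
  printf ("memory usage: %zu Kb\n", read_rss () / 1024);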

Just to give a comparison, always using mmap() incurs more page faults
and considerably more CPU utilization:

$ perf stat ./test
memory usage: 964 Kb
allocate ...done
memory usage: 52796 Kb
memory usage: 52796 Kb
allocate ...done
memory usage: 52796 Kb

 Performance counter stats for './test':

             15.22 msec task-clock                #    0.983 CPUs utilized          
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
            12,853      page-faults               #  844.546 K/sec                  
        68,518,548      cycles                    #    4.502 GHz                      (73.73%)
           480,717      stalled-cycles-frontend   #    0.70% frontend cycles idle     (73.72%)
             2,333      stalled-cycles-backend    #    0.00% backend cycles idle      (73.72%)
       105,356,108      instructions              #    1.54  insn per cycle         
                                                  #    0.00  stalled cycles per insn  (91.81%)
        23,787,860      branches                  #    1.563 G/sec                  
            58,990      branch-misses             #    0.25% of all branches          (87.01%)

       0.015478114 seconds time elapsed

       0.010348000 seconds user
       0.005174000 seconds sys


$ perf stat env GLIBC_TUNABLES=glibc.malloc.mmap_threshold=0 ./test
memory usage: 956 Kb
allocate ...done
memory usage: 2012 Kb
memory usage: 2012 Kb
allocate ...done
memory usage: 2012 Kb

 Performance counter stats for 'env GLIBC_TUNABLES=glibc.malloc.mmap_threshold=0 ./test':

            156.52 msec task-clock                #    0.998 CPUs utilized          
                 1      context-switches          #    6.389 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
           100,228      page-faults               #  640.338 K/sec                  
       738,047,682      cycles                    #    4.715 GHz                      (82.11%)
         8,779,463      stalled-cycles-frontend   #    1.19% frontend cycles idle     (82.11%)
            34,195      stalled-cycles-backend    #    0.00% backend cycles idle      (82.97%)
     1,254,219,911      instructions              #    1.70  insn per cycle         
                                                  #    0.01  stalled cycles per insn  (84.68%)
       237,180,662      branches                  #    1.515 G/sec                    (84.67%)
           687,051      branch-misses             #    0.29% of all branches          (83.46%)

       0.156904324 seconds time elapsed

       0.024142000 seconds user
       0.132786000 seconds sys

That's why I don't think making mmap() the default strategy is the best
option. What we might improve is to add a heuristic that calls malloc_trim()
once a certain level of fragmentation in the main_arena is detected.
The question is which metric and threshold to use.  The trimming does have
a cost, but I think it is worth it to decrease fragmentation and memory
utilization.
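
Until such a heuristic exists, an application that has just released a
large batch of objects can trigger the trimming itself with the existing
interface, something along these lines:

#include <malloc.h>

/* Sketch: ask glibc to return unused heap pages to the kernel after a
   large batch of frees.  A pad of 0 releases as much as possible.  */
static void
release_freed_memory (void)
{
  malloc_trim (0);
}

Since glibc 2.8 malloc_trim() also attempts to release free pages found in
the middle of the heap (via madvise), not only at the top, so it can help
with exactly this kind of fragmentation.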

---

$ cat test.c
#include <stdlib.h>
#include <fcntl.h>
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>

static size_t pagesize;

/* Return the resident set size in bytes, read from the second field
   (resident pages) of /proc/self/statm.  */
static size_t
read_rss (void)
{
  int fd = open ("/proc/self/statm", O_RDONLY);
  assert (fd != -1);
  char line[256];
  ssize_t r = read (fd, line, sizeof (line) - 1);
  assert (r != -1);
  line[r] = '\0';
  size_t rss;
  sscanf (line, "%*u %zu %*u %*u 0 %*u 0\n", &rss);
  close (fd);
  return rss * pagesize;
}

/* Allocate NTIMES chunks of CHUNK bytes, touch them, and then free
   them in the reverse order of allocation.  */
static void *
allocate (void *args)
{
  enum { chunk = CHUNK };

  void *chunks[NTIMES];
  for (size_t i = 0; i < sizeof (chunks) / sizeof (chunks[0]); i++)
    {
      chunks[i] = malloc (chunk);
      assert (chunks[i] != NULL);
      memset (chunks[i], 0, chunk);
    }

  for (int i = NTIMES - 1; i >= 0; i--)
    free (chunks[i]);

  return NULL;
}

int main (int argc, char *argv[])
{
  long ps = sysconf (_SC_PAGESIZE);
  assert (ps != -1);
  pagesize = ps;
  {
    printf ("memory usage: %zu Kb\n", read_rss () / 1024);
    printf ("allocate ...");
    allocate (NULL);
    printf ("done\n");
    printf ("memory usage: %zu Kb\n", read_rss () / 1024);
  }

  return 0;
} 

