Bug 30625 - Moving "free(buf)" slows down code x1.6
Summary: Moving "free(buf)" slows down code x1.6
Status: UNCONFIRMED
Alias: None
Product: glibc
Classification: Unclassified
Component: malloc
Version: 2.36
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-07-10 13:01 UTC by Askar Safin
Modified: 2024-01-14 18:58 UTC
CC: 2 users

See Also:
Host:
Target:
Build:
Last reconfirmed:
fweimer: security-


Description Askar Safin 2023-07-10 13:01:05 UTC
A small adjustment to the placement of malloc/free slows down the code x1.6. It seems I have found some worst-case behavior in glibc's allocator.

Here is the code:

====
#define HASH_VEC_ITEM_SIZE (3 * 8)
#define BUF_SIZE 4194304

#include <stdlib.h>
#include <stddef.h>
#include <string.h>

#define ASSERT(cond) do { if(!(cond))abort(); } while(0)

struct vec
{
  unsigned char *data;
  size_t size;
  size_t capacity; // capacity >= size
};

// We emulate rust's Vec::push
void
push (struct vec *v, size_t item_size, unsigned char *new_data)
{
  ASSERT(v->capacity >= v->size);
  if (v->size + item_size <= v->capacity)
    {
      memcpy(v->data + v->size, new_data, item_size);
      v->size += item_size;
      return;
    }
  v->capacity *= 2;
  if(v->capacity < v->size + item_size)
    {
      v->capacity = v->size + item_size;
    }
  v->data = realloc(v->data, v->capacity);
  memcpy(v->data + v->size, new_data, item_size);
  v->size += item_size;
  ASSERT(v->capacity >= v->size);
}

// To prevent optimization
// https://boringssl.googlesource.com/boringssl/+/e38cf79cdf47606f6768fb85dc066d7ebce304ac/crypto/internal.h#281
void
black_box (unsigned char *arg) __attribute__((noinline));
void
black_box (unsigned char *arg)
{
  asm volatile("" : "+r"(arg) :);
  asm volatile("" : "+r"(arg[0]) :);
}

int
main ()
{
  struct vec hash_vec = { .data = malloc(HASH_VEC_ITEM_SIZE), .size = 0, .capacity = HASH_VEC_ITEM_SIZE };
  for(int n = 0; n != 100; ++n)
    {
      unsigned char *buf = calloc(BUF_SIZE, 1);
      for(int i = 0; i != 5; ++i)
        {
          unsigned char *buf_clone = malloc(BUF_SIZE);
          memcpy(buf_clone, buf, BUF_SIZE);
          black_box(buf_clone);
          free(buf_clone);
        }
      calloc(2, 1); // We don't free this memory, we free everything else
      free(buf); //bad placement
      unsigned char new_item[HASH_VEC_ITEM_SIZE] = {0};
      push(&hash_vec, HASH_VEC_ITEM_SIZE, new_item);
      //free(buf); //good placement
    }
  free(hash_vec.data);
}
====

Compile with: "gcc -O3 -o /tmp/t /tmp/t.c"

"/lib64/ld-linux-x86-64.so.2 --version" output:

====
ld.so (Debian GLIBC 2.36-9) stable release version 2.36.
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
====

gcc is 12.3.0.

Everything runs on Debian sid x86_64, in a Docker container, on Linux 5.10.0.

With "free(buf)" at the "good placement", the code above runs in 0.17 s; at the "bad placement" it takes 0.28 s (i.e. x1.6 slower).

This bug was triggered by a real production case. Here is the context: https://github.com/rust-lang/rust/issues/113504 . @saethlin reduced the example further here: https://github.com/rust-lang/rust/issues/113504#issuecomment-1627852074 and then I reduced it even further to C in this (glibc) bug report.

So, a small adjustment to the placement of "free" slows the code down significantly; I think this is a bug. @saethlin also adds: "Based on profiling, the difference seems attributable to a 44x (!!) difference in the number of page faults between the two implementations. If I swap in jemalloc or mimalloc, the difference in runtime and page faults goes away. So I strongly suspect that this code is generating some worst-case behavior in glibc's allocator."
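
If someone wants to double-check the page fault numbers without external tools, here is a minimal sketch (not part of the reproducer above; print_fault_counters is a made-up helper) that reads the process counters with getrusage(). Paste it into the reproducer and call it once before and once after the loop in main(), then compare the two outputs for the two placements:

====
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: prints the minor/major page fault counters of the
   calling process, so the two free() placements can be compared directly. */
static void
print_fault_counters (const char *label)
{
  struct rusage ru;
  if (getrusage (RUSAGE_SELF, &ru) == 0)
    printf ("%s: minor faults = %ld, major faults = %ld\n",
            label, ru.ru_minflt, ru.ru_majflt);
}
====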
Comment 1 Florian Weimer 2023-07-11 07:37:41 UTC
The bad placement results in many brk calls (about three per loop iteration). It's still present in the current sources.

The trimming heuristics are not unreasonable in this case, given how large the BUF_SIZE allocation is. What you call “good placement” prevents heap trimming; others have complained that we don't release memory in the middle of the heap to the operating system, and they would call this “bad placement”.

We just don't know if the program will go to sleep after the free call, or enter some longer computation not calling malloc/free, so if free detects that there is a certain amount of memory that could be returned to the kernel, it does that.
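
If you want to confirm that trimming is what costs you the extra brk calls and page faults, one thing you could try is raising the trim threshold at the start of main(). This is only a sketch for testing, not a recommended fix; the value below is arbitrary (see mallopt(3), M_TRIM_THRESHOLD):

====
#include <malloc.h>

int
main (void)
{
  /* Raise the trim threshold so free() stops returning memory at the top
     of the heap to the kernel; per mallopt(3), a value of -1 disables
     trimming entirely.  256 MiB here is just an example value.  */
  mallopt (M_TRIM_THRESHOLD, 256 * 1024 * 1024);

  /* ... rest of the reproducer from the description ... */
  return 0;
}
====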
Comment 2 Askar Safin 2023-07-11 11:21:59 UTC
Please return memory to the OS only if the following two conditions hold at the same time:
- Allocated memory is significantly smaller than the latest peak
- Enough time has passed since the latest peak
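
Roughly what I mean, as a sketch with made-up names and example thresholds (nothing here is glibc code):

====
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Sketch only: the allocator would track the peak in-use size and when it
   was last reached, and trim only when both conditions above hold.  */
static size_t peak_bytes;   /* highest in-use heap size seen so far */
static time_t peak_time;    /* when that peak was last reached */

static bool
should_trim (size_t in_use_bytes)
{
  const double shrink_factor = 0.5; /* "significantly smaller": example value */
  const double min_seconds = 5.0;   /* "enough time passed": example value */

  return in_use_bytes < (size_t) (peak_bytes * shrink_factor)
         && difftime (time (NULL), peak_time) >= min_seconds;
}
====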
Comment 3 Florian Weimer 2023-07-11 12:15:39 UTC
(In reply to Askar Safin from comment #2)
> Please return memory to the OS only if the following two conditions hold at
> the same time:
> - Allocated memory is significantly smaller than the latest peak

I think this is true in both cases in the reproducer?  Doesn't the memory drop by one third of the peak heap size (twice)?

> - Enough time has passed since the latest peak

That can cause a lot of memory to be left behind if the program goes to sleep, as I tried to explain. In a multi-threaded program, we could perhaps reclaim the memory from a service thread, but in a single-threaded program this isn't possible.
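
For the multi-threaded case, the idea would be roughly the following sketch (trim_thread and the interval are made up; malloc_trim is the existing glibc interface):

====
#include <malloc.h>
#include <pthread.h>
#include <unistd.h>

/* Sketch only: a hypothetical "service thread" that periodically asks the
   allocator to release free memory back to the kernel, so that free()
   itself would not have to trim eagerly.  */
static void *
trim_thread (void *arg)
{
  (void) arg;
  for (;;)
    {
      sleep (5);        /* arbitrary interval */
      malloc_trim (0);  /* release whatever can be released */
    }
  return NULL;
}

/* Started once by the application, e.g.:
   pthread_t tid;
   pthread_create (&tid, NULL, trim_thread, NULL);  */
====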