Small adjustments to the placement of malloc/free calls slow the code down by 1.6x. It seems I have found some worst-case behavior in glibc's allocator. Here is the code:

====
#define HASH_VEC_ITEM_SIZE (3 * 8)
#define BUF_SIZE 4194304

#include <stdlib.h>
#include <stddef.h>
#include <string.h>

#define ASSERT(cond) do { if (!(cond)) abort(); } while (0)

struct vec {
    unsigned char *data;
    size_t size;
    size_t capacity; // capacity >= size
};

// We emulate Rust's Vec::push
void push(struct vec *v, size_t item_size, unsigned char *new_data) {
    ASSERT(v->capacity >= v->size);
    if (v->size + item_size <= v->capacity) {
        memcpy(v->data + v->size, new_data, item_size);
        v->size += item_size;
        return;
    }
    v->capacity *= 2;
    if (v->capacity < v->size + item_size) {
        v->capacity = v->size + item_size;
    }
    v->data = realloc(v->data, v->capacity);
    memcpy(v->data + v->size, new_data, item_size);
    v->size += item_size;
    ASSERT(v->capacity >= v->size);
}

// To prevent optimization
// https://boringssl.googlesource.com/boringssl/+/e38cf79cdf47606f6768fb85dc066d7ebce304ac/crypto/internal.h#281
void black_box(unsigned char *arg) __attribute__((noinline));
void black_box(unsigned char *arg) {
    asm volatile("" : "+r"(arg) :);
    asm volatile("" : "+r"(arg[0]) :);
}

int main() {
    struct vec hash_vec = {
        .data = malloc(HASH_VEC_ITEM_SIZE),
        .size = 0,
        .capacity = HASH_VEC_ITEM_SIZE
    };
    for (int n = 0; n != 100; ++n) {
        unsigned char *buf = calloc(BUF_SIZE, 1);
        for (int i = 0; i != 5; ++i) {
            unsigned char *buf_clone = malloc(BUF_SIZE);
            memcpy(buf_clone, buf, BUF_SIZE);
            black_box(buf_clone);
            free(buf_clone);
        }
        calloc(2, 1); // We don't free this memory, we free everything else
        free(buf); // bad placement
        unsigned char new_item[HASH_VEC_ITEM_SIZE] = {0};
        push(&hash_vec, HASH_VEC_ITEM_SIZE, new_item);
        //free(buf); // good placement
    }
    free(hash_vec.data);
}
====

Compile like this: "gcc -O3 -o /tmp/t /tmp/t.c"

"/lib64/ld-linux-x86-64.so.2 --version" output:

====
ld.so (Debian GLIBC 2.36-9) stable release version 2.36.
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
====

gcc is 12.3.0. Everything runs on Debian sid x86_64, in a Docker container on Linux 5.10.0.

With "free(buf)" at the "good placement" the code above runs in 0.17 s; at the "bad placement" it runs in 0.28 s, i.e. 1.6x slower.

This bug was triggered by a real production case. Here is the context: https://github.com/rust-lang/rust/issues/113504 . @saethlin reduced the example further here: https://github.com/rust-lang/rust/issues/113504#issuecomment-1627852074 and then I reduced it even more, to C, in this (glibc) bug report.

So, small adjustments to the placement of "free" slow the code down significantly; I think this is a bug.

@saethlin also adds: "Based on profiling, the difference seems attributable to a 44x (!!) difference in the number of page faults between the two implementations. If I swap in jemalloc or mimalloc, the difference in runtime and page faults goes away. So I strongly suspect that this code is generating some worst-case behavior in glibc's allocator"
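To observe the page-fault difference @saethlin mentions without a full profiler, one can compare getrusage counters around the benchmark loop. This helper is my own sketch, not part of the original report; getrusage and ru_minflt are standard POSIX/Linux APIs.

====
#include <stdio.h>
#include <sys/resource.h>

/* Returns the number of minor page faults this process has taken so far. */
static long minor_faults(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

/* Usage sketch: bracket the 100-iteration loop in main() with this:
 *   long before = minor_faults();
 *   ... run the loop ...
 *   printf("minor faults: %ld\n", minor_faults() - before);
 */
====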
The bad placement results in many brk calls (about three per loop iteration). This behavior is still present in the current sources. The trimming heuristics are not unreasonable in this case, given how large the BUF_SIZE allocation is. What you call “good placement” prevents heap trimming; others have complained that we don't release memory in the middle of the heap to the operating system, and they would call this “bad placement”. We just don't know whether the program will go to sleep after the free call, or enter some longer computation that doesn't call malloc/free, so if free detects that a certain amount of memory could be returned to the kernel, it does so.
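As an illustration of this knob (my addition, not part of the original reply): an application that knows its allocation pattern can raise the trim threshold so that free stops shrinking the heap via brk. mallopt and M_TRIM_THRESHOLD are documented glibc APIs in <malloc.h>. This assumes the 4 MiB buffers are being served from the main sbrk heap, which the repeated brk calls suggest.

====
#include <malloc.h>

/* Sketch: set the trim threshold above BUF_SIZE (4 MiB) so that freeing
 * buf no longer causes the heap to be trimmed back to the kernel on
 * every iteration. Setting it explicitly also disables glibc's dynamic
 * adjustment of this threshold. Returns 1 on success, 0 on failure. */
static int raise_trim_threshold(void) {
    return mallopt(M_TRIM_THRESHOLD, 8 * 1024 * 1024);
}
====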
Please return memory to the OS only if the following two conditions hold at the same time (see the sketch after this list):
- the allocated memory is significantly smaller than the latest peak, and
- enough time has passed since the latest peak.
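Roughly, the proposed policy could look like the following; all names and thresholds here are hypothetical, and this is a sketch of the idea, not glibc code.

====
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

static size_t peak_bytes;  /* highest heap usage observed so far */
static time_t peak_time;   /* when that peak was last reached */

/* Trim only when usage is well below the peak AND the peak is old enough. */
static bool should_trim(size_t current_bytes, double min_ratio,
                        double min_seconds) {
    if (current_bytes >= peak_bytes) {
        peak_bytes = current_bytes;
        peak_time  = time(NULL);
        return false;
    }
    bool far_below_peak =
        current_bytes < (size_t)((double)peak_bytes * min_ratio);
    bool peak_is_old = difftime(time(NULL), peak_time) >= min_seconds;
    return far_below_peak && peak_is_old;
}
====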
(In reply to Askar Safin from comment #2)
> Please return memory to the OS only if the following two conditions hold
> at the same time:
> - the allocated memory is significantly smaller than the latest peak

I think this is true in both cases in the reproducer? Doesn't the memory drop by one third of the peak heap size (twice)?

> - enough time has passed since the latest peak

That can cause a lot of memory to be left behind if the program goes to sleep, as I tried to explain. In a multi-threaded program, we could perhaps reclaim the memory from a service thread, but in a single-threaded program this isn't possible.
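To illustrate the service-thread idea (my sketch, not something glibc does today): a background thread could periodically hand free heap memory back to the kernel using glibc's documented malloc_trim function. The one-second interval and zero pad are arbitrary choices.

====
#include <malloc.h>
#include <pthread.h>
#include <unistd.h>

/* Sketch of a reclamation service thread: wake up periodically and
 * release free heap memory back to the kernel. Compile with -pthread. */
static void *trim_service(void *arg) {
    (void)arg;
    for (;;) {
        sleep(1);        /* arbitrary reclamation interval */
        malloc_trim(0);  /* pad = 0: release as much as possible */
    }
    return NULL;
}

/* Usage: pthread_t t; pthread_create(&t, NULL, trim_service, NULL); */
====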