Created attachment 15156 [details]
example.c

Hi,

I recently encountered poor malloc/free performance while building a data-intensive application: the deserialization library we use ran 10x slower than expected. Investigation showed that this is because arena_get2 uses __get_nprocs_sched instead of __get_nprocs. Without any change to core affinity, this call returns the real number of cores, so the upper limit on the total number of arenas is set correctly. However, if a thread is pinned to a core, subsequent malloc calls only see n = 1, because the function counts only schedulable cores; the maximum number of arenas therefore becomes 8 on 64-bit platforms. This leads to arena lock contention between threads when:

- The program spans multiple cores (say, more than 8 cores).
- Threads are pinned to cores before making any malloc calls, so they have not yet attached to any arena.
- Later memory allocations are served from the arenas.
- The MALLOC_ARENA_MAX tunable is not set to manually raise the limit.

A mailing-list thread briefly discussed this issue last year:
https://sourceware.org/pipermail/libc-alpha/2022-June/140123.html

However, it did not include a program that can easily reproduce the (un)expected behavior. Here I would like to provide a minimal example that exposes the problem and, if possible, to initiate further discussion about whether the core counting in arena_get can be better implemented.

The program accepts three arguments: the number of cores, whether each thread is pinned to a core right after its creation, and whether to apply a small "fix". The fix adds a free(malloc(8)) right before we set the affinity in each thread; at that point each thread can still see all the cores, so it can create and attach to a "local" arena that is not shared. The output is the average time each thread takes to finish a batch of malloc/free calls.
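For reference, the essence of the attached reproducer's "fix" can be sketched as follows. This is a minimal illustration, not the attached example.c itself; the function names (worker, run_worker) and the iteration counts are mine. The key point is the order of operations: one throwaway malloc/free before pthread_setaffinity_np lets the thread attach to an arena while the arena limit is still computed from all cores.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>

/* Illustrative worker: with apply_fix set, the thread touches malloc
   once BEFORE pinning itself, so it attaches to an arena while
   __get_nprocs_sched still reports every core. */
static void *worker(void *arg) {
    int apply_fix = *(int *)arg;

    if (apply_fix)
        free(malloc(8));   /* the "fix": attach to an arena pre-pinning */

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);      /* pin this thread to core 0 */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* Allocations after pinning; without the fix these compete for
       one of at most 8 arenas on affected glibc versions. */
    for (int i = 0; i < 1000; i++)
        free(malloc(64));
    return NULL;
}

/* Spawn one such thread and wait for it; returns 0 on success. */
int run_worker(int apply_fix) {
    pthread_t t;
    if (pthread_create(&t, NULL, worker, &apply_fix) != 0)
        return -1;
    return pthread_join(t, NULL);
}
```

Compile with -pthread. The attached example.c additionally spawns many such threads and times the malloc/free loop in each.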
The following results were collected on my PC with a 16-core Ryzen 9 5950X, running Linux kernel 6.5.5 and glibc 2.38. The program was compiled with gcc 13.2.1 without optimization flags.

./a.out 32 false false
---
nr_cpu: 32 pin: no fix: no
thread average (ms): 16.233663

./a.out 32 true false
---
nr_cpu: 32 pin: yes fix: no
thread average (ms): 1360.919047

./a.out 32 true true
---
nr_cpu: 32 pin: yes fix: yes
thread average (ms): 15.505453

env GLIBC_TUNABLES='glibc.malloc.arena_max=32' ./a.out 32 true false
---
nr_cpu: 32 pin: yes fix: no
thread average (ms): 16.036667

I also recorded a few runs with perf, which showed massive overhead in __lll_lock_wait_private and __lll_lock_wake_private.
*** Bug 29296 has been marked as a duplicate of this bug. ***
Fixed in 2.39.
Can we backport this to the other release branches, please? We also ran into this issue with OpenMP; see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113698.
I backported it to 2.34, 2.35, 2.36, 2.37, and 2.38.
Thanks, Adhemerval, for the backports.