Bug 30945 - Core affinity setting incurs lock contentions between threads
Status: RESOLVED FIXED
Alias: None
Product: glibc
Classification: Unclassified
Component: malloc
Version: 2.38
Importance: P2 normal
Target Milestone: 2.39
Assignee: Adhemerval Zanella
Duplicates: 29296
Reported: 2023-10-06 00:24 UTC by Chen Chen
Modified: 2024-02-12 21:49 UTC



Attachments
example.c (919 bytes, text/x-csrc)
2023-10-06 00:24 UTC, Chen Chen

Description Chen Chen 2023-10-06 00:24:43 UTC
Created attachment 15156
example.c

Hi,

I recently encountered poor malloc/free performance while building a
data-intensive application. The deserialization library we use runs 10x
slower than expected. Investigation shows that this is because the arena_get2
function uses __get_nprocs_sched instead of __get_nprocs. Without changing core
affinity settings, this call returns the real number of cores, so the upper
limit on the total number of arenas is set correctly. However, if a thread is
pinned to a core, subsequent malloc calls only see n = 1 because the function
returns only the schedulable cores. As a result, the maximum number of arenas
will be 8 on 64-bit platforms.

This leads to arena lock contentions between threads if:

- The program spans multiple cores (say, more than 8 cores).
- Threads are pinned to cores before any malloc calls, so they have not
  attached to any arenas.
- Later memory allocations are served from the arenas.
- The MALLOC_ARENA_MAX tunable is not set to manually raise the limit.

A mail thread briefly discussed this issue last year:
https://sourceware.org/pipermail/libc-alpha/2022-June/140123.html
However, it did not include a program that can be used to easily reproduce the
(un)expected behavior. Here I would like to provide a minimal example that
will expose the problem and, if possible, to initiate further discussion about
whether the core counting in arena_get can be better implemented.

The program accepts 3 arguments. The first one is the number of cores, the
second one is whether each thread is pinned to a core right after its creation,
and the third one is whether to apply a small "fix". The fix is to add a
free(malloc(8)) right before we set the affinity in each thread. In that case
each thread still sees all the cores, so it can create and attach to a "local"
arena that is not shared. The output is the average time each thread takes to
finish a batch of malloc/free calls.

The following are the results I collected on my PC with a 16-core Ryzen 9
5950X, running Linux kernel 6.5.5 and glibc 2.38. The program was compiled
with gcc 13.2.1 without optimization flags.

    ./a.out 32 false false
    ---
    nr_cpu: 32 pin: no fix: no
    thread average (ms): 16.233663

    ./a.out 32 true false
    ---
    nr_cpu: 32 pin: yes fix: no
    thread average (ms): 1360.919047

    ./a.out 32 true true
    ---
    nr_cpu: 32 pin: yes fix: yes
    thread average (ms): 15.505453

    env GLIBC_TUNABLES='glibc.malloc.arena_max=32' ./a.out 32 true false
    ---
    nr_cpu: 32 pin: yes fix: no
    thread average (ms): 16.036667

I also recorded a few runs with perf. It suggested massive overhead in
__lll_lock_wait_private and __lll_lock_wake_private calls.
Comment 1 Adhemerval Zanella 2023-10-11 16:11:16 UTC
*** Bug 29296 has been marked as a duplicate of this bug. ***
Comment 2 Adhemerval Zanella 2023-11-22 14:19:01 UTC
Fixed on 2.39.
Comment 3 Kugan Vivekanandarajah 2024-02-11 22:04:04 UTC
Can we backport this to other release branches, please? We also ran into this issue with OpenMP. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113698.
Comment 4 Adhemerval Zanella 2024-02-12 13:24:41 UTC
I backported it to 2.34, 2.35, 2.36, 2.37, and 2.38.
Comment 5 Kugan Vivekanandarajah 2024-02-12 21:49:50 UTC
Thanks Adhemerval for the backports.