RFC: NUMA spinlock


On multi-socket systems, memory is shared across the entire system.
Data access to the local socket is much faster than to a remote
socket, and data already in the local core's cache is accessed faster
than data held by sibling cores on the same socket.  For serialized
workloads using a conventional spinlock, when spinlock contention
between threads is high, lock ping-pong among sockets becomes the
bottleneck and the process spends the majority of its time in
spinlock overhead.

On multi-socket systems, the keys to NUMA spinlock performance are to
minimize cross-socket traffic and to localize the serialized workload
to one core for execution.  The NUMA spinlock is built on the
following approaches, which reduce data movement and accelerate the
critical section, giving a significant performance improvement.

1. MCS spinlock
The MCS spinlock helps us reduce useless lock cache-line movement
while threads are spinning.  This paper provides a good description
of this kind of lock:
<http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf>
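
For illustration, a minimal MCS queue lock might look like the sketch
below (C11 atomics; mcs_lock/mcs_node and the function names are ours,
not part of the proposed interface):

/* Each waiter spins on a flag in its own queue node, so the lock word
   itself does not bounce between caches while threads wait.  */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node
{
  _Atomic (struct mcs_node *) next;
  atomic_bool locked;
};

struct mcs_lock
{
  _Atomic (struct mcs_node *) tail;
};

static void
mcs_lock_acquire (struct mcs_lock *lock, struct mcs_node *self)
{
  atomic_store_explicit (&self->next, NULL, memory_order_relaxed);
  atomic_store_explicit (&self->locked, true, memory_order_relaxed);
  /* Append ourselves to the waiter queue.  */
  struct mcs_node *prev
    = atomic_exchange_explicit (&lock->tail, self, memory_order_acq_rel);
  if (prev != NULL)
    {
      /* Link behind the previous waiter and spin on our own flag.  */
      atomic_store_explicit (&prev->next, self, memory_order_release);
      while (atomic_load_explicit (&self->locked, memory_order_acquire))
        ;
    }
}

static void
mcs_lock_release (struct mcs_lock *lock, struct mcs_node *self)
{
  struct mcs_node *next
    = atomic_load_explicit (&self->next, memory_order_acquire);
  if (next == NULL)
    {
      /* No visible successor: try to reset the queue to empty.  */
      struct mcs_node *expected = self;
      if (atomic_compare_exchange_strong_explicit
          (&lock->tail, &expected, NULL,
           memory_order_acq_rel, memory_order_acquire))
        return;
      /* A successor is enqueueing; wait until it has linked itself.  */
      while ((next = atomic_load_explicit (&self->next,
                                           memory_order_acquire)) == NULL)
        ;
    }
  /* Hand the lock directly to the next waiter.  */
  atomic_store_explicit (&next->locked, false, memory_order_release);
}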

2. Critical Section Integration (CSI)
Essentially, a spinlock serializes execution as if one core completed
the critical sections one by one.  So when contention happens, the
serialized work is sent to the core that owns the lock, which executes
it on behalf of the other threads.  This saves much time and power,
because all the shared data are located in the private cache of the
lock owner.

We implemented this mechanism on top of a queued spinlock, as in the
kernel; it speeds up the critical section and reduces the probability
of contention.  This paper provides a good description of this kind
of lock:
<https://users.ece.cmu.edu/~omutlu/pub/acs_asplos09.pdf>
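
To illustrate the idea, here is a minimal sketch of such delegation
(in the style of flat combining); the csi_* names and the single
request list are ours, not the actual implementation:

/* A contending thread publishes its critical section as (function,
   argument) instead of taking the lock; the current lock owner drains
   the list and runs the requests locally, so the shared data stays in
   the owner's private cache.  */
#include <stdatomic.h>
#include <stddef.h>

struct csi_request
{
  void *(*workload) (void *);            /* Critical section to run.  */
  void *argument;                        /* Its argument.  */
  void *result;                          /* Filled in by the owner.  */
  _Atomic (struct csi_request *) next;   /* Next pending request.  */
  atomic_int pending;                    /* 1 until the owner is done.  */
};

/* Waiter side: publish a request and spin until the owner has run it.  */
static void *
csi_delegate (_Atomic (struct csi_request *) *queue,
              struct csi_request *req)
{
  atomic_store_explicit (&req->pending, 1, memory_order_relaxed);
  struct csi_request *head
    = atomic_load_explicit (queue, memory_order_relaxed);
  do
    atomic_store_explicit (&req->next, head, memory_order_relaxed);
  while (!atomic_compare_exchange_weak_explicit (queue, &head, req,
                                                 memory_order_release,
                                                 memory_order_relaxed));
  while (atomic_load_explicit (&req->pending, memory_order_acquire))
    ;
  return req->result;
}

/* Owner side: drain the pending requests and execute each critical
   section on behalf of the waiting threads.  */
static void
csi_combine (_Atomic (struct csi_request *) *queue)
{
  struct csi_request *req
    = atomic_exchange_explicit (queue, NULL, memory_order_acquire);
  while (req != NULL)
    {
      struct csi_request *next
        = atomic_load_explicit (&req->next, memory_order_relaxed);
      req->result = req->workload (req->argument);
      /* Tell the waiting thread its work has been done.  */
      atomic_store_explicit (&req->pending, 0, memory_order_release);
      req = next;
    }
}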

3. NUMA Aware Spinlock (NAS)
Multi-socket systems currently give us better performance per watt,
but they also involve more complex synchronization requirements,
because off-chip data movement is much slower.  We use a distributed
synchronization mechanism to decrease movement of the lock cache line
to and from different nodes.  This paper provides a good description
of this kind of lock:
<https://www.usenix.org/system/files/conference/atc17/atc17-kashyap.pdf>

4. Yield Schedule
When threads apply for Critical Section Integration (CSI) under known
contention, they delegate their work to the thread that owns the lock
and wait for the work to be completed.  The processor resources they
occupy while waiting should be handed over to other threads.  To
accelerate this scenario, we introduce a yield_sched function that is
called during the spinning stage.
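
A minimal sketch of such a spin-wait (using sched_yield as the yield
primitive; the actual yield_sched implementation may differ):

#include <sched.h>
#include <stdatomic.h>

/* Spin until the lock owner has finished the delegated work, yielding
   the processor on each iteration so the owner gets more cycles.  */
static void
spin_wait_yielding (atomic_int *pending)
{
  while (atomic_load_explicit (pending, memory_order_acquire))
    sched_yield ();
}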

5. Optimization when NUMA is ON or OFF
Although programs can access memory with lower latency when NUMA is
enabled, some programs may need more memory bandwidth for computation
with NUMA disabled.  We also optimize for multi-socket systems with
NUMA disabled.

NUMA spinlock flow chart (assuming there are 2 CPU nodes):

1. Threads from node_0/node_1 acquire the local lock for node_0/node_1
respectively.  If a thread succeeds in acquiring the local lock, it
goes to step 2; otherwise it pushes its critical function onto the
local work queue and enters the spinning stage in MCS mode.

2. The thread from node_0/node_1 then tries to acquire the global
lock.  If it succeeds in acquiring the global lock, it becomes the
lock owner and goes to step 3; otherwise it waits until the lock
owner thread releases the global lock.

3. The lock owner thread from node_0/node_1 enters the critical
section, drains the work queue by performing, via CSI, all the local
critical functions pushed at step 1 on behalf of the other threads,
and informs those spinning threads that their work has been done.
It then releases the local lock.

4. The lock owner thread then frees the global lock.  If another
thread is waiting at step 2, the lock owner passes the global lock to
the waiting thread and returns; the new lock owner enters step 3.  If
no thread is waiting, the lock owner releases the global lock and
returns.  The whole critical section process is complete.
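
The acquire path described above can be summarized in the following
sketch; the helper functions are placeholders for steps 1-4 of the
flow chart, not actual library entry points:

/* Per-node state (local MCS-style lock plus a queue of delegated
   critical functions) and the lock shared by all nodes.  */
struct node_queue;
struct global_lock;

/* Hypothetical helpers corresponding to the steps above.  The first
   returns nonzero if the caller became the local lock owner, or zero
   once the owner has executed the delegated work and stored its
   return value in *result.  */
extern int local_lock_try_or_enqueue (struct node_queue *,
                                      void *(*workload) (void *),
                                      void *argument, void **result);
extern void global_lock_acquire (struct global_lock *);
extern void run_local_queue (struct node_queue *);
extern void local_lock_release (struct node_queue *);
extern void global_lock_release_or_handoff (struct global_lock *);

static void *
numa_lock_apply (struct global_lock *global, struct node_queue *local,
                 void *(*workload) (void *), void *argument)
{
  void *result;

  /* Step 1: try to become the local (per-node) lock owner; otherwise
     push the critical function onto the local work queue and spin in
     MCS mode until the owner has executed it.  */
  if (!local_lock_try_or_enqueue (local, workload, argument, &result))
    return result;                /* Work was done by the lock owner.  */

  /* Step 2: compete for the global lock against at most one thread
     from each of the other nodes.  */
  global_lock_acquire (global);

  /* Step 3: run our own critical section plus everything queued
     locally at step 1 (CSI), then release the local lock.  */
  result = workload (argument);
  run_local_queue (local);
  local_lock_release (local);

  /* Step 4: pass the global lock to a waiting node, if any, or
     release it.  */
  global_lock_release_or_handoff (global);
  return result;
}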

Steps 1 and 2 mitigate global lock contention: only one thread from
each node competes for the global lock in step 2.  Step 3 reduces
movement of the global lock and shared data because they stay on the
same node, and even on the same core.  Our data show that Critical
Section Integration (CSI) improves data locality and that the
NUMA-aware spinlock (NAS) helps CSI balance the workload.

NUMA spinlock can greatly speed up critical sections and should
improve spinlock performance on all multi-socket systems.

We implemented NUMA spinlock in a standalone numa-spinlock
library:

https://gitlab.com/numa-spinlock/numa-spinlock

with the following interface:

/* The NUMA spinlock.  */
struct numa_spinlock;

/* The NUMA spinlock information for each thread.  */
struct numa_spinlock_info
{
  /* The workload function of this thread.  */
  void *(*workload) (void *);
  /* The argument pointer passed to the workload function.  */
  void *argument;
  /* The return value of the workload function.  */
  void *result;
  /* The pointer to the NUMA spinlock.  */
  struct numa_spinlock *lock;
  /* The next thread on the local NUMA spinlock thread list.  */
  struct numa_spinlock_info *next;
  /* The NUMA node number.  */
  unsigned int node;
  /* Non-zero to indicate that the thread wants the NUMA spinlock.  */
  int pending;
  /* Reserved for future use.  */
  void *__reserved[4];
};

/* Return a pointer to a newly allocated NUMA spinlock.  */
extern struct numa_spinlock *init_numa_spinlock (void);

/* Free the memory space of the NUMA spinlock.  */
extern void release_numa_spinlock (struct numa_spinlock *);

/* Initialize the NUMA spinlock information block.  */
extern int init_numa_spinlock_info (struct numa_spinlock *,
    struct numa_spinlock_info *);

/* Apply for spinlock with a NUMA spinlock information block.  */
extern void apply_numa_spinlock (struct numa_spinlock_info *);
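
A possible way to use this interface is sketched below; the header
name and the assumption that apply_numa_spinlock returns only after
the workload has run (with its return value in the result field) are
ours, not statements about the library:

#include <pthread.h>
#include <stdio.h>
#include "numa-spinlock.h"      /* Assumed header from the library.  */

static long counter;            /* Protected by the NUMA spinlock.  */
static long delta = 1;

/* The critical section; it may be executed by the lock-owner thread
   on behalf of the caller (CSI).  */
static void *
increment (void *argument)
{
  counter += *(long *) argument;
  return NULL;
}

static void *
thread_func (void *argument)
{
  struct numa_spinlock *lock = argument;
  struct numa_spinlock_info info;

  if (init_numa_spinlock_info (lock, &info) != 0)
    return NULL;
  info.workload = increment;
  info.argument = &delta;
  apply_numa_spinlock (&info);  /* Returns after increment has run.  */
  return info.result;
}

int
main (void)
{
  struct numa_spinlock *lock = init_numa_spinlock ();
  pthread_t threads[4];

  for (int i = 0; i < 4; i++)
    pthread_create (&threads[i], NULL, thread_func, lock);
  for (int i = 0; i < 4; i++)
    pthread_join (threads[i], NULL);

  release_numa_spinlock (lock);
  printf ("counter = %ld\n", counter);
  return 0;
}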

We compared NUMA spinlock performance against the glibc spinlock
with 3 different workloads.  NUMA spinlock has the best performance,
running up to 35x faster in some cases.

We are planning to submit a glibc patch to integrate NUMA spinlock
into libpthread.

We'd like to get feedback on

1. NUMA spinlock API.
2. NUMA spinlock performance on your favorite multi-socket systems.

Thanks.

-- 
H.J.

