This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH] NUMA spinlock [BZ #23962]


On Thu, 2019-01-03 at 16:21 -0500, Rich Felker wrote:
> On Thu, Jan 03, 2019 at 12:54:18PM -0800, H.J. Lu wrote:
> > On Thu, Jan 3, 2019 at 12:43 PM Rich Felker <dalias@libc.org> wrote:
> > > 
> > > On Wed, Dec 26, 2018 at 10:50:19AM +0800, Ma Ling wrote:
> > > > From: "ling.ma" <ling.ml@antfin.com>
> > > > 
> > > > On multi-socket systems, memory is shared across the entire system.
> > > > Data access to the local socket is much faster than to a remote
> > > > socket, and data access to the local core is faster than to sibling
> > > > cores on the same socket.  For serialized workloads with a
> > > > conventional spinlock, when there is high spinlock contention between
> > > > threads, lock ping-pong among sockets becomes the bottleneck and
> > > > threads spend the majority of their time in spinlock overhead.
> > > > 
> > > > On multi-socket systems, the keys to our NUMA spinlock performance
> > > > are to minimize cross-socket traffic and to localize the serialized
> > > > workload to one core for execution.  The basic principles of the NUMA
> > > > spinlock mainly consist of the following approaches, which reduce
> > > > data movement and accelerate the critical section, ultimately giving
> > > > us a significant performance improvement.
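
For reference, this general idea is usually realized as a hierarchical
("cohort"-style) lock: threads first compete for a per-socket lock, and
only the per-socket winner competes for the global lock, so the globally
shared cache line ping-pongs between sockets far less.  The sketch below
only illustrates that structure; it is not the code from this patch, and
MAX_NODES, the per-node padding, and the use of getcpu () to find the
current NUMA node are assumptions made for the example.

#define _GNU_SOURCE
#include <stdatomic.h>
#include <sched.h>              /* getcpu () */

#define MAX_NODES 8

struct numa_lock
{
  /* Contended across sockets, but only by one thread per socket.  */
  atomic_int global;
  /* Contended only within one socket; padded to avoid false sharing.  */
  struct { _Alignas (64) atomic_int lock; } node[MAX_NODES];
};

static inline void
spin_acquire (atomic_int *l)
{
  while (atomic_exchange_explicit (l, 1, memory_order_acquire) != 0)
    while (atomic_load_explicit (l, memory_order_relaxed) != 0)
      ;                         /* Insert a pause/yield hint here.  */
}

void
numa_lock_acquire (struct numa_lock *nl, unsigned int *nodep)
{
  unsigned int cpu, node = 0;
  getcpu (&cpu, &node);         /* Assumption: NUMA node == socket.  */
  *nodep = node;

  /* Win the per-socket lock first; only cores on this socket touch it.  */
  spin_acquire (&nl->node[node].lock);
  /* Only one thread per socket ever spins on the global lock.  */
  spin_acquire (&nl->global);
}

void
numa_lock_release (struct numa_lock *nl, unsigned int node)
{
  /* A full cohort lock would hand the global lock directly to a waiter
     on the same socket; this sketch simply releases both levels.  */
  atomic_store_explicit (&nl->global, 0, memory_order_release);
  atomic_store_explicit (&nl->node[node].lock, 0, memory_order_release);
}
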
> > > 
> > > I question whether this belongs in glibc. It seems highly application-
> > > and kernel-specific. Is there a reason you wouldn't prefer to
> > > implement and maintain it in a library for use in the kind of
> > > application that needs it?
> > 
> > This is a good question.  On the other hand,  the current spinlock
> > in glibc hasn't been changed for many years.  It doesn't scale for
> > today's hardware.  Having a scalable spinlock in glibc is desirable.
> 
> "Scalable spinlock" is something of an oxymoron.

No, that's not true at all.  Most high-performance shared-memory
synchronization constructs (on the typical hardware we have today) will do
some kind of spinning (and back-off), and there's nothing wrong with that.
This can scale very well.
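
To make that concrete, here is a minimal sketch of a test-and-test-and-set
spinlock with exponential back-off (this is not glibc's pthread_spin_lock,
and the MIN_DELAY/MAX_DELAY bounds are arbitrary values picked for the
example): waiters spin on plain loads of their cached copy of the lock
word and spread their retries out over time, which is what keeps the
cache-coherence traffic bounded under contention.

#include <stdatomic.h>

#define MIN_DELAY 8
#define MAX_DELAY 1024

static inline void
cpu_relax (void)
{
#if defined __x86_64__ || defined __i386__
  __builtin_ia32_pause ();      /* The x86 PAUSE hint.  */
#else
  atomic_signal_fence (memory_order_seq_cst);  /* At least a compiler barrier.  */
#endif
}

void
spin_lock_backoff (atomic_int *lock)
{
  unsigned int delay = MIN_DELAY;
  for (;;)
    {
      /* Attempt the atomic read-modify-write only when the lock looks
         free; otherwise waiters just read a shared cache line.  */
      if (atomic_load_explicit (lock, memory_order_relaxed) == 0
          && atomic_exchange_explicit (lock, 1, memory_order_acquire) == 0)
        return;

      /* Exponential back-off: wait longer after each failed attempt so
         the waiters do not all retry in lock-step.  Real implementations
         also add some randomization ("jitter") to the delay.  */
      for (unsigned int i = delay; i > 0; i--)
        cpu_relax ();
      if (delay < MAX_DELAY)
        delay *= 2;
    }
}

void
spin_unlock_backoff (atomic_int *lock)
{
  atomic_store_explicit (lock, 0, memory_order_release);
}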

> Spinlocks are for
> situations where contention is extremely rare,

No, the question is rather whether or not the program needs blocking
through the OS (for performance, or for semantics such as priority
inheritance).  Energy may be another factor.  For example, glibc's current
mutexes don't scale well on short critical sections because there's not
enough spinning being done.

In particular, in cases where there aren't more threads than cores (i.e.,
what lots of high-performance parallel applications will ensure), it's
better to just spin (and back off) than to eagerly block using the OS.
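
To illustrate the trade-off, here is a rough sketch of the usual
spin-then-block pattern (essentially the classic futex lock from Drepper's
"Futexes Are Tricky" with a spin phase in front; the spin count of 100 is
an arbitrary value for the example, and none of this is copied from nptl):

#define _GNU_SOURCE
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Lock states: 0 = free, 1 = locked, 2 = locked with (possible) waiters.  */
#define SPIN_TRIES 100

static long
futex_wait (atomic_int *addr, int expected)
{
  return syscall (SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected, NULL, NULL, 0);
}

static long
futex_wake (atomic_int *addr, int n)
{
  return syscall (SYS_futex, addr, FUTEX_WAKE_PRIVATE, n, NULL, NULL, 0);
}

void
lock_spin_then_block (atomic_int *lock)
{
  /* Spin briefly first: if the owner is running on another core and the
     critical section is short, we avoid the syscall entirely.  */
  for (int i = 0; i < SPIN_TRIES; i++)
    {
      int expected = 0;
      if (atomic_compare_exchange_weak_explicit (lock, &expected, 1,
                                                 memory_order_acquire,
                                                 memory_order_relaxed))
        return;
    }

  /* Still contended (or we are oversubscribed): announce that there are
     waiters and block in the kernel until the lock is released.  */
  while (atomic_exchange_explicit (lock, 2, memory_order_acquire) != 0)
    futex_wait (lock, 2);
}

void
unlock_spin_then_block (atomic_int *lock)
{
  /* Wake one blocked waiter, but only if somebody may be asleep.  */
  if (atomic_exchange_explicit (lock, 0, memory_order_release) == 2)
    futex_wake (lock, 1);
}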

> since they inherently
> blow up badly under contention.

Did I mention back-off before? ;)

> If this is happening it means you
> wanted a mutex not a spinlock.

There wouldn't be much sense in having different interfaces for the two if
it weren't for Pthreads mutexes unfortunately being dynamically typed.

I believe that a high-performance default lock (C++11 or C11 semantics,
non-process-shared) would beat both the spinlock and the mutex
implementations we have today.

We could tune the C11 mutex implementation, but the number of users would
still be small in the foreseeable future, I guess.  Tuning libstdc++'s
mutex implementation (so that it doesn't just use nptl mutexes but does
something that's closer to the state of the art) would reach more users.

