This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH] NUMA spinlock [BZ #23962]
- From: Torvald Riegel <triegel at redhat dot com>
- To: kemi <kemi dot wang at intel dot com>, Rich Felker <dalias at libc dot org>, "H.J. Lu" <hjl dot tools at gmail dot com>
- Cc: Ma Ling <ling dot ma dot program at gmail dot com>, GNU C Library <libc-alpha at sourceware dot org>, "Lu, Hongjiu" <hongjiu dot lu at intel dot com>, "ling.ma" <ling dot ml at antfin dot com>, Wei Xiao <wei3 dot xiao at intel dot com>
- Date: Mon, 04 Feb 2019 18:23:15 +0100
- Subject: Re: [PATCH] NUMA spinlock [BZ #23962]
- References: <20181226025019.38752-1-ling.ma@MacBook-Pro-8.local> <20190103204338.GU23599@brightrain.aerifal.cx> <CAMe9rOoBhZmzuEoPGjhbxkYZ3mOaC-8tPUrSuhmcbnu8J19LpA@mail.gmail.com> <20190103212113.GV23599@brightrain.aerifal.cx> <firstname.lastname@example.org> <email@example.com> <firstname.lastname@example.org> <email@example.com>
On Thu, 2019-01-17 at 11:05 +0800, kemi wrote:
> On 2019/1/15 下午8:36, Torvald Riegel wrote:
> > On Tue, 2019-01-15 at 10:28 +0800, kemi wrote:
> > > > > "Scalable spinlock" is something of an oxymoron.
> > > >
> > > > No, that's not true at all. Most high-performance shared-memory
> > > > synchronization constructs (on typical HW we have today) will do some kind
> > > > of spinning (and back-off), and there's nothing wrong about it. This can
> > > > scale very well.
> > > >
> > > > > Spinlocks are for
> > > > > situations where contention is extremely rare,
> > > >
> > > > No, the question is rather whether the program needs blocking through the
> > > > OS (for performance, or for semantics such as PI) or not. Energy may be
> > > > another factor. For example, glibc's current mutexes don't scale well on
> > > > short critical because there's not enough spinning being done.
> > > >
> > >
> > > yes. That's why we need pthread.mutex.spin_count tunable interface before.
> > I don't think we need the tunable interface before that. Where we need to
> > improve performance most is for applications that don't want to bother
> > tuning their mutexes -- that's where the broadest gains are overall, I
> > think.
> > In turn, that means that we have spinning and back-off that give good
> > average-case performance -- whether that's through automatic tuning of
> > those two things at runtime, or through static default values that we do
> > regular performance checks for in the glibc community.
> Spinning and proportional back-off with auto tuning has been proposed for several years
> and never got it merged in upstream kernel.
> IMHO, this is because MCS-style lock wins that battle.
> Could you tell me why we should consider backoff rather than MCS-style lock?
The kernel has no concerns over ABI stability of their locks, AFAIK. But
pthread_mutex_t is exposed to users, and there is the problem of process-
shared mutexes, so we can't just go to an external list.
Furthermore, if we had proper back-off, we could use it for other
synchronization primitives too (grep for comments containing "back-off").
Doe you have any data that would show that MCS-style locks always win on
> > From that perspective, the tunable interface is a nice addition that can
> > allow users to fine-tune the setting, but it's not how users would enable
> > it.
> > > But, that's not enough. When tunable is not the bottleneck, the simple busy-waiting
> > > algorithm of current adaptive mutex is the major negative factor which degrades mutex
> > > performance.
> > Note that I'm not advocating for focusing on just the adaptive mutex type.
> > IMO, adding this type was a mistake because whether to spin or not does not
> > affect semantics of the mutexes. Performance hints shouldn't be done via a
> > mutex' type, and all mutex implementations should consider to spin at least
> > a little.
> > If we just do something about the adaptive mutexes, then I guess this will
> > reach few users. I believe most applications just don't use them, and the
> > current implementation of adaptive mutexes is so simplistic that there's
> > not much performance to be had by changing to adaptive mutexes (which is
> > another reason for it having few users).
> Generally, I agree with you.
> May we tune adaptive mutex before applying these optimization to normal mutex.
That may be a way to start experimenting with it, as it gives you an
already existing "tunable". However, I would guess that it won't give you
wide testing because I would guess that most programs don't bother changing
their locks to be of the adaptive type.
> > > That's why I proposed to use MCS-based spinning-waiting algorithm for adaptive
> > > mutex.
> > MCS-style spinning (ie, spinning on memory local to the spinning thread) is
> > helpful, but I think we should tackle spinning on global memory first (ie,
> > on a location in the mutex, which is shared by all the threads trying to
> > acquire it). Of course, always including back-off.
> > > https://sourceware.org/ml/libc-alpha/2019-01/msg00279.html
> > >
> > > Also, if with very small critical section in the worklad, this new type of mutex
> > > with GNU extension PTHREAD_MUTEX_QUEUESPINNER_NP acts like MCS-spinlock, and performs
> > > much better than original spinlock.
> > I don't think we want to have a new type for that. It maybe useful for
> > experimenting with it, but it shouldn't be exposed to users as a stable
> > interface.
> I don't like to add a new type either.
> As I said in the commit log, that's a trade-off to avoid ABI changed.
> I am very glad to see that MCS-style lock can be used gracefully without
> introducing a new type.
Well, we'd need to introduce MCS-style locks in a way that does not change
the ABI. If that's possible and done, and performance improves, I don't
see a reason why this shouldn't be accepted.
> > Also, have you experimented with different kinds/settings of exponential
> > back-off? I just saw normal spinning in your implementation, no varying
> > amounts of back-off. The performance comparison should include back-off
> > though, as that's one way to work around the contention problems (with a
> > bigger hammer than local spinning of course, but can be effective
> > nonetheless, and faster in low-contention cases).
> I didn't try back-off, because we don't have to include it if MCS-style lock is used.
But you should at least try to get some decent performance, because it's a
competitor to MCS-style locking. Even if you can't get the same
performance as MCS in some cases, we still need to know so that we can
assess the pros and cons of choosing MCS.
Also, some initial spinning may be useful even with MCS, I think (eg, short
critical sections with just two threads but enough delay between critical
sections by each thread so that there doesn't need to be a lot of