- From: "Paul E. McKenney" <paulmck at linux dot ibm dot com>
- To: Alan Stern <stern at rowland dot harvard dot edu>
- Cc: David Goldblatt <davidtgoldblatt at gmail dot com>, mathieu dot desnoyers at efficios dot com, Florian Weimer <fweimer at redhat dot com>, triegel at redhat dot com, libc-alpha at sourceware dot org, andrea dot parri at amarulasolutions dot com, will dot deacon at arm dot com, peterz at infradead dot org, boqun dot feng at gmail dot com, npiggin at gmail dot com, dhowells at redhat dot com, j dot alglave at ucl dot ac dot uk, luc dot maranget at inria dot fr, akiyks at gmail dot com, dlustig at nvidia dot com, linux-arch at vger dot kernel dot org, linux-kernel at vger dot kernel dot org
- Date: Sun, 16 Dec 2018 10:51:30 -0800
- Subject: Re: [PATCH] Linux: Implement membarrier function
- References: <20181214184344.GW4170@linux.ibm.com> <Pine.LNX.4.44L0.1812141404450.1570-100000@iolanthe.rowland.org>
- Reply-to: paulmck@linux.ibm.com
On Fri, Dec 14, 2018 at 04:39:34PM -0500, Alan Stern wrote:
> On Fri, 14 Dec 2018, Paul E. McKenney wrote:
>
> > I would say that sys_membarrier() has zero-sized read-side critical
> > sections, comprising either a single instruction (as is the case for
> > synchronize_sched(), actually), preempt-disable regions of code
> > (which are irrelevant to userspace execution), or the spaces between
> > consecutive pairs of instructions (as is the case for the newer
> > IPI-based implementation).
> >
> > The model picks the single-instruction option, and I haven't yet found
> > a problem with this -- which is no surprise given that, as you say,
> > an actual implementation makes this same choice.
>
> I believe that for RCU tests the LKMM gives the same results for
> length-zero critical sections interspersed between all the instructions
> and length-one critical sections surrounding all instructions (except
> synchronize_rcu). But the proof is tricky and I haven't checked it
> carefully.
That assertion is completely consistent with my implementation experience,
give or take the usual caveats about idle and offline execution.
> > > > The other thing that took some time to get used to is the possibility
> > > > of long delays during sys_membarrier() execution, allowing significant
> > > > execution and reordering between different CPUs' IPIs. This was key
> > > > to my understanding of the six-process example, and probably needs to
> > > > be clearly called out, including in an example or two.
> > >
> > > In all the examples I'm aware of, no more than one of the IPIs
> > > generated by each sys_membarrier call really matters. (Of course,
> > > there's no way to know in advance which one it will be, so you have to
> > > send an IPI to every CPU.) The execution delays and reordering
> > > between different CPUs' IPIs don't appear to be significant.
> >
> > Well, there are litmus tests whose allowed executions are more easily
> > explained in terms of delays between different CPUs' IPIs, so this
> > seems worth keeping track of.
> >
> > There might be a litmus test that can tell the difference between
> > simultaneous and non-simultaneous IPIs, but I cannot immediately think of
> > one that matters. Might be a failure of imagination on my part, though.
>
> P0      P1      P2
> Wc=1    [mb01]  Rb=1
> memb    Wa=1    Rc=0
> Ra=0    Wb=1    [mb02]
>
> The IPIs have to appear in the positions shown, which means they cannot
> be simultaneous. The test is allowed because P2's reads can be
> reordered.
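For concreteness, that test might be rendered in litmus syntax roughly
as follows, using the smp_memb() primitive from your earlier patch.
The bracketed IPIs become comments, because their placement is chosen
by the implementation rather than written by the programmer:

C membarrier-not-simultaneous

{}

P0(int *a, int *c)
{
        int r0;

        WRITE_ONCE(*c, 1);
        smp_memb();     /* sends the IPIs marked [mb01] and [mb02] */
        r0 = READ_ONCE(*a);
}

P1(int *a, int *b)
{
        /* [mb01]: the IPI from P0's smp_memb() lands here */
        WRITE_ONCE(*a, 1);
        WRITE_ONCE(*b, 1);
}

P2(int *b, int *c)
{
        int r0;
        int r1;

        r0 = READ_ONCE(*b);
        r1 = READ_ONCE(*c);
        /* [mb02]: the IPI from P0's smp_memb() lands here */
}

exists (0:r0=0 /\ 2:r0=1 /\ 2:r1=0)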
OK, so "simultaneous" IPIs could be emulated in a real implementation by
having sys_membarrier() send each IPI (but not wait for a response), then
execute a full memory barrier and set a shared variable. Each IPI handler
would spin waiting for the shared variable to be set, then execute a full
memory barrier and atomically increment yet another shared variable and
return from interrupt. When that other shared variable's value reached
the number of IPIs sent, the sys_membarrier() would execute its final
(already existing) full memory barrier and return. Horribly expensive
and definitely not recommended, but eminently doable.
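Just to make the recipe concrete, a sketch of that variant in
kernel-style C might look like the following, where the names memb_go,
memb_done, and simultaneous_membarrier() are made up for illustration,
and CPU hotplug, error handling, and restricting the IPIs to CPUs
running the current process are all ignored:

#include <linux/atomic.h>
#include <linux/cpumask.h>
#include <linux/smp.h>

static atomic_t memb_go;        /* releases the spinning IPI handlers */
static atomic_t memb_done;      /* counts handlers that have finished */

static void memb_ipi_handler(void *info)
{
        while (!atomic_read(&memb_go))  /* spin until the "go" signal */
                cpu_relax();
        smp_mb();                       /* full barrier in the handler */
        atomic_inc(&memb_done);         /* report this CPU as done */
}

static void simultaneous_membarrier(void)
{
        int others = num_online_cpus() - 1;

        atomic_set(&memb_go, 0);
        atomic_set(&memb_done, 0);

        /* Send an IPI to each other CPU, without waiting (wait == 0). */
        smp_call_function(memb_ipi_handler, NULL, 0);

        smp_mb();                       /* full barrier, then the release */
        atomic_set(&memb_go, 1);        /* release all handlers at once */

        while (atomic_read(&memb_done) < others)
                cpu_relax();            /* wait for every handler */
        smp_mb();                       /* final (already existing) barrier */
}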
The difference between current sys_membarrier() and the "simultaneous"
variant described above is similar to the difference between
non-multicopy-atomic and multicopy-atomic memory ordering. So, after
thinking it through, my guess is that pretty much any litmus test that
can distinguish multicopy-atomic from non-multicopy-atomic ordering
should be transformable into something that can distinguish between the
current and the "simultaneous" sys_membarrier() implementations.
Seem reasonable?
Or alternatively, may I please apply your Signed-off-by to your earlier
sys_membarrier() patch so that I can queue it? I will probably also
change smp_memb() to membarrier() or some such. Again, within the
Linux kernel, membarrier() can be emulated with smp_call_function()
invoking a handler that does smp_mb().
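In other words, something like the following sketch, where the function
names are made up and all error handling is omitted:

#include <linux/smp.h>

static void membarrier_ipi(void *info)
{
        smp_mb();       /* full barrier on each interrupted CPU */
}

static void membarrier_emulation(void)
{
        smp_mb();       /* order prior accesses before the IPIs */
        smp_call_function(membarrier_ipi, NULL, 1);     /* wait == 1 */
        smp_mb();       /* order the IPIs before later accesses */
}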
Thanx, Paul