This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


RE: Implementing C++1x and C1x atomics (really an aside on SFENCE)

From: "Boehm, Hans" <hans dot boehm at hp dot com>
To: Lawrence Crowl <crowl at google dot com>
Cc: "Joseph S. Myers" <joseph at codesourcery dot com>, Richard Guenther <richard dot guenther at gmail dot com>, Andrew Haley <aph at redhat dot com>, Paolo Bonzini <bonzini at gnu dot org>, "gcc at gcc dot gnu dot org" <gcc at gcc dot gnu dot org>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>
Date: Wed, 9 Sep 2009 23:41:36 +0000
Subject: RE: Implementing C++1x and C1x atomics (really an aside on SFENCE)
References: <4A82E93B.5010504@redhat.com> <Pine.LNX.4.64.0908140013110.8510@digraph.polyomino.org.uk> <29bd08b70908141131k34cb0929o537ad7762364a4d1@mail.gmail.com> <Pine.LNX.4.64.0908141910040.21915@digraph.polyomino.org.uk> <29bd08b70908141356h527ae059h655914ac30b74f30@mail.gmail.com> <Pine.LNX.4.64.0908142244390.21915@digraph.polyomino.org.uk> <29bd08b70908171516taa15961y5aa57fc7199b8a67@mail.gmail.com> <Pine.LNX.4.64.0908172220190.26234@digraph.polyomino.org.uk> <29bd08b70908191551v6fa0370dp9e3bad19e21ae76a@mail.gmail.com> <238A96A773B3934685A7269CC8A8D042577A1AD998@GVW0436EXB.americas.hpqcorp.net><29bd08b70909091551i6be259e4m8737791f3f7fbb60@mail.gmail.com>

> From: Lawrence Crowl [mailto:crowl@google.com] 
> 
> On 8/20/09, Boehm, Hans <hans.boehm@hp.com> wrote:
> > > -----Original Message-----
> > > From: Lawrence Crowl [mailto:crowl@google.com]
> > >
> > > The problem is that gcc does support 80386.  It also supports other
> > > processors that have less-than-complete support for concurrency.
> > > Just in the x86 line, we get some additional capability in many
> > > new layers.
> > >
> > >   8086        LOCK XCHG
> > >   80486       CMPXCHG XADD
> > >   Pentium     CMPXCHG8B
> > >   SSE         SFENCE
> >
> > Aside to an interesting discussion:
> >
> > I believe the current conclusion is that SFENCE should be ignored, 
> > except for library or compiler-generated code that uses 
> > non-temporal/coalescing stores, which I believe are also a recent 
> > addition.  Normal stores are ordered anyway, so it's not needed.
> > Thus you are faced with a choice of either (a) implementing fences on
> > the assumption that ordinary code may contain non-temporal stores, or
> > (b) making sure that non-temporal stores are always surrounded by the
> > appropriate fences.  This is really an important ABI issue, but it's
> > something that I believe no ABI currently specifies.  Our conclusion
> > in earlier discussions among a different group of people was that (b)
> > made more sense, since non-temporal stores of various kinds seemed to
> > be largely confined to a few library routines.
> 
> Hm.  I would expect that given the C++0x memory model, 
> compilers could be much more aggressive about using 
> non-temporal stores, potentially improving performance 
> substantially.  That is, it may be better to accept a 
> slightly less efficient ABI for today's compilers to gain a 
> more efficient ABI for tomorrow's compilers.
> 
> > It would be really nice if everyone somehow managed to agree on this.
> > Inconsistency here, probably even between Windows and Linux, seems 
> > likely to result in really subtle bugs.
> >
> > Note that this also affects correctness of spinlock implementations,
> > not just atomics.  A simple store to release a lock doesn't work if 
> > the critical section may contain unfenced non-temporal stores.
> 
> Yes, but the spinning acquire doesn't require the fence, only
> the release.  So, is this additional instruction a
> performance problem?
> 
I haven't looked at this terribly systematically.  I do know that in Pentium 4 days, sfence was tremendously expensive (basically equivalent to mfence or cmpxchg, i.e. 100+ cycles), even in contexts in which it was a no-op.  Thus ABI convention (a) roughly doubles the (already very high) cost of an uncontended spin-lock on a Pentium 4.  I suspect that got better on later implementations, but I'm not sure by how much.
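
To make the cost difference concrete, here is a minimal sketch of an uncontended spin lock released under each convention. It assumes C11 atomics and the `_mm_sfence` intrinsic; the names `demo_spinlock`, `spin_release_a`, `spin_release_b` and `demo_uncontended` are hypothetical and not taken from any ABI or from glibc. Off x86 the fence macro falls back to a C11 release fence purely so the sketch still compiles.

```c
#include <assert.h>
#include <stdatomic.h>

#if defined(__SSE__)
#include <xmmintrin.h>                 /* _mm_sfence */
#define STORE_FENCE() _mm_sfence()
#else
/* Stand-in so the sketch builds on non-x86 targets (assumption). */
#define STORE_FENCE() atomic_thread_fence(memory_order_release)
#endif

/* Hypothetical lock type for illustration only. */
typedef struct { atomic_int locked; } demo_spinlock;

/* Acquire spins on an atomic exchange; no extra store fence is needed here. */
static void spin_acquire(demo_spinlock *l) {
    while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
        ;  /* spin until the previous value was 0 */
}

/* Convention (a): the critical section may contain unfenced non-temporal
   stores, so the release path pays for an sfence before the plain store. */
static void spin_release_a(demo_spinlock *l) {
    STORE_FENCE();
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}

/* Convention (b): code using non-temporal stores fences them itself,
   so the lock can be released with a simple store. */
static void spin_release_b(demo_spinlock *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}

/* Single-threaded walk through both release paths; returns the counter. */
static int demo_uncontended(void) {
    demo_spinlock lock = { 0 };
    int counter = 0;

    spin_acquire(&lock);
    counter++;
    spin_release_a(&lock);
    assert(atomic_load(&lock.locked) == 0);

    spin_acquire(&lock);
    counter++;
    spin_release_b(&lock);
    assert(atomic_load(&lock.locked) == 0);

    return counter;
}
```

The only difference between the two release functions is the extra fence, which is the instruction whose cost is in question here.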

I think the only non-temporal stores on X86 are vector instructions.  I would guess that for many applications neither these nor spin-lock times matter a lot, and for most of the rest, these vector instructions won't make up for the cost of doubling spin-lock execution times.  If you do manage to automatically generate non-temporal stores at all, you will usually generate a bunch of them between potential synchronization operations, so that you can amortize the sfence.  As I recall, we did look briefly during earlier discussions, and didn't find them used much even in hand-crafted libc code.
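
The amortization argument can be sketched as a routine that issues a batch of non-temporal vector stores and then a single sfence, in the style of convention (b). This assumes the SSE2 intrinsics `_mm_stream_si128` and `_mm_sfence`; the helper `fill_streaming` and the buffer `demo_buf` are hypothetical. The sketch issues only a trailing fence; whether a leading fence is also required is part of the convention that the discussion leaves open. Without SSE2 it falls back to ordinary stores so that it still compiles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_set1_epi32, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical 16-byte-aligned buffer used by the usage example. */
static _Alignas(16) int32_t demo_buf[64];

/* Fill n 32-bit words at dst with value.
   Assumptions: dst is 16-byte aligned and n is a multiple of 4. */
static void fill_streaming(int32_t *dst, size_t n, int32_t value) {
#if defined(__SSE2__)
    __m128i v = _mm_set1_epi32(value);
    for (size_t i = 0; i < n; i += 4) {
        /* Non-temporal store: weakly ordered, bypasses the cache hierarchy. */
        _mm_stream_si128((__m128i *)(dst + i), v);
    }
    /* One fence for the whole batch, so its cost is spread over n/4 stores. */
    _mm_sfence();
#else
    /* Fallback with ordinary stores on targets without SSE2 (assumption). */
    for (size_t i = 0; i < n; i++)
        dst[i] = value;
#endif
}
```

A caller would invoke it as `fill_streaming(demo_buf, 64, 7);`. Because the routine fences its own stores, a lock released afterwards could still use a simple store.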

But this is all hand-waving and guessing.  Certainly real measurements would be much better.

The most important issue of course is that we need to stick to one convention or the other.  Currently a lot of code seems to assume that an X86 spin lock can be released with a simple store, so invalidating that would be tricky, especially since sfence was a fairly recent introduction.

Hans

References:
- Re: Implementing C++1x and C1x atomics (really an aside on SFENCE)
  - From: Lawrence Crowl
