This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
RE: Implementing C++1x and C1x atomics (really an aside on SFENCE)
- From: "Boehm, Hans" <hans dot boehm at hp dot com>
- To: Lawrence Crowl <crowl at google dot com>
- Cc: "Joseph S. Myers" <joseph at codesourcery dot com>, Richard Guenther <richard dot guenther at gmail dot com>, Andrew Haley <aph at redhat dot com>, Paolo Bonzini <bonzini at gnu dot org>, "gcc at gcc dot gnu dot org" <gcc at gcc dot gnu dot org>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>
- Date: Wed, 9 Sep 2009 23:41:36 +0000
- Subject: RE: Implementing C++1x and C1x atomics (really an aside on SFENCE)
- References: <4A82E93B.5010504@redhat.com> <Pine.LNX.4.64.0908140013110.8510@digraph.polyomino.org.uk> <29bd08b70908141131k34cb0929o537ad7762364a4d1@mail.gmail.com> <Pine.LNX.4.64.0908141910040.21915@digraph.polyomino.org.uk> <29bd08b70908141356h527ae059h655914ac30b74f30@mail.gmail.com> <Pine.LNX.4.64.0908142244390.21915@digraph.polyomino.org.uk> <29bd08b70908171516taa15961y5aa57fc7199b8a67@mail.gmail.com> <Pine.LNX.4.64.0908172220190.26234@digraph.polyomino.org.uk> <29bd08b70908191551v6fa0370dp9e3bad19e21ae76a@mail.gmail.com> <238A96A773B3934685A7269CC8A8D042577A1AD998@GVW0436EXB.americas.hpqcorp.net> <29bd08b70909091551i6be259e4m8737791f3f7fbb60@mail.gmail.com>
> From: Lawrence Crowl [mailto:crowl@google.com]
>
> On 8/20/09, Boehm, Hans <hans.boehm@hp.com> wrote:
> > > -----Original Message-----
> > > From: Lawrence Crowl [mailto:crowl@google.com]
> > >
> > > The problem is that gcc does support 80386. It also supports other
> > > processors that have less-than-complete support for concurrency.
> > > Just in the x86 line, we get some additional capability in many
> > > new layers.
> > >
> > > 8086     LOCK XCHG
> > > 80486    CMPXCHG XADD
> > > Pentium  CMPXCHG8B
> > > SSE      SFENCE
> >
> > Aside to an interesting discussion:
> >
> > I believe the current conclusion is that SFENCE should be ignored,
> > except for library or compiler-generated code that uses
> > non-temporal/coalescing stores, which I believe are also a recent
> > addition. Normal stores are ordered anyway, so it's not needed.
> > Thus you are faced with a choice of either (a) implementing fences on
> > the assumption that ordinary code may contain non-temporal stores, or
> > (b) making sure that non-temporal stores are always surrounded by the
> > appropriate fences. This is really an important ABI issue, but it's
> > something that I believe no ABI currently specifies. Our conclusion
> > in earlier discussions among a different group of people was that (b)
> > made more sense, since non-temporal stores of various kinds seemed to
> > be largely confined to a few library routines.
>
> Hm. I would expect that given the C++0x memory model,
> compilers could be much more aggressive about using
> non-temporal stores, potentially improving performance
> substantially. That is, it may be better to accept a
> slightly less efficient ABI for today's compilers to gain a
> more efficient ABI for tomorrow's compilers.
>
> > It would be really nice if everyone somehow managed to agree on this.
> > Inconsistency here, probably even between Windows and Linux, seems
> > likely to result in really subtle bugs.
> >
> > Note that this also affects correctness of spinlock implementations,
> > not just atomics. A simple store to release a lock doesn't work if
> > the critical section may contain unfenced non-temporal stores.
>
> Yes, but the spinning acquire doesn't require the fence, only
> the release. So, is this additional instruction a
> performance problem?
>
I haven't looked at this terribly systematically. I do know that in Pentium 4 days, sfence was tremendously expensive (basically equivalent to mfence or cmpxchg, i.e. 100+ cycles), even in contexts in which it was a no-op. Thus ABI convention (a) roughly doubles the (already very high) cost of an uncontended spin-lock on a Pentium 4. I suspect that got better on later implementations, but I'm not sure by how much.
I think the only non-temporal stores on X86 are vector instructions. I would guess that for many applications neither these nor spin-lock times matter a lot, and for most of the rest, these vector instructions won't make up for the cost of doubling spin-lock execution times. If you do manage to automatically generate non-temporal stores at all, you will usually generate a bunch of them between potential synchronization operations, so that you can amortize the sfence. As I recall, we did look briefly during earlier discussions, and didn't find them used much even in hand-crafted libc code.
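A minimal sketch of that amortization under convention (b), assuming GCC or Clang on x86 with SSE2 available. The helper name nt_fill32 is hypothetical and not taken from libc: it issues a run of non-temporal vector stores and then a single sfence, so the fence cost is paid once per batch rather than once per store.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_sfence */
#include <emmintrin.h>  /* SSE2: _mm_set1_epi32, _mm_stream_si128 */

/* Hypothetical helper (not from libc): fill n_vec 16-byte blocks at dst
 * with value, using non-temporal stores. dst must be 16-byte aligned. */
static void nt_fill32(int32_t *dst, size_t n_vec, int32_t value)
{
    __m128i v = _mm_set1_epi32(value);

    for (size_t i = 0; i < n_vec; i++) {
        /* Non-temporal store: not ordered with respect to later
         * ordinary stores until a fence is executed. */
        _mm_stream_si128((__m128i *)(dst + 4 * i), v);
    }

    /* One sfence for the whole batch, so callers can keep releasing
     * locks with a plain store (convention (b)). */
    _mm_sfence();
}
```

With this shape the routine that uses non-temporal stores carries the fence itself, and code that never uses them pays nothing.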
But this is all hand-waving and guessing. Certainly real measurements would be much better.
The most important issue of course is that we need to stick to one convention or the other. Currently a lot of code seems to assume that an X86 spin lock can be released with a simple store, so invalidating that would be tricky, especially since sfence was a fairly recent introduction.
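To make the two conventions concrete, here is a rough sketch of a spin lock with both release variants, assuming C11 atomics and the SSE intrinsics header on x86. All names (demo_spinlock, demo_lock, demo_unlock_plain, demo_unlock_fenced) are made up for illustration and do not come from glibc or any ABI document.

```c
#include <stdatomic.h>
#include <xmmintrin.h>  /* SSE: _mm_sfence, _mm_pause */

/* Hypothetical lock type for illustration only. */
typedef struct {
    atomic_int locked;  /* 0 = free, 1 = held */
} demo_spinlock;

static void demo_lock(demo_spinlock *l)
{
    /* The acquire side needs no sfence under either convention. */
    while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire)) {
        /* Spin on a plain load until the lock looks free. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            _mm_pause();
    }
}

/* Convention (b): release with a simple store. This is only correct if
 * any non-temporal stores in the critical section were already fenced. */
static void demo_unlock_plain(demo_spinlock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}

/* Convention (a): assume the critical section may contain unfenced
 * non-temporal stores, so every release pays for an sfence first. */
static void demo_unlock_fenced(demo_spinlock *l)
{
    _mm_sfence();
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

The two unlock functions differ by exactly one instruction, which is the cost being weighed above; mixing code built for one convention with code built for the other is where the subtle bugs would come from.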
Hans