This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: Transition to C11 atomics and memory model
- From: Torvald Riegel <triegel at redhat dot com>
- To: "Carlos O'Donell" <carlos at redhat dot com>
- Cc: GLIBC Devel <libc-alpha at sourceware dot org>, David Miller <davem at davemloft dot net>
- Date: Mon, 15 Sep 2014 21:52:19 +0200
- Subject: Re: Transition to C11 atomics and memory model
- Authentication-results: sourceware.org; auth=none
- References: <1410719669 dot 4967 dot 160 dot camel at triegel dot csb> <54170CCC dot 6030706 at redhat dot com>
On Mon, 2014-09-15 at 11:59 -0400, Carlos O'Donell wrote:
> Thanks for the email, very good questions.
> SPARC pre-v9 question at the bottom for you.
> On 09/14/2014 02:34 PM, Torvald Riegel wrote:
> > I think we should transition to using the C11 memory model and atomics
> > instead of the current custom implementation. There are two reasons for
> > this:
> Architecturally I think that glibc transitioning to the C11 memory model
> is the *only* way forward.
> > I propose that the first phase of our transition to C11 focus on uses of
> > the atomic operations. In particular, the rules are:
> > * All accesses to atomic vars need to use atomic_* functions. IOW, all
> > non-atomic accesses must not be subject to data races. The only exception
> > is initialization (ie, when the variable is not visible to any other
> > thread); nonetheless, initialization accesses must not result in data
> > races with other accesses. (This exception isn't allowed by C11, but
> > eases the transition to C11 atomics and likely works fine in current
> > implementations; as alternative, we could require MO-relaxed stores for
> > initialization as well.)
> At present we rely on small word-length writes to complete atomically,
> would you suggest we have to wrap those in true atomic operations?
Yes! That's what the C11 model requires. And we don't use special
types for atomically accessed variables, so if we don't use atomic ops
for every access there's no way for the compiler to figure out which
memory locations are accessed concurrently, and which aren't.
> Won't this hurt performance?
I don't think so. We'll use memory_order_relaxed in each case where we
have a plain memory access right now to concurrently accessed data. The
only reasons I can think of that might lead to a decrease in performance are:
* Inefficient implementation of atomics, for example due to a too old
compiler version (see the example that Joseph brought up). That can be
worked around on an arch-specific basis, for example by using inline asm.
* Compilers that don't optimize across memory_order_relaxed atomic ops
and glibc code actually benefits from optimizations by the compiler
across current plain memory accesses. I doubt that this actually
happens in practice, because it would need a loop or such and other
things in the loop would need to be performance-critical -- which is not
a pattern I think is frequent in concurrent code.
* If we currently have code where the compiler combines several plain
memory accesses to concurrently accessed data into one, then we could
have more accesses if using memory_order_relaxed atomics. However, such
an optimization can easily be not what the programmer intended to happen
(e.g., if in a busy-waiting loop -- hence atomic_forced_read...).
> What correctness issue exists?
The compiler needs to know for which memory accesses it can assume
data-race-freedom and for which it cannot. This results in different
optimizations being valid, or not. You can try to make the compiler do
the right thing by forcing it towards doing that, but the clean way is
to actually tell the compiler what it should do.
For example, there were bugs fixed in GCC for optimizations across or
involving atomic operations; these optimizations were valid for
sequential code, but not in a multi-threaded setting. This shows that
things can go wrong if we don't tell the compiler.
> > * Atomic vars aren't explicitly annotated with atomic types, but just
> > use the base types. They need to be naturally aligned. This makes the
> > transition easier because we don't get any dependencies on C11 atomic
> > types.
> > * On a certain architecture, we typically only use the atomic_* ops if
> > the HW actually supports these; we expect to have pointer-sized atomics
> > at most. If the arch has no native support for atomics, it can either
> > use modified algorithms or emulate atomics differently.
> I strongly suggest all such machines should emulate atomics in the kernel
> using kernel-level locks.
That's fine for me. I just didn't want to make this decision for
the machine maintainers.
> The downside of this is that all atomic vars
> must use atomic_* functions because otherwise the release of the lock
> word by a normal store won't order correctly. This already happened on
> hppa with userspace spinlocks.
But that's The Right Thing To Do anyway, so I don't see it as a downside.
IMO, when writing concurrent code, the least of your troubles is writing
atomic_store_relaxed (&foo, 23); instead of foo = 23;
IMHO, it actually aids in readability because it shows where concurrency
has to be considered and where not (ie, for all data accessed by only
one thread).
> > * The atomic ops are similar to the _explicit variation of C11's
> > functions, except that _explicit is replaced with the last part of the
> > MO argument (ie, acquire, release, acq_rel, relaxed, seq_cst). All
> > arguments (except the MO, which is dropped) are the same as for C11.
> > That avoids using the same names yet should make the names easy to
> > understand for people familiar with C11.
> > I also propose an incremental transition. In particular, the steps are
> > roughly:
> > 1) Add new C11-like atomics. If GCC supports them on this architecture,
> > use GCC's atomic builtins. Make them fall back to the existing atomics
> > otherwise. Attached is a small patch that illustrates this.
> > 2) Refactor one use (ie, all the accesses belonging to one algorithm or
> > group of functions that synchronize with each other) at a time. This
> > involves reviewing the code and basically reimplementing the
> > synchronization bits on top of the C11 memory model. We also should
> > take this opportunity to add any documentation of concurrent code that's
> > missing (which is often the case).
> Not OK until we talk about it more.
So, what do you want to talk about? ;)
> > 3) For non-standard atomic ops (eg, atomic_add_negative()), have a look
> > at all uses and decide whether we really need to keep them.
> Agreed e.g. rewrite.
> > 4) Once all of glibc uses the new atomics, remove the old ones for a
> > particular arch if the oldest compiler required has support for the
> > respective builtins.
> > Open questions:
> > * Are the current read/write memory barriers equivalent to C11
> > acquire/release fences? I guess that's the case (did I mention lack of
> > documentation? ;) ), but we should check whether this is true on every
> > architecture (ie, whether the HW instructions used for read/write
> > membars are the same as what the compiler would use for
> > acquire/release). If not, we can't implement acquire/release based on
> > read/write membars but need something else for this arch. I'd
> > appreciate help from the machine maintainers for this one.
> Don't know.
> Create an internals manual? Add a new chapter on atomics? :-)
I had hoped we can do with comments in atomic.h, given that we basically
just need to explain our exceptions.
I agree we should have documentation for the transition, but I also
think this just documents existing state. Every new architecture added
should target C11 right away (and every new architecture's designer had
better make it clear how to implement C11 on top of it).
> > * How do we deal with archs such as older SPARC that don't have CAS and
> > other archs without HW support for atomics? Using modified algorithms
> > should be the best-performing option (eg, if we can use one critical
> > section instead of a complicated alternative that uses lots of atomic
> > operations). However, that means we'll have to maintain more algorithms
> > (even if they might be simpler).
> No. Stop. One algorithm. All arches that can't meet the HW support for
> atomics must enter the kernel and do the work there. This is just like hppa
> and ARM do. They use a light-weight syscall mechanism and serialize in the
> kernel.
Fine with me.
> > Furthermore, do all uses of atomics work well with blocking atomics that
> > might also not be indivisible steps? For example, the cancellation code
> > might be affected because a blocking emulation of atomics won't be
> > async-cancel-safe?
> It will be safe because the kernel emulation should not deliver a signal
> during the emulation.
What about PI code? If we have a non-blocking algorithm in glibc (using
nonblocking atomics), then this will work fine with hard real time
priorities; when using blocking atomics, we might get priority inversion.