[PATCH 2/2] manual: Document __libc_single_threaded

Rich Felker dalias@libc.org
Wed Jun 3 17:11:26 GMT 2020


On Wed, Jun 03, 2020 at 05:00:01PM +0200, Florian Weimer wrote:
> * Rich Felker:
> 
> >> I think it makes more sense not to declare the object as volatile and
> >> make sure that only libc functions which imply the required barrier
> >> write to __libc_single_threaded.  For instance, I expect that this will
> >> allow compilers to generate tighter code around multiple (implied)
> >> reference count updates.
> >
> > It really should be volatile just because whatever you make it is ABI,
> > and there might be good reasons to want to make it updated
> > "asynchronously" with respect to the compiler's model in the future.
> > If you don't make it volatile now you can't do that in the future.
> 
> volatile does not mean that it is safe to update it asynchronously.
> Some barriers are still needed.

If by "asynchronously" we just mean "some code running in the same
task context which the compiler isn't aware of, like a signal handler
or internals of malloc or of a pure function" then yes volatile is
sufficient. This is basically the definition of volatile.

If you mean modification from other threads, then you need additional
synchronization to ensure that the change is seen, but if you're ok
with a change not being seen (roughly the same as relaxed order) then
volatile suffices as a valid way to model it. (The potential change is
an asynchronous change by the hardware, where the hardware in question
happens to be the cache/memory coherency implementation.)

> And if we need acquire MO for the
> access, then the purpose of the variable is completely defeated.

No. Reading a stale value (false instead of true) is perfectly fine.
The other direction cannot happen because the change from true to
false can fundamentally happen only in a single-threaded context. And
if you don't want to read a stale value, SYS_membarrier performed by
the thread changing the value ensures that you won't. But that's not
needed.

> Is it really a significant restriction on implementations if they can
> only set __libc_single_threaded to true from functions which clearly
> imply an acquire barrier?  I don't think so.  pthread_join qualifies,
> and that seems the most important case/

Again doing it only from pthread_join will completely omit the
optimization in any program that waits to join until is has no further
work to do, which is a very reasonable thing to do.

> > The cost of volatile is rather trivial; compiler can still cache
> > address of it, and loading from a cache line you recently loaded from,
> > and to which no stores are happening, is virtually free.
> 
> There is a code size impact because the compiler can no longer merge
> several atomic/non-atomic alternatives.  Consider this example, which I
> think it is not too unrealistic (hopefully, std::shared_ptr will
> eventually compile down to something similar):
> 
> #include <stddef.h>
> 
> struct refcounted
> {
>   size_t count;
> };
> 
> extern char __libc_single_threaded;
> 
> static inline
> void ref (struct refcounted *p)
> {
>   if (__libc_single_threaded)
>     ++p->count;
>   else
>     __atomic_fetch_add (&p->count, 1, __ATOMIC_RELAXED);
> }
> 
> 
> void f1 (struct refcounted *a, struct refcounted *b);
> 
> void
> f2 (struct refcounted *a, struct refcounted *b)
> {
>   /* f1 takes ownership of both objects, but the caller of f2 is
>      expected to retain (shared) ownership as well. */
>   ref (a);
>   ref (b);
>   f1 (a, b);
> }
> 
> As written, this turns into:
> 
> f2:
> 	cmpb	$0, __libc_single_threaded(%rip)
> 	je	.L2
> 	addq	$1, (%rdi)
> ..L3:
> 	addq	$1, (%rsi)
> 	jmp	f1
> ..L2:
> 	lock addq	$1, (%rdi)
> 	cmpb	$0, __libc_single_threaded(%rip)
> 	jne	.L3
> 	lock addq	$1, (%rsi)
> 	jmp	f1
> 
> The jump back to .L3 is due to the current reluctance to optimize
> relaxed MO accesses, due to the brokenness of the C++ memory model
> (although it would be quite safe here, I assume).
> 
> With volatile, it's this instead:
> 
> f2:
> 	movzbl	__libc_single_threaded(%rip), %eax
> 	testb	%al, %al
> 	je	.L2
> 	addq	$1, (%rdi)
> 	movzbl	__libc_single_threaded(%rip), %eax
> 	testb	%al, %al
> 	je	.L4
> ..L7:
> 	addq	$1, (%rsi)
> 	jmp	f1
> ..L2:
> 	lock addq	$1, (%rdi)
> 	movzbl	__libc_single_threaded(%rip), %eax
> 	testb	%al, %al
> 	jne	.L7
> ..L4:
> 	lock addq	$1, (%rsi)
> 	jmp	f1
> 
> So two checks in the non-atomic case, plus a load for the volatile
> access.

I see; indeed for something on this small a scale, it *might* matter.
Has anyone done any measurement to determine whether it does?

Rich


More information about the Libc-alpha mailing list