This is the mail archive of the
mailing list for the libc-ports project.
Re: PI mutex support for pthread_cond_* now in nptl
On Wed, 2013-02-20 at 10:59 -0600, Steven Munroe wrote:
> On Tue, 2013-02-19 at 21:06 +0100, Torvald Riegel wrote:
> > On Tue, 2013-02-19 at 17:18 +0000, Joseph S. Myers wrote:
> > > On Tue, 19 Feb 2013, Richard Henderson wrote:
> > >
> > > > Any chance we can move these macros into a generic linux header?
> > > > Given that we're using INTERNAL_SYSCALL macros, the definitions ought to be
> > > > the same for all targets.
> > >
> > > Generally most of lowlevellock.h should probably be shared between
> > > architectures. (If some architectures don't implement a particular
> > > feature as of a particular kernel version, that's a matter for
> > > kernel-features.h and __ASSUME_* conditionals.)
> > On a related note: What are the reasons to have arch-specific assembler
> > versions of many of the synchronization operations? I would be
> > surprised if they'd provide a significant performance advantage; does
> > anyone have recent measurements for this?
> The introduction of GCC compiler builtins like __sync is fairly recent
> and the new __atomic builtins start with GCC-4.7. So until recently we
> had no choice.
Using assembler for the atomic operations is possible (e.g., as in
Boehm's libatomic-ops, or in ./sysdeps/powerpc/bits/atomic.h and others).
It doesn't allow for the same level of compiler optimization across
barriers, but it's unclear whether that has much benefit, and GCC
doesn't do it yet anyway.
There are some cases in which compilers that don't support the C11/C++11
memory model can generate code that wouldn't be correct in such a model,
and which can theoretically interfere with other concurrent code (e.g.,
introduce data races due to accesses being too wide). However, because
we don't have custom assembler for everything, we should already be
exposed to that.
> For platforms (like PowerPC) that implement acquire/release the GCC
> __sync builtins are not sufficient and GCC-4.7 __atomic builtins are not
> pervasive enough to make that the default.
I agree regarding the __sync builtins, but using assembler in place of
the __atomic builtins should work, shouldn't it?
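
To make the difference between the two builtin families concrete, here is a
minimal sketch of a try-lock written both ways. The type and function names
(demo_lock, demo_trylock_sync, and so on) are hypothetical and are not taken
from glibc; the sketch only assumes a GCC that provides the __sync and
__atomic builtins (the latter from GCC 4.7 on).

```c
#include <stdbool.h>

/* Hypothetical lock word for illustration only: 0 = free, 1 = held. */
typedef struct { int word; } demo_lock;

/* Legacy __sync builtin: the compare-and-swap always acts as a full
   barrier, so an acquire-only ordering cannot be requested. */
static bool demo_trylock_sync(demo_lock *l)
{
    return __sync_bool_compare_and_swap(&l->word, 0, 1);
}

/* __atomic builtin: acquire ordering on success, relaxed on failure.
   On targets with distinct acquire/release barriers the compiler may
   emit a cheaper sequence than a full barrier. */
static bool demo_trylock_acquire(demo_lock *l)
{
    int expected = 0;
    return __atomic_compare_exchange_n(&l->word, &expected, 1,
                                       false /* strong CAS */,
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

/* Release store: earlier writes become visible before the lock is freed. */
static void demo_unlock_release(demo_lock *l)
{
    __atomic_store_n(&l->word, 0, __ATOMIC_RELEASE);
}
```

Hand-written assembler for the same operations would sit behind the same
three function signatures, which is why it could stand in for the __atomic
builtins on compilers that lack them.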
> > It seems to me that it would be useful to consolidate the different
> > versions that exist for the synchronization operations into shared C
> > code as long as this doesn't make a significant performance difference.
> > They are all based on atomic operations and futex operations, both of
> > which we have in C code (especially if we have compilers that support
> > the C11 memory model). Or are there other reasons for keeping different
> > versions that I'm not aware of?
> I disagree. The performance of lowlevellocks and the associated
> platform-specific optimizations is too important to move forward with
> the consolidation you suggest.
Which specific optimizations do you refer to? I didn't see any for
powerpc, for example (i.e., the lock fast path is C up to the point of
the atomic operation). The ones that I saw are for x86, and I'm
wondering whether they provide much benefit, especially because they
can matter mostly for the execution path taken when a free lock is
acquired; once you get any cache miss, you're to some extent on the slow
path anyway. Also, for the Linux platforms I looked at, the mutex
algorithms are the same.
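
As a rough illustration of what "C up to the point of the atomic operation"
can look like, here is a minimal sketch of a futex-based mutex in plain C.
It is not the glibc lowlevellock code; the names sketch_mutex, sketch_lock,
sketch_unlock and sketch_futex are hypothetical, and the three-state scheme
(0 = free, 1 = locked, 2 = locked with possible waiters) is an assumption
chosen for brevity. It assumes Linux and a GCC with the __atomic builtins.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical mutex: 0 = free, 1 = locked, 2 = locked, maybe waiters. */
typedef struct { int state; } sketch_mutex;

/* Thin wrapper around the raw futex system call. */
static long sketch_futex(int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void sketch_lock(sketch_mutex *m)
{
    int expected = 0;

    /* Fast path: a single acquire CAS when the lock is free. */
    if (__atomic_compare_exchange_n(&m->state, &expected, 1, false,
                                    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
        return;

    /* Slow path: mark the lock contended and sleep in the kernel until
       the exchange observes it free. The lock is then held in state 2,
       which is conservative but correct. */
    while (__atomic_exchange_n(&m->state, 2, __ATOMIC_ACQUIRE) != 0)
        sketch_futex(&m->state, FUTEX_WAIT_PRIVATE, 2);
}

static void sketch_unlock(sketch_mutex *m)
{
    /* Release the lock; enter the kernel only if a waiter may exist. */
    if (__atomic_exchange_n(&m->state, 0, __ATOMIC_RELEASE) == 2)
        sketch_futex(&m->state, FUTEX_WAKE_PRIVATE, 1);
}
```

In this sketch the uncontended lock and unlock each cost one atomic
operation and no system call, so any arch-specific tuning could only shave
instructions around that one operation.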
Do you have any recent measurements (or could you point to them) that show
the benefit of the optimizations you refer to?
For example, we've spent quite some time debugging a PI cond var failure
in the past, and this wasn't made any easier by having several
(different) versions of the cond var implementation.