This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: PowerPC: Sync pthread_once with default implementation
- From: Torvald Riegel <triegel at redhat dot com>
- To: Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>
- Cc: "GNU C. Library" <libc-alpha at sourceware dot org>, "Steven J. Munroe" <sjmunroe at us dot ibm dot com>
- Date: Thu, 17 Apr 2014 18:03:17 +0200
- Subject: Re: PowerPC: Sync pthread_once with default implementation
- Authentication-results: sourceware.org; auth=none
- References: <534BF5EB dot 4000608 at linux dot vnet dot ibm dot com>
On Mon, 2014-04-14 at 11:51 -0300, Adhemerval Zanella wrote:
> This patch removes the arch specific powerpc implementation and instead
> uses the linux default one. Although the current powerpc implementation
> already contains the required memory barriers for correct
> initialization, the default implementation shows a better performance on
> newer chips.
>
> I checked the default implementations with some other tests for powerpc
> and everything looks ok. The only nit that was puzzling me was code
> difference I saw checking if a load acquire instruction sequence would
> yield better performance than the relaxed load plus acquire memory fence.
> From Torvald Riegel comment:
>
> > Okay. If lwsync is preferable for an acquire load, it might be best to
> > check the GCC atomics implementation because I believe it's following
> > http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html, which suggests
> > isync for acquire loads.
>
> I rechecked ISA documentation and some literature and the only inhibitor
> about using an 'lwsync' as an acquire barrier is because it should not
> be used on 'Write Through Required and Caching Inhibited' storage location,
> which AFAIK are used only in restricted kernel areas. In fact, load/store
> with reservations does not work on such kind of memory attributes, so for
> general userland lock mechanism 'lwsync' is safe to be used as acquire
> memory fence.
>
> GCC it self translate C11 atomic_thread_fence (memory_order_acquire) to
> 'lwsync', so a relaxed load plus memory acquire fence will be translated
> to 'ld; lwsync'. Two advantage by using 'ld; cmp; bc; isync' for
> acquire loads is 1) it is more strict, thus safer (although still not
> really required) 2) is it performs much better some chips (POWER6), while
> showing very similar performance on other POWER platforms.
Thanks for looking into this.
> I will see if it is worth to change the current core for acquire load/
> release store, at least for POWER. The only chip that I saw a big improvement
> of doing it is on POWER6.
Sounds good. I think that in the long term, having atomic.h provide
acquire loads and release stores wouldn't be bad. I'd prefer doing that
instead of running with a POWER6-specific pthread_once variant.
> Also, this testcase itself is single-thread and think it would be worth to
> extend it to check contention on multithread cases as well.
Probably. Although given that the contented case happens once per
pthread_once instance, it could be rather hard to trigger; to increase
the chance that one hits it often, one would need to involve other
synchronization, which could skew the results.
The patch is OK in my opinion.