This is the mail archive of the
mailing list for the glibc project.
Re: PowerPC: Sync pthread_once with default implementation
- From: Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>
- To: libc-alpha at sourceware dot org
- Date: Thu, 17 Apr 2014 14:02:27 -0300
- Subject: Re: PowerPC: Sync pthread_once with default implementation
- Authentication-results: sourceware.org; auth=none
- References: <534BF5EB dot 4000608 at linux dot vnet dot ibm dot com> <1397750597 dot 10643 dot 23053 dot camel at triegel dot csb>
On 17-04-2014 13:03, Torvald Riegel wrote:
> On Mon, 2014-04-14 at 11:51 -0300, Adhemerval Zanella wrote:
>> This patch removes the arch specific powerpc implementation and instead
>> uses the linux default one. Although the current powerpc implementation
>> already contains the required memory barriers for correct
>> initialization, the default implementation shows a better performance on
>> newer chips.
>> I checked the default implementations with some other tests for powerpc
>> and everything looks ok. The only nit that was puzzling me was code
>> difference I saw checking if a load acquire instruction sequence would
>> yield better performance than the relaxed load plus acquire memory fence.
>> From Torvald Riegel comment:
>>> Okay. If lwsync is preferable for an acquire load, it might be best to
>>> check the GCC atomics implementation because I believe it's following
>>> http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html, which suggests
>>> isync for acquire loads.
>> I rechecked ISA documentation and some literature and the only inhibitor
>> about using an 'lwsync' as an acquire barrier is because it should not
>> be used on 'Write Through Required and Caching Inhibited' storage location,
>> which AFAIK are used only in restricted kernel areas. In fact, load/store
>> with reservations does not work on such kind of memory attributes, so for
>> general userland lock mechanism 'lwsync' is safe to be used as acquire
>> memory fence.
>> GCC it self translate C11 atomic_thread_fence (memory_order_acquire) to
>> 'lwsync', so a relaxed load plus memory acquire fence will be translated
>> to 'ld; lwsync'. Two advantage by using 'ld; cmp; bc; isync' for
>> acquire loads is 1) it is more strict, thus safer (although still not
>> really required) 2) is it performs much better some chips (POWER6), while
>> showing very similar performance on other POWER platforms.
> Thanks for looking into this.
>> I will see if it is worth to change the current core for acquire load/
>> release store, at least for POWER. The only chip that I saw a big improvement
>> of doing it is on POWER6.
> Sounds good. I think that in the long term, having atomic.h provide
> acquire loads and release stores wouldn't be bad. I'd prefer doing that
> instead of running with a POWER6-specific pthread_once variant.
>> Also, this testcase itself is single-thread and think it would be worth to
>> extend it to check contention on multithread cases as well.
> Probably. Although given that the contented case happens once per
> pthread_once instance, it could be rather hard to trigger; to increase
> the chance that one hits it often, one would need to involve other
> synchronization, which could skew the results.
> The patch is OK in my opinion.
Thanks for the review and in fact my idea of measure contention in a multi-thread environment
is not for pthread_once initialization, but rather the difference of latency of using lwsync
and isync. Anyway, I'll commit this patch.