This is the mail archive of the mailing list for the glibc project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: PowerPC: Sync pthread_once with default implementation

On 17-04-2014 13:03, Torvald Riegel wrote:
> On Mon, 2014-04-14 at 11:51 -0300, Adhemerval Zanella wrote:
>> This patch removes the arch specific powerpc implementation and instead
>> uses the linux default one.  Although the current powerpc implementation
>> already contains the required memory barriers for correct
>> initialization, the default implementation shows a better performance on
>> newer chips.
>> I checked the default implementations with some other tests for powerpc
>> and everything looks ok.  The only nit that was puzzling me was code
>> difference I saw checking if a load acquire instruction sequence would
>> yield better performance than the relaxed load plus acquire memory fence.
>> From Torvald Riegel comment:
>>> Okay.  If lwsync is preferable for an acquire load, it might be best to
>>> check the GCC atomics implementation because I believe it's following
>>>, which suggests
>>> isync for acquire loads.
>> I rechecked ISA documentation and some literature and the only inhibitor
>> about using an 'lwsync' as an acquire barrier is because it should not
>> be used on 'Write Through Required and Caching Inhibited' storage location,
>> which AFAIK are used only in restricted kernel areas.  In fact, load/store
>> with reservations does not work on such kind of memory attributes,  so for
>> general userland lock mechanism 'lwsync' is safe to be used as acquire
>> memory fence.
>> GCC it self translate C11 atomic_thread_fence (memory_order_acquire) to
>> 'lwsync', so a relaxed load plus memory acquire fence will be translated
>> to 'ld; lwsync'.  Two advantage by using 'ld; cmp; bc; isync' for
>> acquire loads is 1) it is more strict, thus safer (although still not
>> really required) 2) is it performs much better some chips (POWER6), while
>> showing very similar performance on other POWER platforms.
> Thanks for looking into this.
>> I will see if it is worth to change the current core for acquire load/
>> release store, at least for POWER.  The only chip that I saw a big improvement
>> of doing it is on POWER6.
> Sounds good.  I think that in the long term, having atomic.h provide
> acquire loads and release stores wouldn't be bad.  I'd prefer doing that
> instead of running with a POWER6-specific pthread_once variant.
>> Also, this testcase itself is single-thread and think it would be worth to
>> extend it to check contention on multithread cases as well.
> Probably.  Although given that the contented case happens once per
> pthread_once instance, it could be rather hard to trigger; to increase
> the chance that one hits it often, one would need to involve other
> synchronization, which could skew the results.
> The patch is OK in my opinion.

Thanks for the review and in fact my idea of measure contention in a multi-thread environment
is not for pthread_once initialization, but rather the difference of latency of using lwsync
and isync.  Anyway, I'll commit this patch.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]