This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Unify pthread_once (bug 15215)


On 7 April 2014 14:16, Torvald Riegel <triegel@redhat.com> wrote:
> On Mon, 2014-04-07 at 13:46 +0100, Will Newton wrote:
>> On 7 April 2014 13:37, Torvald Riegel <triegel@redhat.com> wrote:
>> > On Fri, 2014-03-28 at 19:29 -0400, Carlos O'Donell wrote:
>> >> David, Marcus, Joseph, Mike, Andreas, Steve, Chris,
>> >>
>> >> We would like to unify all C-based pthread_once implementations
>> >> per the plan in bug 15215 for glibc 2.20.
>> >>
>> >> Your machines are on the list of C-based pthread_once implementations.
>> >>
>> >> See this for the initial discussions on the unified pthread_once:
>> >> https://sourceware.org/ml/libc-alpha/2013-05/msg00210.html
>> >>
>> >> The goal is to provide a single and correct C implementation of
>> >> pthread_once. Architectures can then build on that if they need more
>> >> efficient implementations, but I don't encourage that and I'd rather
>> >> see deep discussions on how to make one unified solution where
>> >> possible.
>> >>
>> >> I've also just reviewed Torvald's new pthread_once microbenchmark which
>> >> you can use to compare your previous C implementation with the new
>> >> standard C implementation (measures pthread_once latency). The primary
>> >> use of this test is to help provide objective evidence for or against
>> >> keeping the i386 and x86_64 assembly implementations.
>> >>
>> >> We are not presently converting any of the machines with custom
>> >> implementations, but that will be a next step after testing with the
>> >> help of the maintainers for sh, i386, x86_64, powerpc, s390 and alpha.
>> >>
>> >> If we don't hear any objections we will go forward with this change
>> >> in one week and unify ia64, hppa, mips, tile, sparc, m68k, arm
>> >> and aarch64 on a single pthread_once implementation based on sparc's C
>> >> implementation.
>> >
>> > So far, I've seen an okay for tile, and a question about ARM.  Will, are
>> > you okay with the change for ARM?
>>
>> From a correctness and maintainability standpoint it looks good. I
>> have concerns about the performance but I will leave that call to the
>> respective ARM and AArch64 maintainers.
>>
>> In your original post you speculate it may be possible to improve
>> performance on ARM:
>>
>> "I'm currently also using the existing atomic_{read/write}_barrier
>> functions instead of not-yet-existing load_acq or store_rel functions.
>> I'm not sure whether the latter can have somewhat more efficient
>> implementations on Power and ARM; if so, and if you're concerned about
>> the overhead, we can add load_acq and store_rel to atomic.h and start
>> using it"
>>
>> It would be interesting to know how much work that would be and what
>> the performance improvements might be like.
>
> I had a quick look at the arm and aarch64 barrier definitions, and they
> only define a full barrier, but not separate read / write barriers.
> I believe that is part of the performance problem, since a full barrier
> should be significantly more costly than an acquire barrier.
>
> I guess read/write barriers as used in glibc are semantically equivalent
> to acquire / release as in C11, but I'm not quite sure given that some
> architectures use stronger barriers for read/write than acquire/release.
> Cleaning that up would require reviewing plenty of code.  But one could
> also start incrementally, leaving the existing barrier definitions
> unchanged and reviewing uses one by one.  In the long term, I think we would
> benefit from using C11 atomics throughout glibc; in some cases, existing
> custom assembly might be faster (e.g., that has been one comment
> regarding, IIRC, powerpc low-level locks) -- but maybe we can achieve
> this with custom memory orders for atomics as well, or something
> similar.
> In any case, cleaning this up is not specific to pthread_once.
>
> Second, suggested mappings from C11 acquire/release to arm
> (http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html) show differences
> between acquire loads and acquire barriers, but I don't know whether these
> would result in a performance difference.

ARMv8 (ARM and AArch64) defines load-acquire and store-release
instructions, so on these systems we can do better than dmb. Hopefully
we can just use the C11 API to access them, but I haven't tested whether
gcc actually does the right thing...
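As a rough, untested sketch (function and variable names are just for
illustration, and the generated code depends on the gcc version and
target), the C11 forms we would want look roughly like this; on ARMv8
the hope is that gcc emits ldar/stlr here rather than a plain access
plus a full dmb:

  #include <stdatomic.h>

  static _Atomic int once_state;

  /* Acquire load: ideally a single ldar on AArch64 instead of a
     plain load followed by a full barrier.  */
  int
  read_once_state (void)
  {
    return atomic_load_explicit (&once_state, memory_order_acquire);
  }

  /* Release store: ideally stlr on AArch64 instead of a full
     barrier followed by a plain store.  */
  void
  set_once_state (int value)
  {
    atomic_store_explicit (&once_state, value, memory_order_release);
  }

Checking the generated assembly for something like the above would tell
us how much the C11 route actually buys us over the existing full
barriers.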

-- 
Will Newton
Toolchain Working Group, Linaro

