This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Future atomic operation cleanup
- From: Torvald Riegel <triegel at redhat dot com>
- To: Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Fri, 29 Aug 2014 15:27:54 +0200
- Subject: Re: Future atomic operation cleanup
- Authentication-results: sourceware.org; auth=none
- References: <53F74A93 dot 30508 at linux dot vnet dot ibm dot com> <53F74CE4 dot 5070809 at twiddle dot net> <53FF3551 dot 8020503 at linux dot vnet dot ibm dot com> <1409240266 dot 6045 dot 62 dot camel at triegel dot csb> <53FF5889 dot 5080006 at linux dot vnet dot ibm dot com> <1409306115 dot 6045 dot 71 dot camel at triegel dot csb> <54005FAA dot 9000802 at linux dot vnet dot ibm dot com>
On Fri, 2014-08-29 at 08:10 -0300, Adhemerval Zanella wrote:
> On 29-08-2014 06:55, Torvald Riegel wrote:
> > On Thu, 2014-08-28 at 13:27 -0300, Adhemerval Zanella wrote:
> >> On 28-08-2014 12:37, Torvald Riegel wrote:
> >>> On Thu, 2014-08-28 at 10:57 -0300, Adhemerval Zanella wrote:
> >>>> On 22-08-2014 11:00, Richard Henderson wrote:
> >>>>> On 08/22/2014 06:50 AM, Adhemerval Zanella wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> Following comments from my first patch to optimize single-thread internal
> >>>>>> glibc locking/atomics [1], I have changed the implementation to use now
> >>>>>> relaxed atomics instead. Addresing the concerns raised in last discussion,
> >>>>>> the primitives are still signal-safe (although not thread-safe), so if future
> >>>>>> malloc implementation is changed to be async-safe, it won't require to a
> >>>>>> adjust powerpc atomics.
> >>>>>>
> >>>>>> For catomic_and and catomic_or I follow the definition at 'include/atomic.h'
> >>>>>> (which powerpc is currently using) and implemented the atomics with acquire
> >>>>>> semantics. The new implementation also is simpler.
> >>>>>>
> >>>>>> On synthetic benchmarks it shows an improvement of 5-10% for malloc
> >>>>>> calls and an performance increase of 7-8% in 483.xalancbmk from
> >>>>>> speccpu2006 (number from an POWER8 machine).
> >>>>>>
> >>>>>> Checked on powerpc64, powerpc32 and powerpc64le.
> >>>>> Wow, that's a lot of boiler-plate.
> >>>>>
> >>>>> When development opens again, can we simplify all of these atomic operations by
> >>>>> assuming compiler primitives? That is, use the __atomic builtins for gcc 4.8
> >>>>> and later, fall back to the __sync builtins for earlier gcc, and completely
> >>>>> drop support for truly ancient compilers that support neither.
> >>>>>
> >>>>> As a bonus we'd get to unify the implementations across all of the targets.
> >>>>>
> >>>>>
> >>>>> r~
> >>>>>
> >>>> I also agree we should move to more a unified implementation (in fact, I
> >>>> plan to get rid of powerpc lowlevellock.h when devel opens again). However
> >>>> I really don't want to either wait or reimplement all the custom atomic to push
> >>>> this optimization...
> >>> I believe that, unless the caveat Joseph mentioned actually applies to
> >>> the archs you're concerned about, using the compiler builtins and doing
> >>> the unification would simplify your patch considerably. It would also
> >>> avoid having to iterate over the changes of your patch again once we do
> >>> all of the unification.
> >>>
> >>>> I think such change will require a lot of iteration and testing, which is not
> >>>> the intend of this patch.
> >>> If we move to C11-like atomics, then those will certainly go in
> >>> incrementally, and exist in parallel with the current atomics for a
> >>> while (until we reviewed all existing code using the old atomics and
> >>> moved it over to the new atomics).
> >>>
> >>> If we use the compiler builtins, we'd also have to test less because we
> >>> have no additional custom atomics implementation we need to maintain.
> >>>
> >> I do agree moving to a compiler specific is the best way, although for current
> >> status where we support GCC 4.4 we still need hand crafted assembly for atomics.
> >> Also, the GCC atomics are only support for 4.7+ afaik [1] and C11 for 4.9.
> > Note that I'm not arguing for using C11 at this point, but using atomics
> > that are very much like the atomic_*_explicit operations. The function
> > names can be different, but functionality, parameter ordering, etc.,
> > would be similar.
> >
> >> We can make what Joseph suggested and move minimum compiler required for GCC 4.7.
> >>
> >> But if we intend to actually add mininum GCC 4.7, I can work on changing all the
> >> atomics to GCC builtins.
> > Another option would be to only enable the new atomics that we don't
> > have right now (e.g., memory_order_relaxed) when building with 4.7 or
> > higher, and falling back to some of the custom assembly ones otherwise.
> > This would mean no performance regressions but the optimizations would
> > only be effective for 4.7 or higher. memory_order_relaxed loads/stores
> > could be an exception, because I think we need to make them explicity,
> > and we don't want to slow down code that now uses plain nonatomic
> > loads/stores in cases where it should actually be memory_order_relaxed
> > accesses.
> >
> >
> I don't like this conditional enable because it will require more logic
> to check for compiler version and add different code generation depending
> of this. If the idea is to simplify and cleanup code I think it is better
> to either add custom arch-specific code that will build regardless of
> compiler or use compiler builtins and get rid off hand-craft assembly.
> I see no point in enabling both.
*If* we can make 4.7 a requirement, then I'm certainly all for using
just the builtins on architectures where they are available and get
inlined.