This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH 1/2] Single thread optimization for malloc atomics
- From: Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>
- To: Torvald Riegel <triegel at redhat dot com>
- Cc: "GNU C. Library" <libc-alpha at sourceware dot org>
- Date: Mon, 05 May 2014 13:59:28 -0300
- Subject: Re: [PATCH 1/2] Single thread optimization for malloc atomics
- Authentication-results: sourceware.org; auth=none
- References: <53610133 dot 3070908 at linux dot vnet dot ibm dot com> <1399041298 dot 32485 dot 6373 dot camel at triegel dot csb> <5363DB7D dot 5080805 at linux dot vnet dot ibm dot com> <1399285032 dot 32485 dot 7075 dot camel at triegel dot csb>
On 05-05-2014 07:17, Torvald Riegel wrote:
> On Fri, 2014-05-02 at 14:53 -0300, Adhemerval Zanella wrote:
>> On 02-05-2014 11:34, Torvald Riegel wrote:
>>> On Wed, 2014-04-30 at 10:57 -0300, Adhemerval Zanella wrote:
>>>> This patch adds a single-thread optimization for malloc atomic usage to
>>>> first check if process is single-thread (ST) and if so use normal
>>>> load/store instead of atomic instructions.
>>>> It is a respin of my initial attempt to add ST optimization on PowerPC.
>>>> Now malloc optimization is arch-agnostic and rely in already defined macros
>>>> in GLIBC (SINGLE_THREAD_P).
>>> I'm not convinced that this is a good change. First, I'd like to see
>>> some performance numbers that show whether the change leads to a
>>> non-negligible increase in performance. That should be a
>>> microbenchmark, so we can track it over time.
>> In fact I would prefer to make it more arch-specific and let the the arch-maintainers
>> evaluate if such change would be preferable. One way to do it is adding a
>> new macro in the bits/libc-locks, some time __libc_atomic_*; and define it as
>> catomic_* in ./nptl/sysdeps/pthread/bits/libc-lock.h. Then arch can be replace
>> its implementation if it suit them.
> I agree that there is arch-specific input to this optimization problem,
> but that doesn't mean that the solution to it will be primarily
> The arch-specific input we have is:
> * Whether an atomic op (probably we'll be considering just RMW ops and
> CAS) is significantly slower than non-atomic code plus a branch. For
> example, the difference should be small on recent x86, but I assume it's
> higher on PowerPC. We'd need a microbenchmark of some sort to track the
> decision we make here.
> * Whether an atomic op that's MT-Safe and AS-Safe is significantly
> slower than an atomic op that's just AS-Safe. That needs a
> microbenchmark as well. For example, x86 currently assumes this is the
> case, but it's something we should revisit I guess.
Which motivated me to propose such change is the malloc code for single-thread.
To give you some numbers, currently on POWER7/3GHz using a simple loop to allocate
32 bytes, it takes about 130ns to each call. Profilers indicate that indeed the
culprit of such long time is the locks being used even in single-thread.
Using my first approach (sequential code for single-thread, no as-safe) the code
the same calls is not done in about 40ns. By just tuning the locks (not the atomics)
it raises to 90ns.
And yes, I'm aware that issue about being non AS-safe. However, currently the CAS
primitives, even for AS-sync one with memory barriers, is done with LL/SC instructions
(ldarx. / stdcx.). And for single-thread case, the memory barriers are not causing
much latency: I check by just removing then (by using CAS primitives without them)
and it did not make difference for performance, the problem is the use of LL/SC
instructions themselves. These are true at least for POWER6, POWER7, and POWER8.
> * If the former is the case, we need arch-specific atomics (i.e.,
> catomic_*) to make use of that.
> Thus, we'd end up with two constants and potentially some arch-specific
> variants. We could differentiate the constants as necessary per user
> (e.g., make one that's malloc-specific), but I'd prefer not to, and that
> can be done in the future.
> With that in place, we can look at the atomics users and see how they
> could exploit AS-Safe or AS-Unsafe+MT-Unsafe scenarios. We'd have a
> normal branch and let the compiler remove the dead code (we can use the
> preprocessor too, but I think that's worse).
> That allows us to really look at what we can do differently in a
> sequential-execution scenario -- so going beyond just tweaking one
> atomic op at a time.
> If we really need the single-atomic-op optimizations often, we can
> consider adding something like catomic. That is, if, for example,
> if (as_safe_atomics_faster) x+=1; else atomic_add(&x, 1);
> is needed often, we can add something like catomic.
> We might want to revisit current uses of catomic too.
That's why I would to propose archs to be able to override the atomic in malloc
code with an instruction sequence to check for multithread and use a code
sequence that is non AS-Safe. I know that it will impose an additional
consideration for AS-safeness in malloc code, however since it is not AS-safe
right now this arch-specific modification is safe. And if in the future, if
we even make malloc AS-Safe, this modification will sure need to be reevaluated
or even drop altogether.
But I concede that this is potentially misleading change, so indeed it will need
some discussions. As Ondrej has proposed, the best course of action would rework
malloc internal altogether to add something like thread-caches to avoid lock
contention (even for single-thread); however for *current* code these locks are
real issue for POWER and we from POWER toolchain really want a way to circumvent it,
if possible. I would understand that if has been decided that the engineer cost
does not worth the gains.
>>> malloc is currently marked as Async-Signal-Unsafe, but due to locks
>>> elsewhere in the implementation. The MTSafety docs even talk about what
>>> would be needed to make it AS-Safe. Your patch would prevent making it
>>> AS-Safe. The catomic_add that you replace are AS-Safe, the sequential
>>> code you add is not. Someone added those catomic_* and not sequential
>>> code, so this is at least indication that there's a disconnect here.
>> Another point to make this change arch specific.
> Why? The motivation to use catomic might have been driven by
> arch-specific properties, but you want to replace it with something
> functionally different.
I would like to replace the atomics in malloc code and libc locks/unlock only for POWER,
since they are not currently AS-safe. But as I said, I will concede if community decided
it won't worth the engineer effort.
>> And if we even make malloc
>> functions AS-Safe in the future, the arch maintainer just need to reevaluate
>> this change (since it will either make the optimization not required or make
>> it unsafe).
> Whether malloc is AS-Safe or not should never be arch-specific IMO. If
> we use the constants as I outlined above, the malloc implementer can
> just build code for the different HW performance scenarios, and all the
> archs have to do is set the flags accordingly.
> We could even further reduce the implementation overhead for the AS-Safe
> MT-Unsafe atomics if using the newer GCC builtins for the C11 memory
> model by adding support for as special memory order that just provides
For POWER and single-thread scenario it not the memory barriers along that LL/SC
instruction that are generating latency, but the LL/SC instructions. Even using
C11 memory model won't help in this case.
>>> You also add macros with nondescriptive names, that each have exactly
>>> one use. There's no documentation for how they differ, and you don't
>>> update the comments on the MT-Safe docs.
>>>> x86 probably would kike an atomic.h refactor
>>>> to avoid the double check for '__local_multiple_threads', but that is not
>>>> the aim of this patch.
>>> But your patch does introduce this, so you should take care of this as
>>> well. I guess you wouldn't be happy if someone change x86 without
>>> caring about Power?
>> I will drop this patch and rework on a arch-specific one.
> What do you think about the plan I outlined above?