This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] malloc: Use current (C11-style) atomics for fastbin access
On 16/01/2019 16:30, Carlos O'Donell wrote:
> On 1/16/19 11:04 AM, Florian Weimer wrote:
>> * Carlos O'Donell:
>>
>>> On 1/16/19 7:30 AM, Florian Weimer wrote:
>>>> * Anton Blanchard:
>>>>
>>>>> Hi Florian,
>>>>>
>>>>>>> I see a 16% regression on ppc64le with a simple threaded malloc test
>>>>>>> case. I guess the C11 atomics aren't as good as what we have in
>>>>>>> glibc.
>>>>>>
>>>>>> Uh-oh. Would you please check if replacing the two
>>>>>> atomic_load_acquire with atomic_load_relaxed restores the previous
>>>>>> performance?
>>>>>
>>>>> As you suspect, doing this does restore the performance. The two lwsync
>>>>> barrier instructions must be causing the slow down.
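The change under discussion can be sketched with C11 stdatomic (illustrative names, not glibc code; `fastbin_top` here is a hypothetical stand-in for a fastbin head). On POWER, the acquire load is typically compiled as a plain load followed by an lwsync barrier, while the relaxed load is just the load:

```c
#include <stdatomic.h>

/* Illustrative sketch, not glibc code: the two load flavors under
   discussion.  On POWER, the acquire variant typically costs an extra
   lwsync barrier after the load; the relaxed variant is a plain load.  */
static _Atomic(void *) fastbin_top;   /* hypothetical fastbin head */

void *
load_acquire (void)
{
  return atomic_load_explicit (&fastbin_top, memory_order_acquire);
}

void *
load_relaxed (void)
{
  return atomic_load_explicit (&fastbin_top, memory_order_relaxed);
}
```

Going back from acquire to relaxed removes the barrier, but also removes the ordering guarantee, which is the correctness question the rest of the thread turns on.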
>>>>
>>>> Okay, I'll post a patch to revert that commit.
>>>>
>>>> However, the old code had this in _int_malloc (where the arena lock is
>>>> acquired):
>>>>
>>>> #define REMOVE_FB(fb, victim, pp)					\
>>>>   do								\
>>>>     {								\
>>>>       victim = pp;							\
>>>>       if (victim == NULL)						\
>>>> 	break;								\
>>>>     }								\
>>>>   while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim)) \
>>>> 	 != victim);
>>>> …
>>>> REMOVE_FB (fb, pp, victim);
>>>>
>>>> And this in _int_free (without the arena lock):
>>>>
>>>> do
>>>>   {
>>>>     /* Check that the top of the bin is not the record we are going to
>>>>        add (i.e., double free).  */
>>>>     if (__builtin_expect (old == p, 0))
>>>>       malloc_printerr ("double free or corruption (fasttop)");
>>>>     p->fd = old2 = old;
>>>>   }
>>>> while ((old = catomic_compare_and_exchange_val_rel (fb, p, old2))
>>>>        != old2);
>>>>
>>>> I really don't see what makes sure that the store of p->fd happens
>>>> before the load of victim->fd.
>>>>
>>>> It works out on POWER for some reason. I'm attaching a test case that
>>>> should exercise these two code paths.
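The pattern in the two excerpts can be sketched as a Treiber-style stack with C11 atomics; this is a simplified, hypothetical stand-in for the fastbin list, not the actual glibc code. The release CAS on the free side is what would have to pair with an acquire operation on the allocation side so that the store to p->fd happens before the load of victim->fd:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical sketch of the fastbin push/pop pattern as a Treiber
   stack.  The release CAS in push pairs with the acquire operations in
   pop: if pop observes a node pushed by another thread, the push's
   store to p->fd is guaranteed to happen before pop's load of
   victim->fd.  */
struct node { struct node *fd; };

static _Atomic(struct node *) top;

void
push (struct node *p)                 /* corresponds to _int_free */
{
  struct node *old = atomic_load_explicit (&top, memory_order_relaxed);
  do
    p->fd = old;                      /* like p->fd = old2 = old */
  while (!atomic_compare_exchange_weak_explicit (&top, &old, p,
						 memory_order_release,
						 memory_order_relaxed));
}

struct node *
pop (void)                            /* corresponds to REMOVE_FB */
{
  struct node *victim = atomic_load_explicit (&top, memory_order_acquire);
  while (victim != NULL
	 && !atomic_compare_exchange_weak_explicit (&top, &victim,
						    victim->fd,
						    memory_order_acquire,
						    memory_order_acquire))
    ;
  return victim;
}
```

In the old code, pop's CAS was catomic_compare_and_exchange_val_acq, so the open question above is what (if anything) orders the initial read of the list head.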
>>>
>>> This is still not a root cause analysis of the problem.
>>
>> We know the root of the performance problem, the additional
>> synchronization, and, on AArch64, a consequence of what was supposed to
>> be a minor algorithmic change.
>
> OK, I hadn't seen an analysis of the instruction sequences, so I wasn't
> sure if this was completely understood.
>
> This raises the question: is the existing code correct for AArch64 and POWER64?
>
> I guess we don't know without serious review.
>
>>> I think it is OK to revert your patch under the pressure of the upcoming
>>> release and the fact that this is a serious performance regression.
>>>
>>> However, there is some due diligence that needs to be done here, and I
>>> think the patch should *right back in* as soon as master opens. We already
>>> have *lots* of other code using C11-style atomics in glibc, and we want
>>> to make sure those work as best we can. This issue is either one of two
>>> things:
>>>
>>> * non-C11 atomics are *wrong* for aarch64/power64 and the fastbins can
>>>   eventually become corrupted or, worse, see unbounded growth as we
>>>   corrupt the chunk list.
>>>
>>> or
>>>
>>> * non-C11 atomics are faster than the C11 atomics, and we can adjust
>>>   the C11 atomics to be as good as the non-C11 versions with a little
>>>   help from Arm and IBM.
>>
>> It's also possible there's a different algorithm that's fast to
>> implement with C11 atomics.
>
> Or the existing C11-style atomics for AArch64 and POWER64 are pessimistic
> and we might do better.
>
>> I'm not a concurrency expert, but I really don't see a way to avoid the
>> additional barriers with the current algorithm. It would also be worth
>> checking if fastbins are even a win on these architectures.
>
> Right, it needs a review.
>
> At least on x86_64 when we did the tcache vs. fastbin testing the addition
> of fastbin was still a win.
>
> We'd have to run the simulator and workload on AArch64 to see, which we
> can ask Arm to do... and IBM for POWER64.
>
We will look into the regression.
It is not yet clear what causes it.