This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] malloc: Use current (C11-style) atomics for fastbin access
- From: Florian Weimer <fweimer at redhat dot com>
- To: Carlos O'Donell <carlos at redhat dot com>
- Cc: Anton Blanchard <anton at ozlabs dot org>, libc-alpha at sourceware dot org, Tulio Magno Quites Machado Filho <tuliom at linux dot vnet dot ibm dot com>, Paul Clarke <pc at us dot ibm dot com>, Bill Schmidt <wschmidt at us dot ibm dot com>
- Date: Wed, 16 Jan 2019 17:04:16 +0100
- Subject: Re: [PATCH] malloc: Use current (C11-style) atomics for fastbin access
- References: <87va52nupb.fsf@oldenburg.str.redhat.com> <20190116092655.151bdbbd@kryten> <87won5zgz5.fsf@oldenburg2.str.redhat.com> <20190116171838.25c612fd@kryten> <87h8e8zsx3.fsf@oldenburg2.str.redhat.com> <af648b73-4b9a-41e7-df2d-37891ac169a3@redhat.com>
* Carlos O'Donell:
> On 1/16/19 7:30 AM, Florian Weimer wrote:
>> * Anton Blanchard:
>>
>>> Hi Florian,
>>>
>>>>> I see a 16% regression on ppc64le with a simple threaded malloc test
>>>>> case. I guess the C11 atomics aren't as good as what we have in
>>>>> glibc.
>>>>
>>>> Uh-oh. Would you please check if replacing the two
>>>> atomic_load_acquire with atomic_load_relaxed restores the previous
>>>> performance?
>>>
>>> As you suspected, doing this does restore the performance. The two
>>> lwsync barrier instructions must be causing the slowdown.
>>
>> Okay, I'll post a patch to revert that commit.
>>
>> However, the old code had this in _int_malloc (where the arena lock is
>> acquired):
>>
>> #define REMOVE_FB(fb, victim, pp)                       \
>>   do                                                    \
>>     {                                                   \
>>       victim = pp;                                      \
>>       if (victim == NULL)                               \
>>         break;                                          \
>>     }                                                   \
>>   while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim)) \
>>          != victim);
>> …
>> REMOVE_FB (fb, pp, victim);
>>
>> And this in _int_free (without the arena lock):
>>
>>   do
>>     {
>>       /* Check that the top of the bin is not the record we are going to
>>          add (i.e., double free).  */
>>       if (__builtin_expect (old == p, 0))
>>         malloc_printerr ("double free or corruption (fasttop)");
>>       p->fd = old2 = old;
>>     }
>>   while ((old = catomic_compare_and_exchange_val_rel (fb, p, old2))
>>          != old2);
>>
>> I really don't see what makes sure that the store of p->fd happens
>> before the load of victim->fd.
>>
>> It works out on POWER for some reason. I'm attaching a test case that
>> should exercise these two code paths.
>
> This is still not a root cause analysis of the problem.
We do know the root of the performance problem: the additional
synchronization, and, on AArch64, a consequence of what was supposed to
be a minor algorithmic change.
> I think it is OK to revert your patch under the pressure of the upcoming
> release and the fact that this is a serious performance regression.
>
> However, there is some due diligence that needs to be done here, and I
> think the patch should go *right back in* as soon as master opens. We already
> have *lots* of other code using C11-style atomics in glibc, and we want
> to make sure those work as best we can. This issue is either one of two
> things:
>
> * non-C11 atomics are *wrong* for aarch64/power64, and the fastbins can
> eventually become corrupted or, worse, see unbounded growth as we
> corrupt the chunk list.
>
> or
>
> * non-C11 atomics are more optimal than the C11 atomics, and we can adjust
> the C11 atomics to be as good as the non-C11 versions with a little
> help from Arm and IBM.
It's also possible there's a different algorithm that's fast to
implement with C11 atomics.
I'm not a concurrency expert, but I really don't see a way to avoid the
additional barriers with the current algorithm. It would also be worth
checking if fastbins are even a win on these architectures.
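For reference, here is a minimal sketch of the two fastbin CAS loops
expressed with C11 atomics (hypothetical names and simplified types, not
the actual malloc.c code): the release CAS on the free/push side is what
has to pair with an acquire on the malloc/pop side so that the p->fd
store is ordered before the victim->fd load, and that acquire is what
emits the extra lwsync on POWER.

```c
/* Sketch only: a fastbin-style LIFO with C11 atomics.  Types and
   function names are invented for illustration.  */
#include <stdatomic.h>
#include <stddef.h>

struct chunk
{
  struct chunk *fd;
};

/* _int_free side: push P onto the bin top.  The release ordering on
   success publishes the p->fd store to any thread that later pops P
   with an acquire operation.  */
static void
fastbin_push (_Atomic (struct chunk *) *fb, struct chunk *p)
{
  struct chunk *old = atomic_load_explicit (fb, memory_order_relaxed);
  do
    p->fd = old;
  while (!atomic_compare_exchange_weak_explicit
           (fb, &old, p, memory_order_release, memory_order_relaxed));
}

/* _int_malloc side: pop the bin top.  The acquire ordering is what
   makes the subsequent victim->fd read happen after the pusher's
   p->fd store; weakening it to relaxed removes the lwsync on POWER
   but also removes that guarantee.  */
static struct chunk *
fastbin_pop (_Atomic (struct chunk *) *fb)
{
  struct chunk *victim = atomic_load_explicit (fb, memory_order_acquire);
  while (victim != NULL
         && !atomic_compare_exchange_weak_explicit
              (fb, &victim, victim->fd, memory_order_acquire,
               memory_order_acquire))
    ;
  return victim;
}
```

With the arena lock held on the malloc side (as in the old code), it is
not obvious which existing operation provides that acquire/release
pairing, which is the ordering question above.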
Thanks,
Florian