
Re: [PATCH] malloc: Use current (C11-style) atomics for fastbin access


On 1/16/19 11:52 AM, Szabolcs Nagy wrote:
> On 16/01/2019 16:30, Carlos O'Donell wrote:
>> On 1/16/19 11:04 AM, Florian Weimer wrote:
>>> * Carlos O'Donell:
>>>
>>>> On 1/16/19 7:30 AM, Florian Weimer wrote:
>>>>> * Anton Blanchard:
>>>>>
>>>>>> Hi Florian,
>>>>>>
>>>>>>>> I see a 16% regression on ppc64le with a simple threaded malloc test
>>>>>>>> case. I guess the C11 atomics aren't as good as what we have in
>>>>>>>> glibc.  
>>>>>>>
>>>>>>> Uh-oh.  Would you please check if replacing the two
>>>>>>> atomic_load_acquire with atomic_load_relaxed restore the previous
>>>>>>> performance?
>>>>>>
>>>>>> As you suspect, doing this does restore the performance. The two lwsync
>>>>>> barrier instructions must be causing the slow down.
>>>>>
>>>>> Okay, I'll post a patch to revert that commit.
>>>>>
>>>>> However, the old code had this in _int_malloc (where the arena lock is
>>>>> acquired):
>>>>>
>>>>> #define REMOVE_FB(fb, victim, pp)                       \
>>>>>   do                                                    \
>>>>>     {                                                   \
>>>>>       victim = pp;                                      \
>>>>>       if (victim == NULL)                               \
>>>>>         break;                                          \
>>>>>     }                                                   \
>>>>>   while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim)) \
>>>>>          != victim);                                    \
>>>>> …
>>>>>             REMOVE_FB (fb, pp, victim);
>>>>>
>>>>> And this in _int_free (without the arena lock):
>>>>>
>>>>>       do
>>>>>         {
>>>>>           /* Check that the top of the bin is not the record we are going to
>>>>>              add (i.e., double free).  */
>>>>>           if (__builtin_expect (old == p, 0))
>>>>>             malloc_printerr ("double free or corruption (fasttop)");
>>>>>           p->fd = old2 = old;
>>>>>         }
>>>>>       while ((old = catomic_compare_and_exchange_val_rel (fb, p, old2))
>>>>>              != old2);
>>>>>
>>>>> I really don't see what makes sure that the store of p->fd happens
>>>>> before the load of victim->fd.
>>>>>
>>>>> It works out on POWER for some reason.  I'm attaching a test case that
>>>>> should exercise these two code paths.
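[ To make the intended pairing explicit, here is a minimal stand-alone
sketch of the fastbin push/pop protocol in plain C11 <stdatomic.h>
(not the glibc-internal wrappers; the double-free check and the other
sanity checks are omitted).  The release CAS on the free side has to
pair with an acquire on the malloc side so that the store of p->fd
happens before the load of victim->fd:

    #include <stdatomic.h>
    #include <stddef.h>

    struct chunk { struct chunk *fd; };

    /* Free path (no arena lock): push E onto the fastbin *FB.  The
       release order publishes the e->fd store to the popping thread.  */
    static void
    fastbin_push (struct chunk *_Atomic *fb, struct chunk *e)
    {
      struct chunk *old = atomic_load_explicit (fb, memory_order_relaxed);
      do
        e->fd = old;
      while (!atomic_compare_exchange_weak_explicit
               (fb, &old, e, memory_order_release, memory_order_relaxed));
    }

    /* Malloc path (in glibc this runs with the arena lock held): pop
       the top chunk.  The acquire orders the victim->fd read after the
       pushing thread's e->fd store.  */
    static struct chunk *
    fastbin_pop (struct chunk *_Atomic *fb)
    {
      struct chunk *victim = atomic_load_explicit (fb, memory_order_acquire);
      while (victim != NULL
             && !atomic_compare_exchange_weak_explicit
                  (fb, &victim, victim->fd,
                   memory_order_acquire, memory_order_acquire))
        ;
      return victim;
    }

If both sides run with relaxed ordering only, nothing in the C11 model
orders those two accesses, which is exactly the gap being asked about
above. ]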
>>>>
>>>> This is still not a root cause analysis of the problem.
>>>
>>> We know the root of the performance problem: the additional
>>> synchronization and, on Aarch64, a consequence of what was supposed to
>>> be a minor algorithmic change.
>>
>> OK, I hadn't seen an analysis of the instruction sequences, so I wasn't
>> sure if this was completely understood.
>>
>> This raises the question: is the existing code correct for Aarch64 and POWER64?
>>
>> I guess we don't know without serious review.
>>
>>>> I think it is OK to revert your patch under the pressure of the upcoming
>>>> release and the fact that this is a serious performance regression.
>>>>
>>>> However, there is some due diligence that needs to be done here, and I
>>>> think the patch should go *right back in* as soon as master opens. We already
>>>> have *lots* of other code using C11-style atomics in glibc, and we want
>>>> to make sure those work as well as they can. This issue is one of two
>>>> things:
>>>>
>>>> * non-C11 atomics are *wrong* for aarch64/power64 and the fastbins can
>>>>   eventually become corrupted or, worse, see unbounded growth as we
>>>>   corrupt the chunk list.
>>>>
>>>> or
>>>>
>>>> * non-C11 atomics are more optimal than C11 atomics and we can adjust
>>>>   the C11 atomics to be as good as the non-C11 versions with a little
>>>>   help from Arm and IBM.
>>>
>>> It's also possible there's a different algorithm that's fast to
>>> implement with C11 atomics.
>>
>> Or the existing C11-style atomics for Aarch64 and POWER64 are pessimistic
>> and we might do better.
>>
>>> I'm not a concurrency expert, but I really don't see a way to avoid the
>>> additional barriers with the current algorithm.  It would also be worth
>>> checking if fastbins are even a win on these architectures.
>>
>> Right, it needs a review.
>>
>> At least on x86_64, when we did the tcache vs. fastbin testing, the
>> addition of fastbins was still a win.
>>
>> We'd have to run the simulator and workload on Aarch64 to see, which we
>> can ask Arm to do... and IBM for POWER64.
>>
> 
> We will look into the regression.
> 
> It is not yet clear what causes it.
> 

Awesome. Thank you for that. To be clear, my worry is that we're using
C11-style atomics all over the place in glibc now, and if they are not
correct, or are really slow, we should understand why.

This particular change is OK for me; I fully understand you want
the performance back to the level it was at before.

-- 
Cheers,
Carlos.

