This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] malloc: Use current (C11-style) atomics for fastbin access


On 1/16/19 7:30 AM, Florian Weimer wrote:
> * Anton Blanchard:
> 
>> Hi Florian,
>>
>>>> I see a 16% regression on ppc64le with a simple threaded malloc test
>>>> case. I guess the C11 atomics aren't as good as what we have in
>>>> glibc.  
>>>
>>> Uh-oh.  Would you please check if replacing the two
>>> atomic_load_acquire with atomic_load_relaxed restore the previous
>>> performance?
>>
>> As you suspect, doing this does restore the performance. The two lwsync
>> barrier instructions must be causing the slow down.
> 
> Okay, I'll post a patch to revert that commit.
> 
> However, the old code had this in _int_malloc (where the arena lock is
> acquired):
> 
> #define REMOVE_FB(fb, victim, pp)                       \
>   do                                                    \
>     {                                                   \
>       victim = pp;                                      \
>       if (victim == NULL)                               \
>         break;                                          \
>     }                                                   \
>   while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim)) \
>          != victim);                                    \
> …
>             REMOVE_FB (fb, pp, victim);
> 
> And this in _int_free (without the arena lock):
> 
>       do
>         {
>           /* Check that the top of the bin is not the record we are going to
>              add (i.e., double free).  */
>           if (__builtin_expect (old == p, 0))
>             malloc_printerr ("double free or corruption (fasttop)");
>           p->fd = old2 = old;
>         }
>       while ((old = catomic_compare_and_exchange_val_rel (fb, p, old2))
>              != old2);
> 
> I really don't see what makes sure that the store of p->fd happens
> before the load of victim->fd.
> 
> It works out on POWER for some reason.  I'm attaching a test case that
> should exercise these two code paths.

This is still not a root cause analysis of the problem.

I think it is OK to revert your patch under the pressure of the upcoming
release and the fact that this is a serious performance regression.

However, there is some due diligence that needs to be done here, and I
think the patch should *right back in* as soon as master opens. We already
have *lots* of other code using C11-style atomics in glibc, and we want
to make sure those work as best we can. This issue is either one of two
things:

* non-C11 atomics are *wrong* for aarch64/power64 and the fastbins can
  eventually become corrupted or worse see unbounded growth as we
  corrupt the chunk list.

or

* non-C11 atomics are more optimal for C11 atomics and we can adjust
  the C11 atomics to be as good as the non-C11 versions with a little
  help from Arm and IBM.

We have to be able to prove the fastbin access is correct.

-- 
Cheers,
Carlos.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]