This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH][malloc] Avoid atomics in have_fastchunks


Carlos O'Donell wrote:

> It is great to see someone looking at the details of malloc at an
> atomic-by-atomic level of cost analysis. I know we have looked briefly at
> fastbins and the tradeoff between taking the arena lock (one atomic) and the
> CAS required to put the fastbin back in the list.

Yes, making free lock-free seems to work fine overall. I wonder whether
malloc could do the same for the fastbin paths, since it already has to deal
with free updating the fastbins concurrently.

> You use unadorned loads and stores of the variable av->have_fastchunks, and
> this constitutes a data race which is undefined behaviour in C11.
...
> Please use relaxed MO loads and stores if that is what we need.

I'll do that.

> After you add the relaxed MO loads and stores the comment for have_fastchunks
> will need a little more explicit language about why the relaxed MO loads and
> stores are OK from a P&C perspective.

That's easy, given that multithreaded interleaving already allows all possible
combinations before memory ordering even comes into play - see my reply to
DJ for the long version...

> Does this patch change the number of times malloc_consolidate might
> be called? Do you have any figures on this? That would be a user visible
> change (and require a bug #).

The number of calls isn't fixed as things stand. I'll have a go at hacking the
malloc test to see how much variation there is and whether my patch changes it.


Btw, what is your opinion on how to add generic single-threaded optimizations
that work for all targets? Rather than doing more target hacks, I'd like to add
something similar to what we did with stdio getc/putc, i.e. a high-level check
for the single-threaded case that selects a different code path (with no or
relaxed atomics and no locks in the common cases).

To give an idea of how much this helps: creating a dummy thread that does nothing
slows down x64 malloc/free by 2x (it has jumps that skip the 1-byte lock prefix...).

An alternative would be to move all the fastbin handling into the tcache - but
then I bet it's much easier just to write a fast modern allocator...

Wilco

