This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH][malloc] Avoid atomics in have_fastchunks
- From: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
- To: Carlos O'Donell <carlos at redhat dot com>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>, "dj at redhat dot com" <dj at redhat dot com>
- Cc: nd <nd at arm dot com>
- Date: Tue, 19 Sep 2017 21:11:11 +0000
- Subject: Re: [PATCH][malloc] Avoid atomics in have_fastchunks
- References: <DB6PR0801MB2053938869AF403F0D95BC5F83600@DB6PR0801MB2053.eurprd08.prod.outlook.com>,<d152332c-eded-aec7-f03b-efb4903f1670@redhat.com>
Carlos O'Donell wrote:
> It is great to see someone looking at the details of malloc at an atomic-by-
> atomic cost analysis. I know we have looked briefly at fastbins and the
> tradeoff between taking the arena lock (one atomic) and the CAS required to
> put the fastbin back in the list.
Yes, it looks like making free lock-free works fine overall. I wonder whether
malloc can do the same for the fastbin paths, since it already has to deal
with free updating the fastbins concurrently.
> You use unadorned loads and stores of the variable av->have_fastchunks, and
> this constitutes a data race which is undefined behaviour in C11.
...
> Please use relaxed MO loads and stores if that is what we need.
I'll do that.
> After you add the relaxed MO loads and stores the comment for have_fastchunks
> will need a little more explicit language about why the relaxed MO loads and
> stores are OK from a P&C perspective.
That's easy, given that multithreaded interleaving already allows all possible
combinations before memory ordering even comes into play - see my reply to
DJ for the long version...
> Does this patch change the number of times malloc_consolidate might
> be called? Do you have any figures on this? That would be a user visible
> change (and require a bug #).
The number of calls isn't fixed as things stand. I'll have a go at hacking the
malloc test to see how much variation there is and whether my patch changes it.
Btw, what is your opinion on how to add generic single-threaded optimizations
that work for all targets? Rather than doing more target-specific hacks, I'd like
to add something similar to what we did with stdio getc/putc, i.e. a high-level
check for the single-threaded case that takes a different code path (with no or
relaxed atomics and no locks in the common cases).
To give an idea of how much this helps: creating a dummy thread that does nothing
slows down x64 malloc/free by 2x (it has jumps that skip the 1-byte lock prefix...).
An alternative would be to move all the fastbin handling into the t-cache - but
then I bet it's much easier just to write a fast modern allocator...
Wilco