Re: The direction of malloc?
- From: Torvald Riegel <triegel at redhat dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>, libc-alpha at sourceware dot org
- Date: Mon, 16 Dec 2013 17:59:00 +0100
- Subject: Re: The direction of malloc?
On Tue, 2013-12-10 at 22:05 +0100, Ondřej Bílka wrote:
> We have fastbins that sorta do this but with several problems.
> 1. They are not really lockless: malloc still needs a lock; only
> freeing will be lockless once bug 15073 gets fixed.
Note that *properly tuned* locks can be as fast as lock-less code in
several cases.  A stock PTHREAD_MUTEX_NORMAL lock, say, is not well
suited to short critical sections because it doesn't spin at all.
PTHREAD_MUTEX_ADAPTIVE_NP does spin, but the number of spinning
iterations seems to be an arbitrary choice, and it hasn't been very
useful in my experiments so far.  Also, there's no back-off whatsoever
in the spinning.
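
For reference, this is roughly how an application selects the adaptive
mutex today (just a minimal sketch; PTHREAD_MUTEX_ADAPTIVE_NP needs
_GNU_SOURCE, and the spin count is not tunable from the application
side):

#define _GNU_SOURCE
#include <pthread.h>

static pthread_mutex_t m;

static void
init_adaptive_mutex (void)
{
  pthread_mutexattr_t attr;

  pthread_mutexattr_init (&attr);
  /* Ask for glibc's spinning mutex; the spin count is fixed inside
     glibc and cannot be changed here.  */
  pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
  pthread_mutex_init (&m, &attr);
  pthread_mutexattr_destroy (&attr);
}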
Before going off and replacing concurrent code that uses our stock locks
with lockless concurrent code, I'd suggest trying different (or
differently tuned) locks first.  Probably start with a full spinlock and
add some back-off to try to get the cache misses down, roughly along the
lines of the sketch below.
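
To make that concrete, here is a minimal sketch of the kind of lock I
mean: a test-and-test-and-set spinlock with exponential back-off.  It
uses the GCC __atomic builtins and the x86 "pause" hint; it is only
meant to illustrate the idea, not a proposed patch:

#include <stdint.h>

static volatile int lock_word = 0;

static inline void
cpu_relax (void)
{
#if defined __x86_64__ || defined __i386__
  __asm__ __volatile__ ("pause");
#endif
}

static void
spin_lock_backoff (void)
{
  unsigned int backoff = 1;

  for (;;)
    {
      /* Try to grab the lock with a single atomic exchange.  */
      if (!__atomic_exchange_n (&lock_word, 1, __ATOMIC_ACQUIRE))
        return;
      /* Wait until the lock looks free before retrying the atomic
         operation, and back off exponentially to reduce the cache-line
         ping-pong between spinning threads.  */
      while (__atomic_load_n (&lock_word, __ATOMIC_RELAXED) != 0)
        {
          for (unsigned int i = 0; i < backoff; i++)
            cpu_relax ();
          if (backoff < 1024)
            backoff *= 2;
        }
    }
}

static void
spin_unlock (void)
{
  __atomic_store_n (&lock_word, 0, __ATOMIC_RELEASE);
}

The back-off bound (1024 here) is exactly the kind of knob that needs
tuning and measurement for the workload at hand.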
If the concurrent code is mostly running on its own, an efficient lock
implementation can be faster than an equivalent lockless implementation,
especially if all calls modify memory anyway and don't just need to read
lots of data that isn't protected by the single lock that also protects
the data to be modified.
> The second problem is that fastbins are per-arena, not per-thread,
> which forces us to use atomic operations.  These are expensive
> (typically more than 50 cycles).
Especially on x86, atomic operations that *hit in the cache* have become
very fast compared to their costs in the past. I don't have current
numbers, but I believe the 50 cycle number is too high; I vaguely
remember 10-20.
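
If someone wants to re-check that on a particular machine, a rough
microbenchmark along the following lines measures the cache-hot,
uncontended case (only a sketch: rdtsc counts reference cycles, and the
exact result will vary by CPU):

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

int
main (void)
{
  volatile int counter = 0;
  enum { ITERS = 1000000 };
  uint64_t start, end;

  start = __rdtsc ();
  for (int i = 0; i < ITERS; i++)
    /* One locked read-modify-write per iteration ("lock xadd" on x86).  */
    __atomic_fetch_add (&counter, 1, __ATOMIC_SEQ_CST);
  end = __rdtsc ();

  printf ("~%.1f cycles per atomic fetch-add (cache-hot, uncontended)\n",
          (double) (end - start) / ITERS);
  return 0;
}

Contended atomics, i.e. when the cache line has to be transferred
between cores, are of course much more expensive than the cache-hit
case.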