This is the mail archive of the libc-alpha@sourceware.org
mailing list for the glibc project.
Re: The direction of malloc?
- From: Torvald Riegel <triegel at redhat dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>, libc-alpha at sourceware dot org
- Date: Tue, 17 Dec 2013 13:59:57 +0100
- Subject: Re: The direction of malloc?
- Authentication-results: sourceware.org; auth=none
- References: <52A6A0DA dot 1080109 at redhat dot com> <CANu=Dmi32gwk-hQ3dDbj0d4_gs3FWqt02+NmveXH1p03Vm+Mfg at mail dot gmail dot com> <20131210121622 dot GA5416 at domone dot podge> <52A75502 dot 6040500 at linux dot vnet dot ibm dot com> <20131210210541 dot GA19161 at domone dot podge> <1387213140 dot 23049 dot 8010 dot camel at triegel dot csb> <20131216212334 dot GA21284 at domone dot podge>
On Mon, 2013-12-16 at 22:23 +0100, Ondřej Bílka wrote:
> On Mon, Dec 16, 2013 at 05:59:00PM +0100, Torvald Riegel wrote:
> > On Tue, 2013-12-10 at 22:05 +0100, Ondřej Bílka wrote:
> > > We have fastbins that sort of do this, but with several problems.
> > > 1. They are not really lockless; malloc still needs a lock, and only
> > > freeing will be lock-free once bug 15073 gets fixed.
> > Note that *properly tuned* locks can be as fast as lock-less code in
> > several cases. When you use a, say, stock PTHREAD_MUTEX_NORMAL lock,
> > this won't be well-suited to short critical sections because it doesn't
> > spin. PTHREAD_MUTEX_ADAPTIVE_NP spins, but the number of spinning
> > iterations seems to be an arbitrary choice, and isn't very useful in my
> > experiments so far. Also, there's no back-off whatsoever in the
> > spinning.
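To make the spinning and back-off point concrete, here is a minimal sketch (not glibc code; names and tuning constants are invented for illustration) of a test-and-test-and-set lock with a bounded, exponentially growing spin phase. The spin cap and back-off factor are arbitrary placeholders, which is exactly the tuning problem described above:

```c
/* Hypothetical sketch of a spin-then-yield lock with exponential
   back-off.  The cap of 1024 iterations is an arbitrary choice, just
   like the arbitrary spin count criticized above. */
#include <stdatomic.h>
#include <sched.h>

typedef struct { atomic_int locked; } backoff_lock_t;

static void backoff_lock(backoff_lock_t *l)
{
    int spins = 1;
    for (;;) {
        /* Cheap read first so waiters don't hammer the cache line
           with atomic RMWs while the lock is held. */
        if (atomic_load_explicit(&l->locked, memory_order_relaxed) == 0
            && atomic_exchange_explicit(&l->locked, 1,
                                        memory_order_acquire) == 0)
            return;
        if (spins < 1024) {              /* arbitrary spin cap */
            for (int i = 0; i < spins; i++)
                __asm__ volatile("" ::: "memory");  /* keep the loop */
            spins *= 2;                  /* exponential back-off */
        } else {
            sched_yield();               /* stop spinning, yield CPU */
        }
    }
}

static void backoff_unlock(backoff_lock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

In a real implementation the spin phase would use a pause/cpu_relax instruction and eventually block on a futex rather than yield.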
> As was said, acquiring a lock is the hot part in a single thread
You said you wanted lock-less code. You didn't say you wanted
nonconcurrent code. Thus there is concurrency, and you need to
synchronize in some way. If it involves modifications, you typically
need a CAS or some other RMW op, independent of whether you use a lock
or not. Thus, the lock can have as many atomic RMW/CAS ops as other
concurrent code, or fewer. That's what I pointed out.
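As an illustration of that point, the sketch below (hypothetical types and function names, C11 atomics) shows that a lock-free Treiber-stack push and an uncontended lock-based push each cost one atomic RMW on the fast path:

```c
/* Illustrative only: both push variants below perform exactly one
   atomic RMW on their uncontended fast path. */
#include <stdatomic.h>
#include <stddef.h>

typedef struct node { struct node *next; } node_t;

/* Lock-free (Treiber) push: one CAS per successful operation. */
static void lockfree_push(_Atomic(node_t *) *top, node_t *n)
{
    node_t *old = atomic_load_explicit(top, memory_order_relaxed);
    do {
        n->next = old;   /* 'old' is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak_explicit(
                 top, &old, n,
                 memory_order_release, memory_order_relaxed));
}

/* Lock-based push: also one atomic RMW (the lock acquisition) when
   uncontended; the unlock is a plain release store. */
static void locked_push(atomic_int *lock, node_t **top, node_t *n)
{
    int expected = 0;
    while (!atomic_compare_exchange_weak_explicit(
               lock, &expected, 1,
               memory_order_acquire, memory_order_relaxed))
        expected = 0;
    n->next = *top;
    *top = n;
    atomic_store_explicit(lock, 0, memory_order_release);
}
```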
> Please explain how spinning could improve performance in a single thread
You spoke about lockless code, so obviously concurrent code. My comment
was thus referring to concurrent code. If you have a single-threaded
program, then you can avoid synchronization, obviously (ignoring
synchronization just for reentrancy...).
> > > The second problem is that fastbins are per-arena, not per-thread, which
> > > forces us to use atomic operations. These are expensive (typically more than 50 cycles).
> > Especially on x86, atomic operations that *hit in the cache* have become
> > very fast compared to their costs in the past. I don't have current
> > numbers, but I believe the 50 cycle number is too high; I vaguely
> > remember 10-20.
> A simple benchmark could check the real cost. A problem is that while for Core 2
> and i7 the cost of a CAS is around 20 cycles, for Bulldozer it's still 50 cycles.
Sure, this differs per architecture. But recent Intel CPUs *are*
common, so it's not quite correct to say that the latency (or
throughput, depending on the algorithm) of an atomic RMW/CAS op is
really the primary problem.
> Even with these, a malloc+free pair contains 5 atomic instructions on the
> fastbin path, which gives a 100-cycle penalty (malloc: lock, get item from bin, unlock;
> free: set bit that fastbins are in use, put item in bin)
Maybe, but you still need to be careful when drawing conclusions from
that because there are more performance effects in allocation than just
the "cycle penalty" you mention (e.g., locality and other memory-system
effects caused by differences in where data is placed).
BTW, in general, you can't just add "cycles" like that unless you have a
very simple CPU.