This is the mail archive of the libc-alpha mailing list for the glibc project.
Re: The direction of malloc?
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Torvald Riegel <triegel at redhat dot com>
- Cc: Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>, libc-alpha at sourceware dot org
- Date: Wed, 18 Dec 2013 11:11:10 +0100
- Subject: Re: The direction of malloc?
- Authentication-results: sourceware.org; auth=none
- References: <52A6A0DA dot 1080109 at redhat dot com> <CANu=Dmi32gwk-hQ3dDbj0d4_gs3FWqt02+NmveXH1p03Vm+Mfg at mail dot gmail dot com> <20131210121622 dot GA5416 at domone dot podge> <52A75502 dot 6040500 at linux dot vnet dot ibm dot com> <20131210210541 dot GA19161 at domone dot podge> <1387213140 dot 23049 dot 8010 dot camel at triegel dot csb> <20131216212334 dot GA21284 at domone dot podge> <1387285197 dot 23049 dot 9075 dot camel at triegel dot csb> <20131217190817 dot GA32756 at domone dot podge> <1387324884 dot 23049 dot 10462 dot camel at triegel dot csb>
On Wed, Dec 18, 2013 at 01:01:24AM +0100, Torvald Riegel wrote:
> On Tue, 2013-12-17 at 20:08 +0100, Ondřej Bílka wrote:
> > On Tue, Dec 17, 2013 at 01:59:57PM +0100, Torvald Riegel wrote:
> > > On Mon, 2013-12-16 at 22:23 +0100, Ondřej Bílka wrote:
> > > > Please explain how spinning could improve performance in single-threaded
> > > > applications.
> > >
> > > You spoke about lockless code, so obviously concurrent code. My comment
> > > was thus referring to concurrent code. If you have a single-threaded
> > > program, then you can avoid synchronization, obviously (ignoring
> > > synchronization just for reentrancy...).
> > >
> > And for malloc we could use a switch variable to avoid the lock path, setting it
> > when pthread_create is called? For reentrancy an ordinary variable suffices.
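The switch-variable idea could look roughly like the sketch below (all names are hypothetical, not glibc's, and the toy bump arena just stands in for the real allocator): stay on a lock-free path until pthread_create is first called, then flip one flag and take the lock from then on.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Flipped before a second thread can exist, so a plain (non-atomic)
   read suffices on the fast path. */
static bool process_is_threaded = false;
static pthread_mutex_t arena_lock = PTHREAD_MUTEX_INITIALIZER;

static char arena[1 << 16];                /* toy bump arena */
static size_t arena_used;

static void *arena_alloc(size_t size)
{
    if (arena_used + size > sizeof arena)
        return NULL;
    void *p = arena + arena_used;
    arena_used += size;
    return p;
}

void *my_malloc(size_t size)
{
    if (!process_is_threaded)              /* single-threaded: no lock, no atomics */
        return arena_alloc(size);
    pthread_mutex_lock(&arena_lock);
    void *p = arena_alloc(size);
    pthread_mutex_unlock(&arena_lock);
    return p;
}

/* A pthread_create wrapper would call this before creating the thread. */
void note_becoming_threaded(void) { process_is_threaded = true; }
```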
> Depending on the algorithm, even for reentrancy you might need atomic
> operations (eg, to keep under control what the compiler does with the
> code, or using a CAS to avoid pending stores).
> > > > >
> > > > > > Second problem is that fastbins are per-arena, not per-thread, which
> > > > > > forces us to use atomic operations. These are expensive (typically more than 50 cycles).
> > > > >
> > > > > Especially on x86, atomic operations that *hit in the cache* have become
> > > > > very fast compared to their costs in the past. I don't have current
> > > > > numbers, but I believe the 50 cycle number is too high; I vaguely
> > > > > remember 10-20.
> > > >
> > > > A simple benchmark could check the real cost. A problem is that while for Core 2
> > > > and i7 the cost of a CAS is around 20 cycles, for Bulldozer it is still 50 cycles.
> > >
> > > Sure, this differs per architecture. But recent Intel CPUs *are*
> > > common, so it's not quite correct to say that the latency (or
> > > throughput, depending on the algorithm) of an atomic RMW/CAS op is
> > > really the primary problem.
> > >
> > > > Even with these, a malloc+free pair contains 5 atomic instructions on the
> > > > fastbin path, which gives a 100-cycle penalty (malloc: lock, get item from bin, unlock;
> > > > free: set the bit that fastbins are in use, put item into bin).
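For reference, the lock-free push/pop on such a bin each cost one CAS; a minimal Treiber-stack sketch of the idea (not glibc's actual code, and ignoring the ABA hazards and size-class checks a real implementation must handle):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Treiber-stack sketch of a fastbin: one CAS per free (push) and
   one CAS per malloc (pop). */
struct chunk { struct chunk *next; };

static _Atomic(struct chunk *) bin = NULL;

void bin_push(struct chunk *c)
{
    struct chunk *head = atomic_load(&bin);
    do
        c->next = head;                    /* link on top of current head */
    while (!atomic_compare_exchange_weak(&bin, &head, c));
}

struct chunk *bin_pop(void)
{
    struct chunk *head = atomic_load(&bin);
    while (head != NULL
           && !atomic_compare_exchange_weak(&bin, &head, head->next))
        ;                                  /* head is reloaded on CAS failure */
    return head;
}
```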
> > >
> > > Maybe, but you still need to be careful when drawing conclusions from
> > > that because there are more performance effects in allocation than just
> > > the "cycle penalty" you mention (e.g., locality and other memory-system
> > > effects caused by differences in where data is placed).
> > >
> > It's not just that; pretty much every modern allocator uses some
> > per-thread cache, so it looks like a good idea, and oprofile also shows the lock
> > cmpxchg instruction as one of the likely culprits. When I try a per-thread
> > cache that I posted earlier on a test that does just malloc and free, it
> > nearly doubles performance.
> > Several factors come into play: the first is the lack of
> > locking, the second is getting the branch expectations correct, the third is saving a call to
> > int_malloc. Then there are several cycles saved by omitting the tests for the hooks
> > and malloc_perturb.
> > A real implementation will be a bit faster, as dynamic TLS slows this down a
> > bit.
> > Memory system effects are not a factor here, as the allocation pattern is
> > identical (a stack in both cases).
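The per-thread cache being discussed reduces to something like this sketch (hypothetical names, and restricted to a single size class for brevity; a real cache keys its bins by size class). The hit path touches no lock and no atomic instruction, which is where the speedup comes from:

```c
#include <stdlib.h>
#include <stddef.h>

#define CACHE_SIZE   16
#define CHUNK_BYTES  32   /* this sketch handles one size class only */

/* Per-thread stack of freed chunks, consulted before the global path. */
static _Thread_local void *cache[CACHE_SIZE];
static _Thread_local int cache_count;

void *cached_malloc(size_t size)
{
    if (size <= CHUNK_BYTES && cache_count > 0)
        return cache[--cache_count];       /* fast hit: pure TLS access */
    return malloc(size < CHUNK_BYTES ? CHUNK_BYTES : size);
}

void cached_free(void *p)
{
    if (cache_count < CACHE_SIZE)
        cache[cache_count++] = p;          /* keep the chunk for this thread */
    else
        free(p);                           /* cache full: slow path */
}
```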
> I was referring to memory system effects when the data is actually
> accessed by an application. It does matter on which page allocated data
> ends up (and where on a page relative to other allocations).
And as it ends up at the same virtual address (the algorithms are identical), this
does not matter here.
> The speed
> of allocations is just one thing. For example, just to illustrate, if
> you realloc by copying to a different arena, this can slow down programs
> due to NUMA effects if those programs expect the allocation to likely
> remain on the same page after a realloc; such effects can be much more
> costly than a somewhat slower allocation.
You cannot optimize code for an unlikely case. When memory is allocated
in thread A and reallocated in thread B there are three cases:
1) Memory is passed to thread B, which primarily accesses it.
2) Memory is shared between A and B.
3) Memory is primarily accessed by thread A.
As the effects of cases 1) and 3) are symmetric, it suffices to estimate which
one is more likely, and case 1 seems the best candidate.
Realloc definitely does move in most cases, as a common usage pattern is
doubling the allocated size, and as we use best fit there is not enough room
to extend in place.
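The doubling pattern in question is ordinary dynamic-array growth; under best fit the block freed by the previous doubling is only half the size now requested, so in-place extension rarely succeeds. A sketch of that pattern (the helper name is mine, not glibc's):

```c
#include <stdlib.h>

/* Double the capacity whenever the buffer fills: the usual growth
   pattern behind most realloc calls. */
size_t grow(void **buf, size_t cap)
{
    size_t newcap = cap ? cap * 2 : 16;
    void *p = realloc(*buf, newcap);
    if (p == NULL)
        return cap;                        /* keep the old buffer on OOM */
    *buf = p;
    return newcap;
}
```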