This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: The direction of malloc?
- From: Torvald Riegel <triegel at redhat dot com>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>, libc-alpha at sourceware dot org
- Date: Wed, 18 Dec 2013 01:01:24 +0100
- Subject: Re: The direction of malloc?
- References: <52A6A0DA dot 1080109 at redhat dot com> <CANu=Dmi32gwk-hQ3dDbj0d4_gs3FWqt02+NmveXH1p03Vm+Mfg at mail dot gmail dot com> <20131210121622 dot GA5416 at domone dot podge> <52A75502 dot 6040500 at linux dot vnet dot ibm dot com> <20131210210541 dot GA19161 at domone dot podge> <1387213140 dot 23049 dot 8010 dot camel at triegel dot csb> <20131216212334 dot GA21284 at domone dot podge> <1387285197 dot 23049 dot 9075 dot camel at triegel dot csb> <20131217190817 dot GA32756 at domone dot podge>
On Tue, 2013-12-17 at 20:08 +0100, Ondřej Bílka wrote:
> On Tue, Dec 17, 2013 at 01:59:57PM +0100, Torvald Riegel wrote:
> > On Mon, 2013-12-16 at 22:23 +0100, Ondřej Bílka wrote:
> > > Please explain how spinning could improve performance in single thread
> > > applications.
> >
> > You spoke about lockless code, so obviously concurrent code. My comment
> > was thus referring to concurrent code. If you have a single-threaded
> > program, then you can avoid synchronization, obviously (ignoring
> > synchronization just for reentrancy...).
> >
> And for malloc we could use a switch variable to avoid the lock path, setting it
> when pthread_create is called? For reentrancy an ordinary variable suffices.
Depending on the algorithm, even for reentrancy you might need atomic
operations (e.g., to keep what the compiler does with the code under
control, or to use a CAS to avoid pending stores).
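To make that concrete, here is a minimal sketch of the switch-variable
idea (all names are hypothetical, not actual glibc code).  The point
above shows up in the flag read: it should be at least a relaxed atomic
load, so the compiler cannot cache it or invent its own accesses:

    #include <stdatomic.h>
    #include <stddef.h>

    /* Hypothetical backends for the sketch; prototypes only.  */
    extern void *malloc_single_threaded (size_t n);
    extern void *malloc_locked (size_t n);

    /* Hypothetical flag, set once (and never cleared) inside
       pthread_create, before the new thread can run.  */
    static atomic_bool __malloc_multithreaded;

    void *
    my_malloc (size_t n)          /* illustrative wrapper, not glibc's */
    {
      /* A relaxed atomic load is enough to force a real load here.
         Signal-handler reentrancy still needs care on the fast path
         (no half-updated free lists visible to a handler).  */
      if (!atomic_load_explicit (&__malloc_multithreaded,
                                 memory_order_relaxed))
        return malloc_single_threaded (n);  /* no locks, no atomic RMW */
      return malloc_locked (n);             /* the usual locked path */
    }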
> > > >
> > > > > The second problem is that fastbins are per-arena, not per-thread, which
> > > > > forces us to use atomic operations. These are expensive (typically more than 50 cycles).
> > > >
> > > > Especially on x86, atomic operations that *hit in the cache* have become
> > > > very fast compared to their costs in the past. I don't have current
> > > > numbers, but I believe the 50 cycle number is too high; I vaguely
> > > > remember 10-20.
> > >
> > > A simple benchmark could check the real cost. A problem is that while for Core 2
> > > and i7 the cost of a CAS is around 20 cycles, for Bulldozer it is still 50 cycles.
> >
> > Sure, this differs per architecture. But recent Intel CPUs *are*
> > common, so it's not quite correct to say that the latency (or
> > throughput, depending on the algorithm) of an atomic RMW/CAS op is
> > really the primary problem.
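For reference, such a benchmark could be as simple as the sketch below
(my assumptions: x86-64, the GCC __atomic builtins, and rdtsc for
timing; rdtsc counts reference cycles and the loop overhead is
included, so treat the result as indicative only):

    /* Build: gcc -O2 cas-bench.c.  Measures a cache-hot,
       always-successful CAS in a dependency chain.  */
    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>              /* __rdtsc (x86-64) */

    enum { ITERS = 10 * 1000 * 1000 };

    int
    main (void)
    {
      uint64_t var = 0;
      uint64_t start = __rdtsc ();
      for (int i = 0; i < ITERS; i++)
        {
          uint64_t expected = var;      /* always matches: CAS succeeds */
          __atomic_compare_exchange_n (&var, &expected, expected + 1,
                                       0, __ATOMIC_SEQ_CST,
                                       __ATOMIC_SEQ_CST);
        }
      printf ("%.1f cycles per CAS (incl. loop overhead)\n",
              (double) (__rdtsc () - start) / ITERS);
      return 0;
    }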
> >
> > > Even with these, a malloc+free pair contains 5 atomic instructions on the
> > > fastbin path, which gives a 100-cycle penalty (malloc: lock, get item from bin, unlock;
> > > free: set the bit that fastbins are in use, put item in bin).
> >
> > Maybe, but you still need to be careful when drawing conclusions from
> > that because there are more performance effects in allocation than just
> > the "cycle penalty" you mention (e.g., locality and other memory-system
> > effects caused by differences in where data is placed).
> >
> It's not just that: pretty much every modern allocator uses some
> per-thread cache, so it looks like a good idea; oprofile also shows the lock
> cmpxchg instruction as one of the likely culprits. When I try the per-thread
> cache that I posted earlier on a test that does just malloc and free, it
> nearly doubles performance.
>
> There are several factors that come into play: the first is the lack of
> locking, the second is getting the branch expectations correct, the third is
> saving a call to _int_malloc. Then there are several cycles saved by omitting
> the tests for the hooks and malloc_perturb.
>
> A real implementation will be a bit faster, as dynamic TLS slows this down a
> bit.
>
> Memory-system effects are not a factor here, as the allocation pattern is
> identical (a stack in both cases).
I was referring to memory system effects when the data is actually
accessed by an application. It does matter on which page allocated data
ends up (and where on a page relative to other allocations). The speed
of allocations is just one thing. For example, just to illustrate, if
you realloc by copying to a different arena, this can slow down programs
due to NUMA effects if those programs expect the allocation to likely
remain on the same page after a realloc; such effects can be much more
costly than a somewhat slower allocation.
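For concreteness, here is a minimal sketch of the kind of per-thread
cache discussed above (a single size class, illustrative constants;
this is not the patch that was posted).  The hot path has no locks and
no atomic operations, and the __thread accesses are where the
dynamic-TLS cost mentioned earlier would show up:

    #include <stdlib.h>

    #define CLASS_SIZE 64       /* single illustrative size class */
    #define CACHE_MAX  32       /* bound, so memory is not hoarded */

    struct cached { struct cached *next; };

    /* Thread-private, so the hot path needs no locks or atomics.  */
    static __thread struct cached *cache_head;
    static __thread unsigned int cache_count;

    void *
    cached_alloc (void)
    {
      struct cached *p = cache_head;
      if (p != NULL)                 /* hit: pop from the local LIFO */
        {
          cache_head = p->next;
          cache_count--;
          return p;
        }
      return malloc (CLASS_SIZE);    /* miss: fall back to the arena */
    }

    void
    cached_free (void *ptr)
    {
      if (cache_count < CACHE_MAX)   /* push onto the local LIFO */
        {
          struct cached *p = ptr;
          p->next = cache_head;
          cache_head = p;
          cache_count++;
          return;
        }
      free (ptr);                    /* cache full: really free it */
    }

A real version would of course need per-size-class bins, and a way to
drain the cache at thread exit so the memory is not lost to other
threads.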