This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: The direction of malloc?
- From: OndÅej BÃlka <neleai at seznam dot cz>
- To: Torvald Riegel <triegel at redhat dot com>
- Cc: Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>, libc-alpha at sourceware dot org
- Date: Mon, 16 Dec 2013 22:23:34 +0100
- Subject: Re: The direction of malloc?
- Authentication-results: sourceware.org; auth=none
- References: <52A6A0DA dot 1080109 at redhat dot com> <CANu=Dmi32gwk-hQ3dDbj0d4_gs3FWqt02+NmveXH1p03Vm+Mfg at mail dot gmail dot com> <20131210121622 dot GA5416 at domone dot podge> <52A75502 dot 6040500 at linux dot vnet dot ibm dot com> <20131210210541 dot GA19161 at domone dot podge> <1387213140 dot 23049 dot 8010 dot camel at triegel dot csb>
On Mon, Dec 16, 2013 at 05:59:00PM +0100, Torvald Riegel wrote:
> On Tue, 2013-12-10 at 22:05 +0100, OndÅej BÃlka wrote:
> > We have fastbins that sorta do this but with several problems.
> > 1. They are not really lockless, for malloc they need a lock, only
> > freeing will be when bug 15073 gets fixed.
>
> Note that *properly tuned* locks can be as fast as lock-less code in
> several cases. When you use a, say, stock PTHREAD_MUTEX_NORMAL lock,
> this won't be well-suited to short critical sections because it doesn't
> spin. PTHREAD_MUTEX_ADAPTIVE_NP spins, but the number of spinning
> iterations seems to be an arbitrary choice, and isn't very useful in my
> experiments so far. Also, there's no back-off whatsoever in the
> spinning.
>
As will said acquiring a lock is hot part in a single thread
applications.
Please explain how spinning could improve performance in single thread
applications.
>
> > Second problem is that fastbins are per-arena not per-thread which
> > forces us to use atomic operations. These are expensive (typicaly more than 50 cycles).
>
> Especially on x86, atomic operations that *hit in the cache* have become
> very fast compared to their costs in the past. I don't have current
> numbers, but I believe the 50 cycle number is too high; I vaguely
> remember 10-20.
A simple benchmark could check a real cost. A problem is that while for core2
and i7 cost of CAS is around 20 cycles for bulldozer its still 50 cycles.
Even with these a malloc+free pair contains 5 atomic instructions on
fastbin path which gives 100 cycle penalty (malloc: lock, get item from bin, unlock,
free: set bit that fastbins are in use, put item to bin)