Re: The direction of malloc?


On Tue, Dec 17, 2013 at 01:59:57PM +0100, Torvald Riegel wrote:
> On Mon, 2013-12-16 at 22:23 +0100, Ondřej Bílka wrote:
> > On Mon, Dec 16, 2013 at 05:59:00PM +0100, Torvald Riegel wrote:
> > > On Tue, 2013-12-10 at 22:05 +0100, Ondřej Bílka wrote:
> > > > We have fastbins that sort of do this, but with several problems.
> > > > 1. They are not really lockless: for malloc they need a lock; only
> > > > freeing will be lockless once bug 15073 gets fixed.
> > > 
> > > Note that *properly tuned* locks can be as fast as lock-less code in
> > > several cases.  When you use a, say, stock PTHREAD_MUTEX_NORMAL lock,
> > > this won't be well-suited to short critical sections because it doesn't
> > > spin.  PTHREAD_MUTEX_ADAPTIVE_NP spins, but the number of spinning
> > > iterations seems to be an arbitrary choice, and isn't very useful in my
> > > experiments so far.  Also, there's no back-off whatsoever in the
> > > spinning.
> > >
> > As Will said, acquiring a lock is a hot spot even in single-threaded
> > applications.
> 
> You said you wanted lock-less code.  You didn't say you want
> nonconcurrent code.  Thus there is concurrency, and you need to
> synchronize in some way.  If it involves modifications, you need a CAS
> or other RMW op of some kind, typically, independent of whether you use
> a lock or not.  Thus, the lock can have as many atomic RMW/CAS ops as
> other concurrent code, or fewer.  That's what I pointed out.
> 
There are different degrees of synchronization that may be required. In
malloc the sources of concurrency are having many threads, a free from a
different thread, and a realloc from a different thread.

The first one could be handled by a cache. For free we could make the
synchronization overhead arbitrarily low by waiting until we accumulate
n bytes for a given arena and only then returning them; and I think that
for realloc the best course of action is to just copy the data to our
own arena, as I posted earlier.
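
To make the free side concrete, here is a minimal sketch of such a
deferred free, with hypothetical chunk_size() and arena_free_batch()
helpers standing in for the real arena code (and queuing per thread
rather than per arena, for brevity):

#include <stddef.h>

#define FLUSH_BYTES (64 * 1024)

struct free_item {
    struct free_item *next;
};

extern size_t chunk_size(void *p);                     /* hypothetical */
extern void arena_free_batch(struct free_item *list);  /* hypothetical */

static __thread struct free_item *pending;
static __thread size_t pending_bytes;

void deferred_free(void *p)
{
    struct free_item *it = p;        /* reuse the dead chunk as a link */
    it->next = pending;
    pending = it;
    pending_bytes += chunk_size(p);

    if (pending_bytes >= FLUSH_BYTES) {
        arena_free_batch(pending);   /* one lock acquisition for n frees */
        pending = NULL;
        pending_bytes = 0;
    }
}

The synchronization cost per free then drops with n, at the price of up
to FLUSH_BYTES of memory being temporarily unavailable to other threads.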

> > Please explain how spinning could improve performance in single thread
> > applications.
> 
> You spoke about lockless code, so obviously concurrent code.  My comment
> was thus referring to concurrent code.  If you have a single-threaded
> program, then you can avoid synchronization, obviously (ignoring
> synchronization just for reentrancy...).
>
Could we then for malloc use a switch variable to avoid the lock path,
and set it when pthread_create is called? For reentrancy an ordinary
variable suffices.
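
A sketch of the switch variable, assuming hypothetical
fast_single_thread_malloc() and locked_malloc() paths; inside glibc the
flag would simply be set in pthread_create itself, the dlsym
interposition below is only how an LD_PRELOAD experiment would do it:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <stddef.h>

extern void *fast_single_thread_malloc(size_t n);   /* hypothetical */
extern void *locked_malloc(size_t n);               /* hypothetical */

/* Ordinary variable: it is flipped before a second thread can exist,
   so no thread can ever observe a stale value.  */
static int multithreaded;

int pthread_create(pthread_t *thr, const pthread_attr_t *attr,
                   void *(*fn)(void *), void *arg)
{
    static int (*real)(pthread_t *, const pthread_attr_t *,
                       void *(*)(void *), void *);
    if (real == NULL)
        real = dlsym(RTLD_NEXT, "pthread_create");
    multithreaded = 1;               /* switch before the thread runs */
    return real(thr, attr, fn, arg);
}

void *malloc(size_t n)
{
    if (!multithreaded)
        return fast_single_thread_malloc(n);  /* no lock, no atomics */
    return locked_malloc(n);
}

The tricky part is of course that chunks handed out by the fast path
must still be freeable after the switch.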

> > > 
> > > > A second problem is that fastbins are per-arena, not per-thread, which
> > > > forces us to use atomic operations. These are expensive (typically more than 50 cycles).
> > > 
> > > Especially on x86, atomic operations that *hit in the cache* have become
> > > very fast compared to their costs in the past.  I don't have current
> > > numbers, but I believe the 50 cycle number is too high; I vaguely
> > > remember 10-20.
> > 
> > A simple benchmark could check the real cost. A problem is that while
> > for Core 2 and i7 the cost of a CAS is around 20 cycles, for Bulldozer
> > it is still 50 cycles.
> 
> Sure, this differs per architecture.  But recent Intel CPUs *are*
> common, so it's not quite correct to say that the latency (or
> throughput, depending on the algorithm) of an atomic RMW/CAS op is
> really the primary problem.
> 
> > Even with these, a malloc+free pair contains 5 atomic instructions on
> > the fastbin path, which gives a 100-cycle penalty (malloc: lock, get
> > item from bin, unlock; free: set bit that fastbins are in use, put
> > item into bin).
> 
> Maybe, but you still need to be careful when drawing conclusions from
> that because there are more performance effects in allocation than just
> the "cycle penalty" you mention (e.g., locality and other memory-system
> effects caused by differences in where data is placed).
>
It's not just that. Pretty much every modern allocator uses some
per-thread cache, so it looks like a good idea, and oprofile also shows
the lock cmpxchg instruction as one of the likely culprits. When I try
the per-thread cache that I posted earlier on a test that does just
malloc and free, it nearly doubles performance.

There are several factors that come into play: the first is the lack of
locking, the second is getting the branch expectations correct, and the
third is saving a call to _int_malloc. Then there are several cycles
saved by omitting the tests for hooks and malloc_perturb.

A real implementation would be a bit faster still, as dynamic TLS slows
this preloaded version down a bit.

Memory-system effects are not a factor here, as the allocation pattern
is identical (stack-like in both cases).
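
Since the attachments are not inlined here, the following is only an
illustrative sketch of the general shape of such a per-thread cache,
not the attached malloc_cache.c itself: freed chunks are pushed onto
TLS bins and popped on the next malloc with no lock and no atomic
instruction (eviction and initialization corner cases omitted):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <malloc.h>    /* malloc_usable_size */
#include <stddef.h>

#define BINS 32
#define STEP 16

static __thread void *bins[BINS];   /* singly linked LIFO free lists */

static void *(*libc_malloc)(size_t);
static void (*libc_free)(void *);

static void init(void)
{
    libc_malloc = dlsym(RTLD_NEXT, "malloc");
    libc_free   = dlsym(RTLD_NEXT, "free");
}

void *malloc(size_t n)
{
    if (libc_malloc == NULL)
        init();
    size_t bin = n ? (n - 1) / STEP : 0;
    if (bin < BINS && bins[bin] != NULL) {  /* hit: no lock, no CAS */
        void *p = bins[bin];
        bins[bin] = *(void **)p;            /* pop */
        return p;
    }
    return libc_malloc(n);
}

void free(void *p)
{
    if (p == NULL)
        return;
    if (libc_free == NULL)
        init();
    /* A chunk with usable size u can serve any bin below u / STEP.  */
    size_t bin = malloc_usable_size(p) / STEP;
    if (bin > 0 && bin <= BINS) {
        *(void **)p = bins[bin - 1];        /* push, chunk stores link */
        bins[bin - 1] = p;
        return;
    }
    libc_free(p);
}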

For the benchmark use the following commands:

$ gcc test.c -O3
$ gcc malloc_cache.c -O3 -fPIC -shared -mtls-dialect=gnu2 -lpthread -ldl -o malloc_cache.so
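
The attached test.c is not inlined either; a driver of roughly this
shape (purely illustrative, not the attachment) exercises the measured
path, a tight malloc/free loop with a stack-like allocation pattern:

#include <stdlib.h>

int main(void)
{
    for (int i = 0; i < 10000000; i++) {
        void *p = malloc(16 + (i % 64));
        if (p == NULL)
            return 1;
        free(p);
    }
    /* The runs below report "non-zero status 32"; the exit status of
       the attached test is incidental, the timing lines are what
       matters.  */
    return 32;
}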

$ /usr/bin/time ./a.out
Command exited with non-zero status 32
0.29user 0.00system 0:00.29elapsed 99%CPU (0avgtext+0avgdata 512maxresident)k
0inputs+0outputs (0major+168minor)pagefaults 0swaps
$ /usr/bin/time ./a.out
Command exited with non-zero status 32
0.30user 0.00system 0:00.30elapsed 99%CPU (0avgtext+0avgdata 512maxresident)k
0inputs+0outputs (0major+168minor)pagefaults 0swaps
$ /usr/bin/time ./a.out
Command exited with non-zero status 32
0.30user 0.00system 0:00.31elapsed 99%CPU (0avgtext+0avgdata 512maxresident)k
0inputs+0outputs (0major+168minor)pagefaults 0swaps

$ LD_PRELOAD=./malloc_cache.so /usr/bin/time ./a.out
Command exited with non-zero status 32
0.19user 0.04system 0:00.23elapsed 98%CPU (0avgtext+0avgdata 117796maxresident)k
0inputs+0outputs (0major+29503minor)pagefaults 0swaps
$ LD_PRELOAD=./malloc_cache.so /usr/bin/time ./a.out
Command exited with non-zero status 32
0.17user 0.04system 0:00.22elapsed 99%CPU (0avgtext+0avgdata 117796maxresident)k
0inputs+0outputs (0major+29503minor)pagefaults 0swaps
$ LD_PRELOAD=./malloc_cache.so /usr/bin/time ./a.out
Command exited with non-zero status 32
0.12user 0.06system 0:00.19elapsed 99%CPU (0avgtext+0avgdata 117796maxresident)k
0inputs+0outputs (0major+29503minor)pagefaults 0swaps

Attachment: test.c
Description: Text document

Attachment: malloc_cache.c
Description: Text document

