This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
On Tue, Dec 17, 2013 at 01:59:57PM +0100, Torvald Riegel wrote:
> On Mon, 2013-12-16 at 22:23 +0100, Ondřej Bílka wrote:
> > On Mon, Dec 16, 2013 at 05:59:00PM +0100, Torvald Riegel wrote:
> > > On Tue, 2013-12-10 at 22:05 +0100, Ondřej Bílka wrote:
> > > > We have fastbins that sorta do this, but with several problems.
> > > > 1. They are not really lockless: for malloc they need a lock; only
> > > > freeing will be lockless when bug 15073 gets fixed.
> > >
> > > Note that *properly tuned* locks can be as fast as lock-less code in
> > > several cases. When you use a, say, stock PTHREAD_MUTEX_NORMAL lock,
> > > this won't be well-suited to short critical sections because it doesn't
> > > spin. PTHREAD_MUTEX_ADAPTIVE_NP spins, but the number of spinning
> > > iterations seems to be an arbitrary choice, and isn't very useful in my
> > > experiments so far. Also, there's no back-off whatsoever in the
> > > spinning.
> >
> > As was said, acquiring a lock is a hot part in single-threaded
> > applications.
>
> You said you wanted lock-less code. You didn't say you want
> nonconcurrent code. Thus there is concurrency, and you need to
> synchronize in some way. If it involves modifications, you need a CAS
> or other RMW op of some kind, typically, independent of whether you use
> a lock or not. Thus, the lock can have as many or fewer atomic RMW/CAS
> ops than other concurrent code. That's what I pointed out.

There are different degrees of synchronization that may be required. In
malloc the sources of concurrency are having many threads, free from a
different thread, and realloc from a different thread. The first one could
be handled by a cache; for a free we could make the synchronization
overhead arbitrarily low by waiting until we accumulate n bytes for a
given arena and only then returning them; and I think that for realloc the
best course of action is to just copy the data to our own arena, as I
posted earlier.

> > Please explain how spinning could improve performance in single-threaded
> > applications.
> You spoke about lockless code, so obviously concurrent code. My comment
> was thus referring to concurrent code. If you have a single-threaded
> program, then you can avoid synchronization, obviously (ignoring
> synchronization just for reentrancy...).

And could we for malloc use a switch variable to avoid the lock path,
setting it when pthread_create is called? For reentrancy an ordinary
variable suffices.

> > > > Second problem is that fastbins are per-arena, not per-thread, which
> > > > forces us to use atomic operations. These are expensive (typically
> > > > more than 50 cycles).
> > >
> > > Especially on x86, atomic operations that *hit in the cache* have become
> > > very fast compared to their costs in the past. I don't have current
> > > numbers, but I believe the 50 cycle number is too high; I vaguely
> > > remember 10-20.
> >
> > A simple benchmark could check the real cost. A problem is that while for
> > Core 2 and i7 the cost of a CAS is around 20 cycles, for Bulldozer it is
> > still 50 cycles.
>
> Sure, this differs per architecture. But recent Intel CPUs *are*
> common, so it's not quite correct to say that the latency (or
> throughput, depending on the algorithm) of an atomic RMW/CAS op is
> really the primary problem.
>
> > Even with these, a malloc+free pair contains 5 atomic instructions on the
> > fastbin path, which gives a 100-cycle penalty (malloc: lock, get item
> > from bin, unlock; free: set bit that fastbins are in use, put item in
> > bin).
>
> Maybe, but you still need to be careful when drawing conclusions from
> that because there are more performance effects in allocation than just
> the "cycle penalty" you mention (e.g., locality and other memory-system
> effects caused by differences in where data is placed).

It's not just that: pretty much every modern allocator uses some
per-thread cache, so it looks like a good idea, and oprofile also shows
the lock cmpxchg instruction as one of the likely culprits.
When I try the per-thread cache that I posted earlier on a test that does
just malloc and free, it nearly doubles performance. Several factors come
into play: the first is the lack of locking, the second is getting the
branch expectations correct, the third is saving a call to int_malloc.
Then there are several cycles saved by omitting the tests for hooks and
malloc_perturb. A real implementation will be a bit faster, as dynamic TLS
slows this down a bit. Memory-system effects are not a factor here, as the
allocation pattern is identical (a stack in both cases).

For the benchmark, use the following commands:

$ gcc test.c -O3 -o test
$ gcc malloc_cache.c -O3 -fPIC -shared -mtls-dialect=gnu2 -lpthread -ldl -o malloc_cache.so
$ /usr/bin/time ./a.out
Command exited with non-zero status 32
0.29user 0.00system 0:00.29elapsed 99%CPU (0avgtext+0avgdata 512maxresident)k
0inputs+0outputs (0major+168minor)pagefaults 0swaps
$ /usr/bin/time ./a.out
Command exited with non-zero status 32
0.30user 0.00system 0:00.30elapsed 99%CPU (0avgtext+0avgdata 512maxresident)k
0inputs+0outputs (0major+168minor)pagefaults 0swaps
$ /usr/bin/time ./a.out
Command exited with non-zero status 32
0.30user 0.00system 0:00.31elapsed 99%CPU (0avgtext+0avgdata 512maxresident)k
0inputs+0outputs (0major+168minor)pagefaults 0swaps
$ LD_PRELOAD=./malloc_cache.so /usr/bin/time ./a.out
Command exited with non-zero status 32
0.19user 0.04system 0:00.23elapsed 98%CPU (0avgtext+0avgdata 117796maxresident)k
0inputs+0outputs (0major+29503minor)pagefaults 0swaps
$ LD_PRELOAD=./malloc_cache.so /usr/bin/time ./a.out
Command exited with non-zero status 32
0.17user 0.04system 0:00.22elapsed 99%CPU (0avgtext+0avgdata 117796maxresident)k
0inputs+0outputs (0major+29503minor)pagefaults 0swaps
$ LD_PRELOAD=./malloc_cache.so /usr/bin/time ./a.out
Command exited with non-zero status 32
0.12user 0.06system 0:00.19elapsed 99%CPU (0avgtext+0avgdata 117796maxresident)k
0inputs+0outputs (0major+29503minor)pagefaults 0swaps
Attachment: test.c (text document)
Attachment: malloc_cache.c (text document)