This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.



malloc() performance and pthreads



 Hello,

 I've recently discovered that KDE3 applications run slower than their KDE2 
versions, even though the source is almost the same. The reason turned out to 
be that the KDE3 libraries link to the threaded version of Qt (and hence also 
to -lpthread), unlike KDE2. Linking to -lpthread causes malloc() to do 
locking, and since dynamic memory allocation is used very extensively in 
Qt/KDE for various reasons, this causes a noticeable performance decrease. 
The "right" fix might be to make Qt/KDE use malloc() less extensively, but 
for now an easier solution is to improve the performance of malloc() itself.

 As far as I can tell, the malloc implementation used in glibc is more or 
less ptmalloc, which is tuned mainly for multiple threads. That is, however, 
the worst case for KDE. All KDE apps are linked to -lpthread in order to keep 
binary compatibility, so all of them use malloc() with locking, even though 
almost all KDE apps are single-threaded right now. Even worse, we use 
malloc() so extensively (for various reasons) that every instruction in 
malloc() counts, and even things like using or not using the inline keyword 
on large functions seem to affect performance.

 To give you some numbers, I've measured the time needed to fully render 
$QTDIR/doc/html/functions.html (a large HTML page) in KDE3 Konqueror on a 
K6/188 computer. The system is SuSE 7.2, glibc-2.2.4, gcc-2.95.3. With plain 
glibc, the time needed was around 60s. Using LD_PRELOAD with a tuned malloc() 
implementation, I managed to reduce the time to 39s.
 You can also see the difference with 
http://dforce.sh.cvut.cz/~seli/download/malloc.tar.bz2 (just run 'make', and 
it will run the test app in four configurations: without -lpthread, with 
-lpthread, without -lpthread with the tuned malloc(), and with -lpthread with 
the tuned malloc()).

 The tuned malloc() implementation is Doug Lea's malloc (i.e. the one on 
which ptmalloc is based); for locking, I simply surrounded all the calls with 
a spinlock taken from LinuxThreads. It also has no hooks, and I added a few 
inline keywords here and there.
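
 The "surround all the calls with a lock" approach described above could be 
sketched roughly as follows. This is only an illustration, not the code from 
the tarball: the names heap_lock, raw_alloc(), raw_free(), locked_alloc() and 
locked_free() are hypothetical, the system malloc()/free() merely stand in 
for the unlocked Doug Lea entry points, and C11 atomics are assumed in place 
of the LinuxThreads spinlock.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

/* One global lock guarding the whole allocator (hypothetical name). */
static atomic_flag heap_lock = ATOMIC_FLAG_INIT;

static void heap_lock_acquire(void)
{
    /* Retry until the flag was previously clear; yield while it is held. */
    while (atomic_flag_test_and_set_explicit(&heap_lock, memory_order_acquire))
        sched_yield();
}

static void heap_lock_release(void)
{
    atomic_flag_clear_explicit(&heap_lock, memory_order_release);
}

/* Stand-ins for the unlocked allocator. The system allocator is used here
   only so that the sketch compiles and links on its own. */
static void *raw_alloc(size_t size) { return malloc(size); }
static void raw_free(void *ptr)     { free(ptr); }

/* Every public entry point takes the lock, calls the unlocked routine,
   and drops the lock again before returning. */
void *locked_alloc(size_t size)
{
    heap_lock_acquire();
    void *ptr = raw_alloc(size);
    heap_lock_release();
    return ptr;
}

void locked_free(void *ptr)
{
    heap_lock_acquire();
    raw_free(ptr);
    heap_lock_release();
}
```

 With a single global lock like this, the uncontended path in a 
single-threaded program costs one atomic operation on entry and one store on 
exit per call.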

 The first question is whether you can do something about the malloc() 
implementation in glibc (e.g. use spinlocks instead of mutexes, as that seems 
to improve performance noticeably). And since I assume you cannot do as many 
optimizations to malloc() in glibc as I did in mine, I'm thinking about 
making KDE use its own version, and I'd appreciate your help with one detail 
about spinlocks.
 
 The problem I have with spinlocks is that I don't know how much of the code 
is needed to make the locking really work. I've looked at the sources of 
pthread_mutex_lock() etc., and they do many more things than just call 
testandset(). There are some memory barriers, the compare_and_swap() variant 
is preferred over testandset() even though it appears more complex to me, 
etc.

 Could somebody please tell me which parts of LinuxThreads I should use? In 
the sources linked above, I just used testandset() with sched_yield() for 
locking, and simply reset the value for unlocking (search for KLM_THR). Is 
that sufficient for malloc(), or what's missing?
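
 For illustration, a test-and-set lock with sched_yield() and explicit 
memory ordering might look like the sketch below. It assumes a C11 compiler 
with <stdatomic.h> rather than the LinuxThreads internals, and the names 
klm_spinlock, klm_spin_lock() and klm_spin_unlock() are hypothetical; it is 
not the KLM_THR code from the tarball. The acquire/release orderings play the 
role of the memory barriers mentioned above.

```c
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical lock type wrapping a C11 atomic flag. */
typedef struct {
    atomic_flag locked;
} klm_spinlock;

#define KLM_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

static inline void klm_spin_lock(klm_spinlock *lock)
{
    /* Acquire ordering keeps reads and writes of the protected data
       from being moved before the point where the lock is taken. */
    while (atomic_flag_test_and_set_explicit(&lock->locked,
                                             memory_order_acquire))
        sched_yield();  /* lock is held: give up the CPU and retry */
}

static inline void klm_spin_unlock(klm_spinlock *lock)
{
    /* Release ordering makes all writes done under the lock visible
       before the lock appears free. A plain, unordered store gives no
       such guarantee on machines that reorder memory accesses. */
    atomic_flag_clear_explicit(&lock->locked, memory_order_release);
}
```

 A caller would declare "static klm_spinlock lock = KLM_SPINLOCK_INIT;" and 
bracket each critical section with klm_spin_lock(&lock) and 
klm_spin_unlock(&lock).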

 Thanks

-- 
 Lubos Lunak
 llunak@suse.cz ; l.lunak@kde.org
 http://dforce.sh.cvut.cz/~seli

