This is the mail archive of the
libc-alpha@sources.redhat.com
mailing list for the glibc project.
malloc() performance and pthreads
- From: Lubos Lunak <l dot lunak at sh dot cvut dot cz>
- To: libc-alpha at sources dot redhat dot com
- Date: Mon, 11 Feb 2002 17:26:05 +0100
- Subject: malloc() performance and pthreads
Hello,
I've recently discovered that KDE3 applications run slower than
their KDE2 versions, even if the source is almost the same. The reason for
this turned out to be the fact that KDE3 libraries link to threaded version
of Qt (and hence also -lpthread), unlike KDE2. Linking to -lpthread causes
malloc() to do locking, and since dynamic memory allocation is very
extensively used in Qt/KDE for various reasons, this causes a noticeable
performance decrease. The "right" way of fixing this problem might be making
Qt/KDE not use malloc() so extensively, but for now, an easier solution is
improving the malloc() performance itself.
As far as I can tell, the malloc implementation used in glibc seems to be
more or less ptmalloc, which is mainly tuned for multiple threads. This is
however the worst case for KDE. All KDE apps are linked to -lpthread because
of keeping binary compatibility, so all of them use malloc() with locking,
even though almost all KDE apps are single-threaded right now. Even worse,
we're (for various reasons) using malloc() so extensively, that every
instruction in malloc() counts, and even things like using or not using the
inline keyword even for large functions seem to affect performance too.
To give you some numbers, I've measured the time needed to fully render
$QTDIR/doc/html/functions.html (a large HTML page) in KDE3 Konqueror on a
K6/188 computer. The system is SuSE 7.2, glibc-2.2.4, gcc-2.95.3. Simply
using glibc, the time needed was around 60s. When using LD_PRELOAD with a
tuned malloc() implementation, I managed to reduce the time to 39s.
You can see the difference also with
http://dforce.sh.cvut.cz/~seli/download/malloc.tar.bz2 ( just 'make', and it
will run the test app without -lpthread, with -lpthread, without -lpthread
with the tuned malloc() and with -lpthread with the tuned malloc() ).
The tuned malloc() implementation was Doug Lea's malloc (i.e. the one on
which ptmalloc is based), and for locking I simply surrounded all the calls
with a spinlock, taken from LinuxThreads. It also doesn't have hooks, and I
added a few inline keywords here and there.
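To illustrate what "surrounding all the calls with a spinlock" means, here is
a minimal sketch. It is not the code from the tarball: raw_alloc() is a
hypothetical stand-in for an unlocked allocator (a trivial bump allocator, not
Doug Lea's malloc), locked_alloc() is a made-up name, and the lock uses C11
<stdatomic.h>, which postdates this message, in place of the LinuxThreads
testandset().

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a non-thread-safe allocator:
   a trivial bump allocator over a static arena. */
static _Alignas(16) unsigned char arena[1 << 16];
static size_t arena_used = 0;

static void *raw_alloc(size_t n)
{
    size_t rounded = (n + 15u) & ~(size_t)15u;   /* keep 16-byte alignment */
    if (rounded < n || rounded > sizeof arena - arena_used)
        return NULL;                             /* overflow or arena exhausted */
    void *p = arena + arena_used;
    arena_used += rounded;
    return p;
}

/* One global lock guarding every entry point of the allocator. */
static atomic_flag alloc_lock = ATOMIC_FLAG_INIT;

/* Public entry point: take the lock, call the unlocked allocator, release. */
void *locked_alloc(size_t n)
{
    while (atomic_flag_test_and_set(&alloc_lock))
        ;                                        /* spin while another caller holds it */
    void *p = raw_alloc(n);
    atomic_flag_clear(&alloc_lock);
    return p;
}
```

A real replacement would wrap free(), realloc() and calloc() in the same way,
all under the same lock.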
The first question is whether you can do something about the malloc()
implementation in glibc (e.g. using spinlocks instead of mutexes, as that
seems to improve the performance noticeably). And since I'm assuming you
cannot do as many optimizations in malloc() in glibc as I did in mine, I'm
thinking about making KDE use its own version, and I'd appreciate it if you
could help me with one detail about spinlocks.
The problem I have with spinlocks is that I don't know how much of the code
is sufficient in order to make the locking really work. I've looked at the
sources of pthread_mutex_lock() etc. and it does many more things than just
calling testandset(). There are some memory barriers, the compare_and_swap()
variant is preferred over testandset() even though it appears more complex to
me, etc.
Could somebody please tell me which parts from LinuxThreads I should use? In
the sources I linked above, I just used testandset() with sched_yield() for
locking, and just resetting the value for unlocking (search for KLM_THR). Is
that sufficient for malloc(), or what's missing?
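For reference, the scheme described above (test-and-set plus sched_yield() to
lock, resetting the value to unlock) can be sketched as follows, with the
memory ordering made explicit. This is an assumption-laden illustration, not
the LinuxThreads code: it uses C11 <stdatomic.h>, which postdates this
message, and every name in it (demo_spinlock, demo_lock_acquire,
run_spinlock_demo, ...) is hypothetical. The acquire on lock and release on
unlock play the role of the memory barriers mentioned earlier.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical names throughout; this is not LinuxThreads code. */
typedef struct { atomic_flag locked; } demo_spinlock;

static demo_spinlock demo_lock = { ATOMIC_FLAG_INIT };
static long demo_counter = 0;

static void demo_lock_acquire(demo_spinlock *l)
{
    /* Atomic test-and-set. Acquire ordering keeps the critical section
       from being reordered before the point where the lock is taken. */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        sched_yield();   /* give the lock holder a chance to run */
}

static void demo_lock_release(demo_spinlock *l)
{
    /* Release ordering makes writes done inside the critical section
       visible before the lock is seen as free. */
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

static void *demo_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        demo_lock_acquire(&demo_lock);
        demo_counter++;              /* plain, non-atomic update under the lock */
        demo_lock_release(&demo_lock);
    }
    return NULL;
}

/* Runs four threads against the lock and returns the final counter. */
long run_spinlock_demo(void)
{
    pthread_t t[4];
    demo_counter = 0;
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, demo_worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return demo_counter;
}
```

If the lock works, no increments are lost and run_spinlock_demo() returns the
full count from all four threads. A plain store for the unlock, without
release semantics, would not guarantee that on every architecture.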
Thanks
--
Lubos Lunak
llunak@suse.cz ; l.lunak@kde.org
http://dforce.sh.cvut.cz/~seli