This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Malloc improvements
- From: Anton Blanchard <anton at au1 dot ibm dot com>
- To: DJ Delorie <dj at redhat dot com>
- Cc: carlos at redhat dot com, sid at reserved-bit dot com, libc-alpha at sourceware dot org
- Date: Tue, 12 Jul 2016 22:40:47 +1000
- Subject: Re: Malloc improvements
- References: <20160712101010.6e6cfecb@kryten> <xnlh17k7qt.fsf@greed.delorie.com>
Hi DJ,
> Hmmm... not sure why that test case is worse in my branch, the whole
> point of my work is to add a lockless fast path. I'll have to
> investigate that some more. Conveniently, I have a trace feature in
> there that I'm also working on ;-)
The trace function looks great; I've started playing with it on ppc64.
> But yeah, we know there's a lot of unneeded overhead in glibc's
> malloc, and that other mallocs can do better. Glibc's malloc itself
> states that it's not trying to be the best at one task, but the best
> usable compromise. Part of what I'm doing is understanding where
> these compromises cause significant performance problems, and seeing
> if we can find other ways to solve them without just replacing it
> with something else.
Sounds good.
> Also, of course, obligatory note... your test case is not a typical
> application. We're trying to come up with a way of modelling real
> applications and benchmarking those, instead of relying on trivial
> test cases to "represent" the real world. A change that is worse in
> a test case might be better for real apps, and vice versa.
Agreed, and we can help gather traces.
I just used the trace feature on omnetpp from SPECint2006 (something
glibc malloc does pretty badly at) and it shows a semi-regular repeating
pattern of ~4000 small mallocs (32, 192, 224 bytes), followed by
the freeing of all of them.
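For illustration, the pattern described above could be approximated with a small C loop. This is only a sketch, not the omnetpp workload: the names run_batch and run_rounds are hypothetical, the batch size of 4000 is the rough figure quoted above, and cycling through the three sizes in a fixed order is an assumption (the trace summary does not say how the sizes are interleaved).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Approximate batch size taken from the trace summary (~4000). */
#define BATCH 4000

/* Request sizes reported in the trace; the mixing order is assumed. */
static const size_t sizes[] = { 32, 192, 224 };

/* Allocate BATCH small blocks, cycling through the sizes, then free
   every one of them.  Returns how many blocks were allocated. */
size_t run_batch(void)
{
    static void *blocks[BATCH];
    size_t n = 0;

    for (size_t i = 0; i < BATCH; i++) {
        void *p = malloc(sizes[i % 3]);
        if (p == NULL)
            break;              /* stop early if the allocator fails */
        blocks[n++] = p;
    }
    for (size_t i = 0; i < n; i++)
        free(blocks[i]);
    return n;
}

/* Repeat the allocate-then-free batch `rounds` times.
   Returns the total number of allocations performed. */
size_t run_rounds(size_t rounds)
{
    size_t total = 0;

    for (size_t r = 0; r < rounds; r++)
        total += run_batch();
    return total;
}
```

Calling run_rounds() with a large count from a main() and running it under perf would give a rough stand-in for the burst behaviour, though a real trace replay also preserves ordering and timing that this loop ignores.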
One potential issue: I struggled to capture the entire run.
Even after increasing the trace buffer size several times, I only
traced a fraction of the run:
749055656 out of 10000000 events captured
And the output file was almost 1GB. Having the intermediate ASCII output
is nice though, so I'm not arguing for getting rid of it. After
processing, the binary file size ended up at 30MB.
I'm not sure if I am running the tools correctly (or if I need to add
anything for ppc64 other than the rdtsc* functions), but trace_run
spends most of its time in pthread mutexes on POWER8:
Overhead  Command    Shared Object       Symbol
  33.37%  trace_run  libpthread-2.23.so  [.] __lll_unlock_elision
  24.23%  trace_run  libpthread-2.23.so  [.] __lll_lock_elision
  10.86%  trace_run  trace_run           [.] free_wipe
  10.74%  trace_run  trace_run           [.] thread_common
  10.51%  trace_run  libpthread-2.23.so  [.] pthread_mutex_lock
   2.84%  trace_run  libpthread-2.23.so  [.] pthread_mutex_unlock
   2.27%  trace_run  libc-2.23.so        [.] _int_free
   1.61%  trace_run  libc-2.23.so        [.] malloc
   1.06%  trace_run  libc-2.23.so        [.] _int_malloc
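
On the rdtsc* point mentioned above, a portable tick reader might look roughly like the sketch below. This is not the code from the branch: read_ticks is a hypothetical name, and the choice of the GCC builtin __builtin_ppc_get_timebase() on PowerPC, __rdtsc() on x86, and clock_gettime() elsewhere is an assumption about what a reasonable port would use. Note that the PowerPC Time Base ticks at a fixed frequency rather than at the core clock, so the values are not CPU cycles.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

/* Return a monotonically advancing tick count for timestamping events.
   The unit differs per architecture, so only compare values taken on
   the same machine. */
uint64_t read_ticks(void)
{
#if defined(__powerpc__) || defined(__powerpc64__)
    /* GCC builtin that reads the PowerPC Time Base register. */
    return __builtin_ppc_get_timebase();
#elif defined(__x86_64__) || defined(__i386__)
    /* Compiler intrinsic for the x86 rdtsc instruction. */
    return __rdtsc();
#else
    /* Fallback: nanoseconds from the POSIX monotonic clock. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}
```

Taking read_ticks() before and after an operation and subtracting gives an elapsed tick count in that architecture's unit.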
Anton