This is the mail archive of the
mailing list for the glibc project.
Re: malloc: performance improvements and bugfixes
- From: Torvald Riegel <triegel at redhat dot com>
- To: Jörn Engel <joern at purestorage dot com>
- Cc: "GNU C. Library" <libc-alpha at sourceware dot org>, Siddhesh Poyarekar <siddhesh dot poyarekar at gmail dot com>, Joern Engel <joern at purestorage dot org>
- Date: Wed, 27 Jan 2016 14:20:33 +0100
- Subject: Re: malloc: performance improvements and bugfixes
- Authentication-results: sourceware.org; auth=none
- References: <1453767872-19161-1-git-send-email-joern at purestorage dot com> <1453810961 dot 4592 dot 100 dot camel at localhost dot localdomain> <20160126171435 dot GG5745 at Sligo dot logfs dot org> <1453828825 dot 4592 dot 108 dot camel at localhost dot localdomain> <20160126172629 dot GH5745 at Sligo dot logfs dot org> <1453829759 dot 4592 dot 116 dot camel at localhost dot localdomain> <20160126175940 dot GK5745 at Sligo dot logfs dot org>
On Tue, 2016-01-26 at 09:59 -0800, Jörn Engel wrote:
> On Tue, Jan 26, 2016 at 06:35:59PM +0100, Torvald Riegel wrote:
> > On Tue, 2016-01-26 at 09:26 -0800, Jörn Engel wrote:
> > > On Tue, Jan 26, 2016 at 06:20:25PM +0100, Torvald Riegel wrote:
> > > >
> > > > What do the allocation patterns look like? There can be big variations
> > > > in allocation frequency and size, lifetime of allocated regions,
> > > > relation between allocations and locality, etc. Some programs allocate
> > > > most up-front, others have lots of alloc/dealloc during the lifetime of
> > > > the program.
> > >
> > > Lots of alloc/dealloc during the lifetime. To give you a rough scale,
> > > malloc consumed around 1.7% cputime in the stable state. Now it is down
> > > to about 0.7%.
> > Eventually, I think we'd like to get more detail on this, so that we
> > start tracking performance regressions too and that the model of
> > workloads we have is less hand-wavy than "big application". Given that
> > malloc will remain a general-purpose allocator (at least in the default
> > config / tuning), we'll have to choose trade-offs so that they represent
> > workloads, for which we'll have to classify workloads in some way.
> The workload itself is closed source, so the only ones ever testing with
> that workload are likely us. Hence my handwaving. Having some other
> application available to the general public for testing would be nice;
> I suspect there isn't even a shortage of applications to choose from.
> Firefox uses jemalloc. If you can automate some runs in firefox and
> compare jemalloc to libc malloc, you will likely find the same problems
> we encountered. And from what I heard there is no shortage of open
> source applications that switched over to jemalloc or tcmalloc and could
> be used as well.

Please note that I was specifically talking about the *model* of
workloads. It is true that testing with specific programs running a
certain workload (i.e., the program's workload, not malloc's) can yield a
useful data point. But it's not sufficient if one wants to track
performance of a general-purpose allocator, unless one runs *lots* of
programs with *lots* of different program-specific workloads.

IMO we need a model of the workload that provides us with more insight
into what's actually going on in the allocator and the program -- at the
very least so that we can actually discuss which trade-offs we want to
make. For example, "proprietary big application X was 10% faster" is a
data point but doesn't tell us anything actionable, really (except that
there is room for improvement). First, why was it faster? Was it due
to overheads of actual allocation going down, or was it because of how
the allocator places allocated data in the memory hierarchy, or
something else? Second, what kinds of programs / allocation patterns
are affected by this?
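
To make "model of the workload" a bit more concrete, here is a minimal
sketch of the kind of summary that could be discussed without revealing
application internals. It assumes a recorded trace of
(timestamp, op, block_id, size) events; that trace format and the name
summarize_trace are hypothetical, chosen for illustration only, and
nothing like this exists in glibc.

```python
from collections import Counter


def summarize_trace(events):
    """Summarize (timestamp, op, block_id, size) allocation events.

    op is "malloc" or "free"; size is ignored for "free".
    The event layout is an assumption made for this sketch.
    """
    size_classes = Counter()  # power-of-two bucket -> number of allocations
    lifetimes = []            # time between a malloc and its matching free
    live = {}                 # block_id -> (allocation timestamp, size)
    live_bytes = 0
    peak_live_bytes = 0

    for timestamp, op, block_id, size in events:
        if op == "malloc":
            # Smallest power of two that is >= size (1 for sizes 0 and 1).
            bucket = 1 << max(size - 1, 0).bit_length()
            size_classes[bucket] += 1
            live[block_id] = (timestamp, size)
            live_bytes += size
            peak_live_bytes = max(peak_live_bytes, live_bytes)
        elif op == "free":
            entry = live.pop(block_id, None)
            if entry is None:
                continue  # free of a block the trace never saw allocated
            start, freed_size = entry
            lifetimes.append(timestamp - start)
            live_bytes -= freed_size

    return {
        "allocations": sum(size_classes.values()),
        "size_classes": dict(size_classes),
        "lifetimes": lifetimes,
        "peak_live_bytes": peak_live_bytes,
        "still_live": len(live),
    }
```

A summary like this covers frequency, size, and lifetime only. It says
nothing about locality or memory access patterns, which would need
separate measurement.
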

Your description of the problems you saw already hinted at some of these
aspects but, for example, contained little information about the
allocation patterns and memory access patterns of the program during
runtime. For example, considering NUMA, allocation patterns influence
where allocations end up, based on how malloc works; this and the memory
access patterns of the program then affect performance. Once objects
share a page, they cannot be separated again, so page-granular
mechanisms such as kernel-level auto-NUMA can only help so much. This
becomes a bigger problem with larger pages. Thus,
there won't be a malloc strategy that's always optimal for all kinds of
allocation and access patterns, so we need to understand what's going on
beyond "program X was faster". ISTM that you should be able to discuss
the allocation and access patterns of your application without revealing
the internals of your application.
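
As a toy illustration of the page-granularity point, the sketch below
counts how many pages end up holding objects from more than one thread
under two made-up placement policies. Both policies, the function
names, and the 4096-byte page size are assumptions for this sketch.
Neither policy is meant to describe how glibc malloc actually places
data.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def pages_shared_between_threads(placements, page_size=PAGE_SIZE):
    """Return (shared_pages, total_pages) for (address, size, thread_id) items.

    A page counts as shared if objects from more than one thread touch it.
    """
    owners = {}  # page number -> set of thread ids with data on that page
    for address, size, thread_id in placements:
        first = address // page_size
        last = (address + size - 1) // page_size
        for page in range(first, last + 1):
            owners.setdefault(page, set()).add(thread_id)
    shared = sum(1 for threads in owners.values() if len(threads) > 1)
    return shared, len(owners)


def interleaved(num_threads, objects_per_thread, size):
    """One shared bump pointer; threads take turns allocating."""
    placements = []
    address = 0
    for _ in range(objects_per_thread):
        for thread_id in range(num_threads):
            placements.append((address, size, thread_id))
            address += size
    return placements


def per_thread_regions(num_threads, objects_per_thread, size,
                       page_size=PAGE_SIZE):
    """Each thread bumps through its own page-aligned region."""
    bytes_needed = objects_per_thread * size
    region = (bytes_needed + page_size - 1) // page_size * page_size
    placements = []
    for thread_id in range(num_threads):
        base = thread_id * region
        for i in range(objects_per_thread):
            placements.append((base + i * size, size, thread_id))
    return placements
```

With 4 threads each allocating 256 objects of 64 bytes, the interleaved
policy leaves every one of its 16 pages shared between threads, while
per-thread regions leave none shared. A page shared this way cannot be
placed well for all of its users by migrating it, which is the limit
described above.
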