This is the mail archive of the
mailing list for the glibc project.
Re: malloc: performance improvements and bugfixes
- From: Torvald Riegel <triegel at redhat dot com>
- To: Jörn Engel <joern at purestorage dot com>
- Cc: "GNU C. Library" <libc-alpha at sourceware dot org>, Siddhesh Poyarekar <siddhesh dot poyarekar at gmail dot com>, Joern Engel <joern at purestorage dot org>
- Date: Wed, 27 Jan 2016 22:37:44 +0100
- Subject: Re: malloc: performance improvements and bugfixes
- Authentication-results: sourceware.org; auth=none
- References: <1453767872-19161-1-git-send-email-joern at purestorage dot com> <1453810961 dot 4592 dot 100 dot camel at localhost dot localdomain> <20160126171435 dot GG5745 at Sligo dot logfs dot org> <1453828825 dot 4592 dot 108 dot camel at localhost dot localdomain> <20160126172629 dot GH5745 at Sligo dot logfs dot org> <1453829759 dot 4592 dot 116 dot camel at localhost dot localdomain> <20160126175940 dot GK5745 at Sligo dot logfs dot org> <1453900833 dot 4592 dot 134 dot camel at localhost dot localdomain> <20160127173842 dot GQ5745 at Sligo dot logfs dot org>
On Wed, 2016-01-27 at 09:38 -0800, JÃrn Engel wrote:
> On Wed, Jan 27, 2016 at 02:20:33PM +0100, Torvald Riegel wrote:
> > Please note that I was specifically talking about the *model* of
> > workloads. It is true that testing with specific programs running a
> > certain workload (ie, the program's workload, not malloc's) can yield a
> > useful data point. But it's not sufficient if one wants to track
> > performance of a general-purpose allocator, unless one runs *lots* of
> > programs with *lots* of different program-specific workloads.
> > IMO we need a model of the workload that provides us with more insight
> > into what's actually going on in the allocator and the program -- at the
> > very least so that we can actually discuss which trade-offs we want to
> > make. For example, "proprietary big application X was 10% faster" is a
> > data point but doesn't tell us anything actionable, really (except that
> > there is room for improvement). First, why was it faster? Was it due
> > to overheads of actual allocation going down, or was it because of how
> > the allocator places allocated data in the memory hierarchy, or
> > something else? Second, what kinds of programs / allocation patterns
> > are affected by this?
> > Your description of the problems you saw already hinted at some of these
> > aspects but, for example, contained little information about the
> > allocation patterns and memory access patterns of the program during
> > runtime. For example, considering NUMA, allocation patterns influence
> > where allocations end up, based on how malloc works; this and the memory
> > access patterns of the program then affect performance. You can't do
> > much within a page after allocation, so kernel-level auto-NUMA or such
> > has limits. This becomes a bigger problem with larger pages. Thus,
> > there won't be a malloc strategy that's always optimal for all kinds of
> > allocation and access patterns, so we need to understand what's going on
> > beyond "program X was faster". ISTM that you should be able to discuss
> > the allocation and access patterns of your application without revealing
> > the internals of your application.
> I find it tedious to discuss workload patterns while I have patches that
> improve the situation and, for whatever reasons, get ignored. But let
> me humour you anyway.
First, you know what our reasons are. If you show up in a community
and, for some reasons, ignore this community's rules, you can't expect
this community to just fall over and pretend the rules are as you'd like
them to be.
Second, you can't expect this community to look at code dumps that don't
follow the community's rules for contributions, for obvious reasons.
Third, without discussing workload patterns, it's not clear whether your
contributions actually improve the situation because we don't know what
"the situation" is precisely (as it depends on your use cases). (Some
changes may be generally good, but I suspect that most affect some
trade-off (e.g., see the discussion with Szabolcs).
> The kernel's auto-NUMA falls apart as soon as either the memory or the
> thread moves to the wrong node. That happens all the time. And the
> kernel will only give you NUMA-local memory once, at allocation time.
> If the application holds on to that memory for days or years, it such
> one-time decisions don't matter.
My recollection is that newer auto-NUMA (or a daemon -- I don't remember
the names precisely) will indeed move pages to the nodes they are
accessed from. Either way, if the page or thread moves to a place
further away in the memory hierarchy, there's little the allocator can
do; what it can do is try to make that less likely, for example by
getting the initial allocation right (ie, allocate in a page that's
local at allocation time) -- but this again depends on the allocation
and usage patterns in the program to some extent. Hence my question
about these in your program.
> That is why my code calls getcpu() to see which NUMA node we are on
> right now and then returns memory from a NUMA-local arena. The kernel
> might migrate the thread between getcpu() and memory access, but at
> least you get it right 99% of the time instead of 50% (for two nodes).
> So if you want a model for this, how about:
> - multiple threads,
> - no thread affinities,
> - memory moves between threads,
> - application runs long enough to make kernel's decision irrelevant.
That's less than the level of detail we need. It's obvious we have to
consider multiple threads, but how many exactly?
Not selecting thread affinities is a practical choice, though if your
program owns the machine as you indicated earlier, pinning threads to
CPUs isn't an unusual thing to do.
I'm not sure what you mean by "memory moves between threads". Threads
move to different NUMA nodes, pages may be moved by the kernel too. You
might also mean that threads access memory allocated by other threads.
But eventually, we need more detail there to make a decent choice that
is not just arbitrary and based on some incomplete measurements. For
example, for a start, what's the likelihood of a remote vs. local memory
access? What's the relation between memory accesses and allocation?
(IOW, what's the average lifetime of an allocation in memory accesses?)
Given auto-NUMA with runtime movement of pages, the kernel's decisions
do seem to matter.
> Btw, this little detail is rather annoying:
> Note: There is no glibc wrapper for this system call; see NOTES.
> Without a wrapper it is rather painful to use the vdso variant of
> getcpu(). My code does a syscall for every allocation, which is wasting
> performance. And sched_getcpu() doesn't help, because I care about the
> node, not the cpu.
> I have created a couple of testcases. You can also consider them
> microbenchmarks - with all the usual problems. How useful they are
> might be debateable, but you might consider them codified models.
These are all stress tests for certain things you seem to think are a
problem. However, I'd be surprised if these were actually
representative of your big program: you oversubscribe (ie, twice as much
threads as cores) and have extremely short allocation lifetimes (store
to all bytes and read them twice, sequentially).
You also include a fork torture test. Is fork really called frequently
in your big application?
Note that I'm not saying that these won't ever affect real-world
programs, but I'm wondering how likely those issues are to appear in
optimized big applications.
> Another btw, I created my own testcases because it was less painful to
> do so than to extract testcases from libc, tcmalloc or jemalloc. Not
> that I am setting a good example, but clearly none of these projects
> care about some other allocator reusing their testcases.
> So if you genuinely want to help, maybe the best thing would be to
> extract the testcases from all these projects, create a new "malloctest"
> or whatever and make it easy to evaluate and compare allocators with
> your test. Pass/fail would be great, benchmarks would be even better.
> Then shame the guilty parties into improvements.
Something similar is what we would suggest to users: If they believe
glibc has a performance problem, and they want to help fix it, they
should describe the problem and workload in sufficient detail, and add
performance tests, microbenchmarks, or similar that allows glibc folks
to (1) classify the performance problem, (2) have a measurement that is
considered to be a good indicator of the problem. If we then can agree
that the performance problem is likely relevant for real-world use
cases, we have an actionable task for improving this (because it's clear
what the goal is, whether it can be measured, and it can be
regression-tested in the future). Note that we'll also have to
trade-off against other uses if general-purpose code is affected, given
that we have to serve all users.
We appreciate that you want to help glibc, and thus also likely lessen
the burden for you to carry your non-upstream patches. But because we
serve many users, we need to do so in a way that minimizes the negative
effect any changes may have on other applications. We need to provide
justification for performance-related changes so that future developers
know why certain choices were made and the effect of the changes can be
reassessed in the future. Note that this isn't opposition to change
because of some hypothetical concern; instead, it's required to keep
glibc maintainable and to not just amass a set of hacks and ad-hoc
changes that nobody can understand anymore in a few years.