This is the mail archive of the mailing list for the glibc project.

Re: malloc: performance improvements and bugfixes

On Wed, Jan 27, 2016 at 02:20:33PM +0100, Torvald Riegel wrote:
> Please note that I was specifically talking about the *model* of
> workloads.  It is true that testing with specific programs running a
> certain workload (ie, the program's workload, not malloc's) can yield a
> useful data point.  But it's not sufficient if one wants to track
> performance of a general-purpose allocator, unless one runs *lots* of
> programs with *lots* of different program-specific workloads.
> IMO we need a model of the workload that provides us with more insight
> into what's actually going on in the allocator and the program -- at the
> very least so that we can actually discuss which trade-offs we want to
> make.  For example, "proprietary big application X was 10% faster" is a
> data point but doesn't tell us anything actionable, really (except that
> there is room for improvement).  First, why was it faster?  Was it due
> to overheads of actual allocation going down, or was it because of how
> the allocator places allocated data in the memory hierarchy, or
> something else?  Second, what kinds of programs / allocation patterns
> are affected by this?
> Your description of the problems you saw already hinted at some of these
> aspects but, for example, contained little information about the
> allocation patterns and memory access patterns of the program during
> runtime.  For example, considering NUMA, allocation patterns influence
> where allocations end up, based on how malloc works; this and the memory
> access patterns of the program then affect performance.  You can't do
> much within a page after allocation, so kernel-level auto-NUMA or such
> has limits.  This becomes a bigger problem with larger pages.  Thus,
> there won't be a malloc strategy that's always optimal for all kinds of
> allocation and access patterns, so we need to understand what's going on
> beyond "program X was faster".  ISTM that you should be able to discuss
> the allocation and access patterns of your application without revealing
> the internals of your application.

I find it tedious to discuss workload patterns while I have patches that
improve the situation and, for whatever reasons, get ignored.  But let
me humour you anyway.

The kernel's auto-NUMA falls apart as soon as either the memory or the
thread moves to the wrong node.  That happens all the time.  And the
kernel will only give you NUMA-local memory once, at allocation time.
If the application holds on to that memory for days or years, such
one-time decisions don't matter.

That is why my code calls getcpu() to see which NUMA node we are on
right now and then returns memory from a NUMA-local arena.  The kernel
might migrate the thread between getcpu() and memory access, but at
least you get it right 99% of the time instead of 50% (for two nodes).
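
A minimal sketch of the node lookup described above, not the code from
the patches.  It assumes Linux and goes through the raw syscall, since
there is no wrapper to call.  MAX_NODES, current_numa_node() and
pick_arena() are hypothetical names made up for illustration, and
"one arena per node" is an assumed layout.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() */
#endif
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>

#define MAX_NODES 8             /* hypothetical upper bound on NUMA nodes */

/* Return the NUMA node the calling thread runs on right now, or -1 if
 * the syscall fails.  The thread may be migrated immediately afterwards,
 * so the answer is only a hint. */
static int current_numa_node(void)
{
    unsigned int cpu = 0, node = 0;

    if (syscall(SYS_getcpu, &cpu, &node, NULL) != 0)
        return -1;
    return (int)node;
}

/* Hypothetical arena selection: one arena per node, with arena 0 as the
 * fallback when the node is unknown or out of range. */
static unsigned int pick_arena(void)
{
    int node = current_numa_node();

    if (node < 0 || node >= MAX_NODES)
        return 0;
    return (unsigned int)node;
}
```

An allocator would call pick_arena() at the top of its allocation path
and serve the request from that arena's free lists.  Done this way,
every allocation pays for one syscall.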

So if you want a model for this, how about:
- multiple threads,
- no thread affinities,
- memory moves between threads,
- application runs long enough to make kernel's decision irrelevant.
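
The model in the list above could be codified roughly like this.  It is
a sketch under assumptions, not one of the attached testcases: the names
(run_model, worker_main, NSLOTS), the block sizes and the slot-exchange
scheme are all made up for illustration.  Threads are left unpinned, and
every block is handed over through a shared slot so that it is usually
freed by a different thread than the one that allocated it.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NSLOTS 256              /* hypothetical size of the exchange table */

static void *slots[NSLOTS];
static pthread_mutex_t slots_lock = PTHREAD_MUTEX_INITIALIZER;

struct worker {
    unsigned int seed;          /* per-thread PRNG state */
    long iters;                 /* allocations to attempt */
    long allocs;                /* allocations that succeeded */
};

/* Small per-thread linear congruential generator, good enough for a test. */
static unsigned int next_rand(unsigned int *seed)
{
    *seed = *seed * 1103515245u + 12345u;
    return (*seed >> 16) & 0x7fffu;
}

static void *worker_main(void *arg)
{
    struct worker *w = arg;

    for (long i = 0; i < w->iters; i++) {
        size_t size = 16 + next_rand(&w->seed) % 1024;
        char *block = malloc(size);

        if (block == NULL)
            break;
        memset(block, 0xa5, size);      /* touch the memory */
        w->allocs++;

        /* Publish the block and take over whatever was in the slot;
         * the old block was most likely allocated by another thread. */
        unsigned int idx = next_rand(&w->seed) % NSLOTS;
        pthread_mutex_lock(&slots_lock);
        void *old = slots[idx];
        slots[idx] = block;
        pthread_mutex_unlock(&slots_lock);
        free(old);                      /* free(NULL) is a no-op */
    }
    return NULL;
}

/* Run nthreads unpinned workers for iters allocations each.  Returns the
 * total number of successful allocations, or -1 on setup failure. */
static long run_model(int nthreads, long iters)
{
    pthread_t *tids = calloc((size_t)nthreads, sizeof *tids);
    struct worker *ws = calloc((size_t)nthreads, sizeof *ws);
    long total = 0;
    int started = 0;

    if (tids == NULL || ws == NULL) {
        free(tids);
        free(ws);
        return -1;
    }
    for (int i = 0; i < nthreads; i++) {
        ws[i].seed = (unsigned int)i + 1;
        ws[i].iters = iters;
        if (pthread_create(&tids[i], NULL, worker_main, &ws[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        total += ws[i].allocs;
    }
    for (int i = 0; i < NSLOTS; i++) {  /* release what is still parked */
        free(slots[i]);
        slots[i] = NULL;
    }
    free(tids);
    free(ws);
    return total;
}
```

The "runs long enough" part is just the iters parameter: the larger it
is, the less the kernel's initial page placement matters compared to
where the allocator hands memory out later.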

Btw, this little detail is rather annoying:
       Note: There is no glibc wrapper for this system call; see NOTES.
Without a wrapper it is rather painful to use the vdso variant of
getcpu().  My code does a syscall for every allocation, which is wasting
performance.  And sched_getcpu() doesn't help, because I care about the
node, not the cpu.

I have created a couple of testcases.  You can also consider them
microbenchmarks - with all the usual problems.  How useful they are
might be debatable, but you might consider them codified models.

Another btw, I created my own testcases because it was less painful to
do so than to extract testcases from libc, tcmalloc or jemalloc.  Not
that I am setting a good example, but clearly none of these projects
care about some other allocator reusing their testcases.

So if you genuinely want to help, maybe the best thing would be to
extract the testcases from all these projects, create a new "malloctest"
or whatever and make it easy to evaluate and compare allocators with
your test.  Pass/fail would be great, benchmarks would be even better.
Then shame the guilty parties into improvements.


If you managed to confuse the compiler, you lost your human readers
a long time ago.
-- hummassa

Attachment: test_malloc.tgz
Description: application/gtar-compressed