This is the mail archive of the libc-alpha@sourceware.org
mailing list for the glibc project.
malloc: performance improvements and bugfixes
- From: Joern Engel <joern at purestorage dot com>
- To: "GNU C. Library" <libc-alpha at sourceware dot org>
- Cc: Siddhesh Poyarekar <siddhesh dot poyarekar at gmail dot com>, Joern Engel <joern at purestorage dot org>
- Date: Mon, 25 Jan 2016 16:24:39 -0800
- Subject: malloc: performance improvements and bugfixes
- Authentication-results: sourceware.org; auth=none
We have forked libc malloc and added a bunch of patches on top. Some
patches help performance, some fix bugs, many just change the code to
my personal liking. Here is a braindump that is _not_ intended to be
merged, at least not as-is. But individual bits could and should get
merged.
When upgrading glibc from 2.13 to a newer version, we started hitting
OOM bugs. These were caused by enabling PER_THREAD on newer versions.
We split malloc-2.13 from our previous libc and used that instead.
The beginning of a fork.
Later we found various other problems. Since we now owned the code,
we made use of it. Overall our version is roughly on-par with
jemalloc, while libc malloc gets replaced by most projects that care
about performance and use multithreading.
Some of our changes may be completely unpalatable to libc. I made no
distinction and give you the entire list - if only to see what some
people might care about.
Use Lindent and unifdef.
I happen to prefer the kernel coding style over GNU coding style.
These only helped me read the code and make changes, but are
absolutely no upstream material. Sorry about the noise.
Per-thread arenas are an exquisitely bad idea. If a thread uses a lot
of memory, then frees most, malloc will hang on to the free memory and
neither return it to the system nor use it for other threads.
While I admit that some people might care about commit charge, I wager
that most people don't and in particular we don't. The way malloc
uses mprotect turned the mmap_sem into the single worst lock inside
the Linux kernel. Removing mprotect mostly fixed that.
Mprotect also triggers bad behaviour in the kernel VM. Far more VMAs
get created, and after reaching 64k VMAs the kernel refuses further
mmap() calls for our process. We effectively ran out of memory with
gigabytes of free
memory available to the system.
In our project hugepages have become a necessity for low latency.
Transparent hugepages aren't good enough, so we have to deal with them
explicitly. Probably not upstream-material.
Cleanup of arena_get macros
Removes duplicate (and buggy) code and simplifies the logic. Existing
code outgrew the size where macros may have made sense.
Binding arenas to NUMA nodes helps a lot once you have a NUMA system.
Currently this does a getcpu() syscall for every allocation.
Surprisingly the syscall hardly shows
up in profiles and the benefits clearly dominate. If libc exposed the
vdso-version of getcpu(), that would be much nicer.
I benchmarked the effect of the branch prediction hints. Even when I
reversed the logic and marked unlikely branches likely and vice versa,
there was absolutely no measurable effect. Filed under cargo cult and
removed.
Revert 1d05c2fb9c6f (Val and Arjan's change to dynamically grow heaps)
I couldn't figure out how the logic actually worked. While I might
not be the best programmer in the world, I find that disturbing for
what is conceptually such a simple change. Hence,...
Not sure if this was a good change, but the atomic_free_list
(below) recovered the performance, covers more than just fastbins and
is simpler code.
Added a thread cache
A per-thread cache gives most of the performance benefits of
per-thread arenas without the drawback of memory bloat. 128k is less
than most people's stack consumption, so the cost should be
acceptable.
Makes free() lockless. If the arena is locked, we just push memory to
the atomic_free_list and let someone else do the actual free later.
Before this change we had an average of three (3) threads blocked on
an arena lock in the stable state.
Fix startup race
I suppose no one ever hit this because the main_arena initialized so
darn fast that they always won the race. I changed the timings,
mostly with NUMA code, and started losing.
I believe the same also happened upstream and later got reverted. I
couldn't find the rationale for the revert and find it dodgy.
Technically the existing version of calloc can be faster early on, but
not for long-running processes in the stable state. And once I found
bugs in calloc I couldn't be arsed to debug them and just removed most
of the code.
Made malloc signal-safe
I think malloc() was always signal-safe, but free() wasn't. It isn't
hard to trigger this in a testcase. Our version survives such a test,
mostly because of the atomic_free_list.
Fix calculation of aligned heaps
Looks like this was always buggy. Is that correct or was I misreading
the code?
I don't understand what problem the malloc hooks were supposed to
solve. Our
project doesn't seem to need them and I have testcases that break
because of the hooks.
If any of this looks interesting for upstream and you have questions,
feel free to pester me.
And maybe as a closing note, I believe there are some applications that
have deeper knowledge about malloc-internal data structures than they
should (*cough*emacs). As a result it has become impossible to change
the internals of malloc without breaking said applications and libc
malloc has ossified.
At this point, either a handful of applications need to ship the
ossified version of malloc or Everything Else(tm) has to switch to a
better version of malloc. The reality we live in has everything else
ship tcmalloc, jemalloc or somesuch and libc malloc is slowly becoming
irrelevant and the butt of hallway jokes. I don't find this reality
very desirable, and yet here we are.