This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



malloc: performance improvements and bugfixes


From: Joern Engel <joern@purestorage.org>

Short version:
We have forked libc malloc and added a bunch of patches on top.  Some
patches help performance, some fix bugs, many just change the code to
my personal liking.  Here is a braindump that is _not_ intended to be
merged, at least not as-is.  But individual bits could and should get
extracted.

Long version:
When upgrading glibc from 2.13 to a newer version, we started hitting
OOM bugs.  These were caused by enabling PER_THREAD on newer versions.
We split malloc-2.13 from our previous libc and used that instead.
The beginning of a fork.

Later we found various other problems.  Since we now owned the code,
we fixed them ourselves.  Overall our version is roughly on par with
jemalloc, while stock libc malloc gets replaced by most projects that
care about performance and use multithreading.

Some of our changes may be completely unpalatable to libc.  I made no
distinction and give you the entire list - if only to see what some
people might care about.


Rough list:

Use Lindent and unifdef
I happen to prefer the kernel coding style over GNU coding style.
These only helped me read the code and make changes, but are
absolutely not upstream material.  Sorry about the noise.

Revert PER_THREAD
Per-thread arenas are an exquisitely bad idea.  If a thread uses a lot
of memory, then frees most, malloc will hang on to the free memory and
neither return it to the system nor use it for other threads.
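
To illustrate the failure mode, a test along these lines shows it (my
sketch for this mail, not code from our tree; sizes are arbitrary):

/* A worker thread allocates ~1 GiB in 64k chunks, frees all of it and
 * exits.  With per-thread arenas the freed memory stays attached to
 * the thread's arena instead of being returned to the system or
 * reused by other threads. */
#include <malloc.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NCHUNKS	(1 << 14)
#define CHUNKSZ	(1 << 16)	/* 16384 chunks of 64k = 1 GiB */

static void *worker(void *arg)
{
	static void *p[NCHUNKS];

	(void)arg;
	for (int i = 0; i < NCHUNKS; i++) {
		p[i] = malloc(CHUNKSZ);
		memset(p[i], 1, CHUNKSZ);	/* actually touch the pages */
	}
	for (int i = 0; i < NCHUNKS; i++)
		free(p[i]);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, worker, NULL);
	pthread_join(&t, NULL);
	malloc_stats();		/* freed memory still counted against the arena */
	return 0;
}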

Remove mprotect
While I admit that some people might care about commit charge, I wager
that most people don't and in particular we don't.  The way malloc
uses mprotect turned the mmap_sem into the single worst lock inside
the Linux kernel.  Removing mprotect mostly fixed that.
Mprotect also triggers bad behaviour in the kernel VM: far more VMAs
get created, and once a process reaches the 64k VMA limit the kernel
refuses any further mmap() for it.  We effectively ran out of memory
with gigabytes of free memory available to the system.
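
For reference, the difference boils down to how the per-arena heaps
get mapped.  A sketch of both variants (not our actual patch;
HEAP_MAX_SIZE as in glibc):

#include <sys/mman.h>

#define HEAP_MAX_SIZE	(64UL * 1024 * 1024)

static void *reserve_heap(void)
{
#ifdef USE_MPROTECT		/* stock: reserve, then mprotect() as it grows */
	void *h = mmap(NULL, HEAP_MAX_SIZE, PROT_NONE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

	if (h != MAP_FAILED)	/* commit the first page, more later */
		mprotect(h, 4096, PROT_READ | PROT_WRITE);
	return h;
#else				/* ours: commit everything up front, one VMA */
	return mmap(NULL, HEAP_MAX_SIZE, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
#endif
}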

Use hugepages
In our project hugepages have become a necessity for low latency.
Transparent hugepages aren't good enough, so we have to deal with them
explicitly.  Probably not upstream material.
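
The obvious way to do it explicitly is MAP_HUGETLB with a fallback,
roughly like this (a sketch, assuming hugepages were reserved via
/proc/sys/vm/nr_hugepages):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define HUGEPAGE_SIZE	(2UL * 1024 * 1024)	/* assuming 2M pages */

static void *alloc_huge(size_t size)
{
	void *p;

	size = (size + HUGEPAGE_SIZE - 1) & ~(HUGEPAGE_SIZE - 1);
	p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED)	/* no hugepages left: fall back to small pages */
		p = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}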

Cleanup of arena_get macros
Removes duplicate (and buggy) code and simplifies the logic.  Existing
code outgrew the size where macros may have made sense.

NUMA support
Once you have a NUMA system, this helps a lot.  Currently does a
syscall for every allocation.  Surprisingly the syscall hardly shows
up in profiles and the benefits clearly dominate.  If libc exposed the
vdso-version of getcpu(), that would be much nicer.
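
The allocation-side logic is simple enough.  A sketch, with the
per-node arena table and MAX_NUMA_NODES made up for illustration:

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

#define MAX_NUMA_NODES	8		/* hypothetical limit */
typedef struct arena arena_t;		/* stand-in for malloc_state */

static arena_t *numa_arena[MAX_NUMA_NODES];	/* one arena per node */

static arena_t *arena_for_this_cpu(void)
{
	unsigned cpu, node;

	/* One syscall per allocation; hardly shows up in profiles,
	 * but a vdso getcpu() would be cheaper still. */
	if (syscall(SYS_getcpu, &cpu, &node, NULL) != 0)
		node = 0;	/* fall back to node 0 */
	return numa_arena[node % MAX_NUMA_NODES];
}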

Remove __builtin_expect
I benchmarked the effect.  Even if I reversed the logic and marked
unlikely branches likely and vice versa, there was absolutely no
measurable effect.  Filed under cargo cult and removed.
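
For anyone unfamiliar, this is the kind of hint in question; a generic
example, not a line from malloc.c:

#define unlikely(x)	__builtin_expect(!!(x), 0)

int checked_div(int a, int b)
{
	if (unlikely(b == 0))	/* hint: branch predicted not-taken */
		return 0;
	return a / b;
}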

Revert 1d05c2fb9c6f (Val and Arjan's change to dynamically grow heaps)
I couldn't figure out how the logic actually worked.  While I might
not be the best programmer in the world, I find that disturbing for
what is conceptually such a simple change.  Hence,...

Removed ATOMIC_FASTBINS
Not sure if this was a good change, but the atomic_free_list
(below) recovered the performance, covers more than just fastbins and
is simpler code.

Added a thread cache
A per-thread cache gives most of the performance benefits of
per-thread arenas without the drawback of memory bloat.  128k is less
than most people's stack consumption, so the cost should be
acceptable.
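
The mechanism is the usual one: free() pushes small chunks onto
thread-local bins, malloc() pops them, and the total is capped at
128k.  A stripped-down sketch with a made-up bin layout; the real code
works on malloc's chunk metadata:

#include <stddef.h>

#define CACHE_BINS	64
#define CACHE_LIMIT	(128 * 1024)	/* hard cap per thread */

struct tcache_ent {
	struct tcache_ent *next;
};

static __thread struct tcache_ent *tcache_bin[CACHE_BINS];
static __thread size_t tcache_bytes;

/* Called from free(); bin is the size class the caller computed. */
static int tcache_put(void *chunk, size_t size, unsigned bin)
{
	struct tcache_ent *e = chunk;

	if (bin >= CACHE_BINS || tcache_bytes + size > CACHE_LIMIT)
		return 0;	/* cache full: take the normal arena path */
	e->next = tcache_bin[bin];
	tcache_bin[bin] = e;
	tcache_bytes += size;
	return 1;
}

/* Called from malloc() before touching the arena. */
static void *tcache_get(size_t size, unsigned bin)
{
	struct tcache_ent *e = bin < CACHE_BINS ? tcache_bin[bin] : NULL;

	if (e) {
		tcache_bin[bin] = e->next;
		tcache_bytes -= size;
	}
	return e;
}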

Added atomic_free_list
Makes free() lockless.  If the arena is locked, we just push memory to
the atomic_free_list and let someone else do the actual free later.
Before this change we had an average of three (3) threads blocked on
an arena lock in the steady state.
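
In outline: free() tries the arena lock, and if that fails it pushes
the chunk onto a per-arena lock-free stack that the next lock holder
drains.  A sketch using C11 atomics (the real code uses glibc's older
atomic primitives):

#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct free_ent {
	struct free_ent *next;
};

struct arena {
	pthread_mutex_t lock;
	_Atomic(struct free_ent *) atomic_free_list;
};

/* Stand-in for the real chunk-freeing logic, which runs under the lock. */
static void do_free(struct arena *ar, struct free_ent *e)
{
	(void)ar;
	(void)e;
}

/* Called with the arena lock held: grab the whole backlog and free it. */
static void drain_free_list(struct arena *ar)
{
	struct free_ent *e = atomic_exchange(&ar->atomic_free_list, NULL);

	while (e) {
		struct free_ent *next = e->next;

		do_free(ar, e);
		e = next;
	}
}

static void lockless_free(struct arena *ar, struct free_ent *e)
{
	if (pthread_mutex_trylock(&ar->lock) == 0) {
		drain_free_list(ar);
		do_free(ar, e);
		pthread_mutex_unlock(&ar->lock);
		return;
	}
	/* Lock is busy: push the chunk and let the lock holder free it. */
	struct free_ent *head = atomic_load(&ar->atomic_free_list);

	do {
		e->next = head;
	} while (!atomic_compare_exchange_weak(&ar->atomic_free_list,
					       &head, e));
}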

Fix startup race
I suppose no one ever hit this because the main_arena initialized so
darn fast that they always won the race.  I changed the timings,
mostly with NUMA code, and started losing.

Simplify calloc
I believe the same also happened upstream and later got reverted.  I
couldn't find the rationale for the revert and find it dodgy.
Technically the existing version of calloc can be faster early on, but
not for long-running processes in the steady state.  And once I found
bugs in calloc I couldn't be arsed to debug them and just removed most
of the code.
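
What remains is essentially malloc plus memset, with the overflow
check kept.  Roughly:

#include <stdlib.h>
#include <string.h>

void *simple_calloc(size_t nmemb, size_t size)
{
	void *p;

	if (size != 0 && nmemb > (size_t)-1 / size)
		return NULL;	/* nmemb * size would overflow */
	p = malloc(nmemb * size);
	if (p)
		memset(p, 0, nmemb * size);
	return p;
}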

Made malloc signal-safe
I think malloc() was always signal-safe, but free() wasn't.  It isn't
hard to trigger this in a testcase.  Our version survives such a test,
mostly because of the atomic_free_list.
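
The testcase shape is simple: free() from a signal handler while the
main flow hammers malloc.  A sketch; note that POSIX doesn't declare
free() async-signal-safe, so stock malloc is technically allowed to
fail this:

#include <signal.h>
#include <stdlib.h>
#include <sys/time.h>

static void *volatile pending;

static void handler(int sig)
{
	void *p = pending;

	(void)sig;
	if (p) {
		pending = NULL;
		free(p);	/* re-enters malloc's data structures mid-flight */
	}
}

int main(void)
{
	struct itimerval it = { { 0, 100 }, { 0, 100 } };	/* every 100us */

	signal(SIGALRM, handler);
	setitimer(ITIMER_REAL, &it, NULL);
	for (long i = 0; i < 10000000; i++) {
		if (!pending)
			pending = malloc(64);
		free(malloc(128));	/* keep the main flow inside malloc */
	}
	return 0;
}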

Fix calculation of aligned heaps
Looks like this was always buggy.  Is that correct or was I misreading
the code?
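
For context, the usual way to get a HEAP_MAX_SIZE-aligned mapping is
to over-map twice the size and trim both ends.  A sketch of that
technique, not the glibc code in question:

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define HEAP_MAX_SIZE	(64UL * 1024 * 1024)	/* power of two */

static void *map_aligned_heap(void)
{
	char *p = mmap(NULL, 2 * HEAP_MAX_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	uintptr_t lead;

	if (p == MAP_FAILED)
		return NULL;
	lead = (-(uintptr_t)p) & (HEAP_MAX_SIZE - 1);	/* bytes to alignment */
	if (lead)
		munmap(p, lead);			/* trim misaligned head */
	munmap(p + lead + HEAP_MAX_SIZE, HEAP_MAX_SIZE - lead);	/* trim tail */
	return p + lead;
}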

Remove hooks
I don't understand what problem they were supposed to solve.  Our
project doesn't seem to need them and I have testcases that break
because of the hooks.


If any of this looks interesting for upstream and you have questions,
feel free to pester me.

And maybe as a closing note, I believe there are some applications that
have deeper knowledge about malloc-internal data structures than they
should (*cough*emacs).  As a result it has become impossible to change
the internals of malloc without breaking said applications, and libc
malloc has ossified.

At this point, either a handful of applications need to ship the
ossified version of malloc or Everything Else(tm) has to switch to a
better version of malloc.  The reality we live in has everything else
ship tcmalloc, jemalloc or somesuch and libc malloc is slowly becoming
irrelevant and the butt of hallway jokes.  I don't find this reality
very desirable, and yet here we are.

Jörn

