This is the mail archive of the libc-alpha@sourceware.org
mailing list for the glibc project.
[PATCH] malloc: add documentation
- From: Joern Engel <joern at purestorage dot com>
- To: "GNU C. Library" <libc-alpha at sourceware dot org>
- Cc: Siddhesh Poyarekar <siddhesh dot poyarekar at gmail dot com>, Joern Engel <joern at purestorage dot com>
- Date: Mon, 25 Jan 2016 16:25:13 -0800
- Subject: [PATCH] malloc: add documentation
- Authentication-results: sourceware.org; auth=none
- References: <1453767942-19369-1-git-send-email-joern at purestorage dot com>
tpc/malloc2.13/design | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 90 insertions(+)
create mode 100644 tpc/malloc2.13/design
diff --git a/tpc/malloc2.13/design b/tpc/malloc2.13/design
new file mode 100644
@@ -0,0 +1,90 @@
+This malloc is based on glibc 2.13 malloc. Doug Lea started it all,
+Wolfram Gloger extended it for multithreading, Ulrich Drepper
+maintained it as part of glibc, then "improved" it so much we had to
+fork off a private copy.
+
+For introduction, please read http://gee.cs.oswego.edu/dl/html/malloc.html
+
+dlmalloc: Doug Lea's version
+ptmalloc: Wolfram Gloger's version
+libcmalloc: Libc version as of 2.13
+puremalloc: Pure Storage's version
+
+Large allocations are done by mmap. All small allocations come from
+an arena, which is split into suitable chunks. Dlmalloc had a single
+arena, which provided a locking hotspot. Arena was enlarged by
+sbrk(2). Ptmalloc introduced multiple arenas, creating them on the
+fly when observing lock contention.
+
+Libcmalloc made arenas per-thread, which further reduced lock
+contention, but significantly increased memory consumption. Glibc
+2.13 was the latest version without per-thread arenas enabled. Costa
+gave Pure a private copy of 2.13 malloc to avoid the regression.
+
+Arenas are on a single-linked list with a pointer kept in thread-local
+storage. If the last arena used for a thread is locked, it will try
+the next arena, etc. If all arenas are locked, a new arena is
+created.
+
+Jörn changed this into a single-linked list per numa node. Threads
+always allocate from a numa-local arena.
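+
+Illustratively, the selection loop looks like the sketch below. This
+is not the actual puremalloc code; the struct layout and the
+create_arena() helper are assumptions made for the example.
+
+	#include <pthread.h>
+
+	struct arena {
+		pthread_mutex_t lock;
+		struct arena *next;	/* single-linked, circular per node */
+	};
+
+	static __thread struct arena *tls_arena;	/* last arena used */
+
+	static struct arena *arena_get(struct arena *node_list)
+	{
+		struct arena *a = tls_arena ? tls_arena : node_list;
+		struct arena *start = a;
+
+		do {
+			if (!pthread_mutex_trylock(&a->lock)) {
+				tls_arena = a;
+				return a;
+			}
+			a = a->next;
+		} while (a != start);
+		/* all arenas locked: create a new one (assumed helper) */
+		a = create_arena(node_list);
+		pthread_mutex_lock(&a->lock);
+		tls_arena = a;
+		return a;
+	}
+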
+All arenas use mbind() to preferentially get memory from just one numa
+node. In case of memory shortage the kernel is allowed to go
+cross-numa. As always, memory shortage should be avoided.
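+
+A minimal sketch of the binding, assuming an arena backed by len
+bytes at base (mbind(2) is declared in <numaif.h>, link with -lnuma):
+
+	#include <numaif.h>
+	#include <stddef.h>
+
+	static void arena_bind_node(void *base, size_t len, int node)
+	{
+		unsigned long nodemask = 1UL << node;
+
+		/* MPOL_PREFERRED, not MPOL_BIND: under memory shortage
+		   the kernel may fall back to other nodes */
+		mbind(base, len, MPOL_PREFERRED, &nodemask,
+		      8 * sizeof(nodemask), 0);
+	}
+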
+The getcpu() syscall is used to detect the current numa node when
+allocating memory. If the scheduler moves threads to different numa
+nodes, performance will suffer. No surprise there. Syscall overhead
+could be a performance problem. We have plans to create a shared
+memory page containing information like the thread's current numa node
+to solve that. Surprisingly the syscall overhead doesn't seem that
+high, so it may take a while.
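+
+For now the lookup is a plain syscall, roughly as below (the node-0
+fallback on error is an assumption for the sketch, not the real
+policy):
+
+	#include <unistd.h>
+	#include <sys/syscall.h>
+
+	static int current_node(void)
+	{
+		unsigned int cpu, node;
+
+		if (syscall(SYS_getcpu, &cpu, &node, NULL))
+			return 0;	/* assumed fallback on error */
+		return node;
+	}
+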
+Using hugepages instead of small pages makes a significant difference
+in process exit time. We have repeatedly observed >20s spent in
+exit_mm(), freeing all process memory. Going from small pages to huge
+pages solves that problem. Puremalloc uses huge pages for all mmap
+allocations.
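+
+Something like the sketch below, with a small-page fallback for when
+the hugetlb pool is exhausted (MAP_HUGETLB requires a configured
+hugetlb pool; the fallback is an assumption for the example):
+
+	#include <stddef.h>
+	#include <sys/mman.h>
+
+	static void *mmap_huge(size_t len)
+	{
+		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
+			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
+			       -1, 0);
+
+		if (p == MAP_FAILED)	/* fall back to small pages */
+			p = mmap(NULL, len, PROT_READ | PROT_WRITE,
+				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+		return p;
+	}
+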
+The main_arena still uses sbrk() to allocate system memory, i.e. it
+uses small pages. To solve this the main_arena was taken off the
+per-node lists. It is still being used in special cases, e.g. when
+creating a new thread. Changing that is a lot of work and not worth
+it.
+
+tcmalloc and jemalloc demonstrated that a per-thread cache for malloc
+can be beneficial, so we introduced the same to puremalloc. Freed
+objects initially stay in the thread cache. Roughly half the time
+they get reused by an allocation shortly after. If objects exist in
+cache, malloc doesn't have to take the arena lock.
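+
+The cache itself is an array of single-linked freelists in
+thread-local storage, along these lines (a sketch; the bin count and
+all names are illustrative):
+
+	#define CACHE_BINS 96	/* illustrative */
+
+	struct cached {
+		struct cached *next;
+	};
+
+	static __thread struct cached *cache[CACHE_BINS];
+
+	static void cache_free(void *p, int bin)
+	{
+		struct cached *c = p;
+
+		c->next = cache[bin];
+		cache[bin] = c;
+	}
+
+	static void *cache_alloc(int bin)
+	{
+		struct cached *c = cache[bin];
+
+		if (c)
+			cache[bin] = c->next;
+		return c;	/* NULL means: go to the arena */
+	}
+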
+When going to the arena, puremalloc pre-allocates a second object.
+Pre-allocation further reduces arena contention. Pre-allocating more
+than one object yields diminishing returns. The performance
+difference between thread cache and arena just isn't high enough.
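+
+In sketch form, reusing cache_free() from above (lock_arena() and
+arena_alloc() stand in for the real arena internals):
+
+	static void *alloc_slow(struct arena *a, size_t size, int bin)
+	{
+		void *obj, *spare;
+
+		lock_arena(a);
+		obj = arena_alloc(a, size);
+		spare = arena_alloc(a, size);	/* one extra object */
+		unlock_arena(a);
+		if (spare)
+			cache_free(spare, bin);	/* park it in the cache */
+		return obj;
+	}
+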
+The binning strategies of dlmalloc and jemalloc are pretty ad-hoc and
+discontinuous. Jemalloc is extremely fine-grained up to 4k, then
+jumps from 4k to 8k to 16k, etc. As a result, allocations slightly
+above 4k, 8k, etc. result in nearly 100% overhead. Dlmalloc is less
+bad, but similar.
+
+For the thread cache, puremalloc uses 16 bins per power of two. This
+requires an implementation of fls(), which is not standardized in C.
+Fls() done in inline assembly is slower than a predicted jump, but
+faster than a mispredicted jump. Overall performance is about a wash.
+Not having corner cases with 100% memory overhead is a real benefit,
+though. Worst-case overhead is 6%, and hard to achieve even when
+trying.
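+
+A sketch of the bin calculation, with fls() built on a gcc builtin
+(the tiny-size handling and the raw index are illustrative; the real
+code would offset the index into a dense array):
+
+	#include <stddef.h>
+
+	static inline int fls_long(unsigned long x)
+	{
+		return x ? 8 * sizeof(x) - __builtin_clzl(x) : 0;
+	}
+
+	static int cache_bin(size_t size)
+	{
+		int power;
+
+		if (size <= 16)
+			return 0;	/* illustrative tiny-size bin */
+		power = fls_long(size - 1);
+		/* the top bit picks the power of two, the next four
+		   bits pick one of 16 bins inside it */
+		return (power << 4) + (((size - 1) >> (power - 5)) & 15);
+	}
+
+With 16 bins per power of two, rounding a request up to the next bin
+wastes at most one sixteenth of the object size, which is where the
+6% worst case comes from.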