This is the mail archive of the mailing list for the glibc project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[PATCH] malloc: add documentation

JIRA: PURE-27597
 tpc/malloc2.13/design | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)
 create mode 100644 tpc/malloc2.13/design

diff --git a/tpc/malloc2.13/design b/tpc/malloc2.13/design
new file mode 100644
index 000000000000..1c3018093cdd
--- /dev/null
+++ b/tpc/malloc2.13/design
@@ -0,0 +1,90 @@
+This malloc is based on glibc 2.13 malloc.  Doug Lea started it all,
+Wolfram Gloger extended it for multithreading, Ulrich Drepper
+maintained it as part of glibc, then "improved" it so much we had to
+fork it.
+For an introduction, read up on the respective versions:
+  dlmalloc:   Doug Lea's version
+  ptmalloc:   Wolfram Gloger's version
+  libcmalloc: libc version as of 2.13
+  puremalloc: Pure Storage's version
+Large allocations are done by mmap.  All small allocations come from
+an arena, which is split into suitable chunks.  Dlmalloc had a single
+arena, which became a locking hotspot.  The arena was enlarged by
+sbrk(2).  Ptmalloc introduced multiple arenas, creating them on the
+fly when observing lock contention.
+Libcmalloc made arenas per-thread, which further reduced lock
+contention, but significantly increased memory consumption.  Glibc
+2.13 was the latest version without per-thread arenas enabled.  Costa
+gave Pure a private copy of 2.13 malloc to avoid the regression.
+Arenas are on a singly-linked list with a pointer kept in thread-local
+storage.  If the last arena used by a thread is locked, the thread
+tries the next arena, etc.  If all arenas are locked, a new arena is
+created.
+Jörn changed this into a singly-linked list per NUMA node.  Threads
+always allocate from a NUMA-local arena.
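The arena-selection loop described above can be sketched as follows.  All identifiers here (struct arena, pick_arena, new_arena) are illustrative, not the actual puremalloc names, and the sketch shows a single global list rather than the per-node lists:

```c
#include <pthread.h>
#include <stdlib.h>

struct arena {
	pthread_mutex_t lock;
	struct arena *next;
};

static struct arena *arena_list;          /* head of singly-linked list */
static __thread struct arena *tls_arena;  /* last arena used by this thread */

static struct arena *new_arena(void)
{
	struct arena *a = calloc(1, sizeof(*a));
	pthread_mutex_init(&a->lock, NULL);
	pthread_mutex_lock(&a->lock);
	a->next = arena_list;    /* racy; real code would lock the list */
	arena_list = a;
	return a;
}

/* Returns a locked arena: start from the arena remembered in
 * thread-local storage, trylock down the list, and create a new arena
 * if every lock is contended. */
static struct arena *pick_arena(void)
{
	struct arena *a = tls_arena ? tls_arena : arena_list;

	for (; a; a = a->next) {
		if (pthread_mutex_trylock(&a->lock) == 0) {
			tls_arena = a;
			return a;
		}
	}
	/* All arenas were locked: create a fresh one. */
	tls_arena = new_arena();
	return tls_arena;
}
```

The fast path costs one trylock; only under contention does a thread walk further down the list or pay for arena creation.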
+NUMA locality:
+All arenas use mbind() to preferentially get memory from just one NUMA
+node.  In case of memory shortage the kernel is allowed to go
+cross-node.  As always, memory shortage should be avoided.
+The getcpu() syscall is used to detect the current NUMA node when
+allocating memory.  If the scheduler moves threads to different NUMA
+nodes, performance will suffer.  No surprise there.  Syscall overhead
+could be a performance problem.  We have plans to create a shared
+memory page containing information like the thread's current NUMA node
+to solve that.  Surprisingly, the syscall overhead doesn't seem that
+high, so it may take a while.
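Detecting the current node can be sketched like this (Linux-only; current_numa_node() is an illustrative helper, not the puremalloc name):

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

static int current_numa_node(void)
{
	unsigned int cpu = 0, node = 0;

	/* The third (cache) argument is unused since Linux 2.6.24. */
	if (syscall(SYS_getcpu, &cpu, &node, NULL) != 0)
		return 0;	/* fall back to node 0 on failure */
	return (int)node;
}
```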
+Using hugepages instead of small pages makes a significant difference
+in process exit time.  We have repeatedly observed >20s spent in
+exit_mm(), freeing all process memory.  Going from small pages to huge
+pages solves that problem.  Puremalloc uses huge pages for all mmap
+allocations.
+The main_arena still uses sbrk() to allocate system memory, i.e. it
+uses small pages.  To solve this the main_arena was taken off the
+per-node lists.  It is still being used in special cases, e.g. when
+creating a new thread.  Changing that is a lot of work and not worth
+it yet.
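A sketch of hugepage-backed mmap.  The fallback policy shown (small pages plus an MADV_HUGEPAGE hint when no hugetlb pages are reserved) is an assumption for illustration, not necessarily what puremalloc does:

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>

static void *mmap_huge(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p != MAP_FAILED)
		return p;

	/* No hugetlb pages available: take small pages and ask for
	 * transparent huge pages instead. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;
	madvise(p, len, MADV_HUGEPAGE);
	return p;
}
```

With MAP_HUGETLB the length must be a multiple of the huge page size (2MB on x86-64).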
+Thread caching:
+tcmalloc and jemalloc demonstrated that a per-thread cache for malloc
+can be beneficial, so we introduced the same in puremalloc.  Freed
+objects initially stay in the thread cache.  Roughly half the time
+they get reused by an allocation shortly after.  If objects exist in
+the cache, malloc doesn't have to take the arena lock.
+When going to the arena, puremalloc pre-allocates a second object.
+Pre-allocation further reduces arena contention.  Pre-allocating more
+than one object yields diminishing returns.  The performance
+difference between thread cache and arena just isn't high enough.
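The thread cache plus pre-allocation can be sketched as below: free() pushes chunks onto a per-thread list, malloc() pops from it without touching the arena lock, and on a miss it takes the lock once but grabs two chunks, keeping the spare.  All identifiers are illustrative, and a real cache bins by size rather than using one list:

```c
#include <pthread.h>
#include <stdlib.h>

struct chunk { struct chunk *next; };

static __thread struct chunk *tcache;	/* per-thread free list */
static pthread_mutex_t arena_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for carving a chunk out of the arena. */
static struct chunk *arena_carve(size_t size)
{
	if (size < sizeof(struct chunk))
		size = sizeof(struct chunk);
	return malloc(size);
}

static void cache_free(void *p)
{
	struct chunk *c = p;
	c->next = tcache;
	tcache = c;
}

static void *cache_alloc(size_t size)
{
	if (tcache) {			/* fast path: no arena lock */
		struct chunk *c = tcache;
		tcache = c->next;
		return c;
	}
	pthread_mutex_lock(&arena_lock);
	struct chunk *c = arena_carve(size);
	struct chunk *spare = arena_carve(size); /* pre-allocate one extra */
	pthread_mutex_unlock(&arena_lock);
	if (spare)
		cache_free(spare);
	return c;
}
```

Each cache miss thus pays for one lock round-trip but satisfies two allocations, which is where the contention reduction comes from.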
+The binning strategies of dlmalloc and jemalloc are pretty ad-hoc and
+discontinuous.  Jemalloc is extremely fine-grained up to 4k, then
+jumps from 4k to 8k to 16k, etc.  As a result, allocations slightly
+above 4k, 8k, etc. result in nearly 100% overhead.  Dlmalloc is less
+bad, but similar.
+For the thread cache, puremalloc uses 16 bins per power of two.  This
+requires an implementation of fls(), which is not standardized in C.
+fls() done in inline assembly is slower than a predicted jump, but
+faster than a mispredicted jump.  Overall performance is about a wash.
+Not having corner cases with 100% memory overhead is a real benefit,
+though.  Worst-case overhead is 6%, and hard to achieve even when
+trying.
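The binning above can be sketched as follows; fls() is derived from the GCC/Clang builtin rather than inline assembly, and the exact bin numbering is illustrative, not the puremalloc one.  With 16 bins per power of two, a bin spans 1/16th of the size, so rounding a request up to its bin boundary wastes at most ~6%:

```c
#include <stddef.h>

static inline int fls64(unsigned long x)
{
	/* fls() is not standardized in C; use the builtin. */
	return x ? 64 - __builtin_clzl(x) : 0;
}

/* Map a size to one of 16 bins per power of two. */
static int bin_index(size_t size)
{
	int k = fls64(size) - 1;	/* size is in [2^k, 2^(k+1)) */
	if (k < 4)
		return (int)size;	/* tiny sizes: one bin each */
	return k * 16 + (int)((size - (1UL << k)) >> (k - 4));
}
```

For example, sizes 4096..4351 share a bin, so an allocation of 4097 bytes is rounded to 4352, an overhead of about 6.2% rather than the near-100% of a 4k-to-8k jump.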
