This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: RFC: replace ptmalloc2


Hello Siddhesh!

On Fri, Oct 10, 2014 at 05:19:20AM +0530, Siddhesh Poyarekar wrote:
> On 10 October 2014 03:24, Jörn Engel <joern@purestorage.com> wrote:
> > I have recently been forced to look at the internals of ptmalloc2.
> > There are some low-hanging fruits for fixing, but overall I find it
> > more worthwhile to replace the allocator with one of the alternatives,
> > jemalloc being my favorite.
> 
> The list archives have a discussion on introducing alternate
> allocators in glibc.  The summary is that we'd like to do it, but it
> doesn't necessarily mean replacing ptmalloc2 completely.  The approach
> we'd like is to have a tunable to select (maybe even for the entire
> OS) the allocator implementation, with ptmalloc2 being the default.

I like that approach.

> > Problems encountered the hard way:
> > - Using per-thread arenas causes horrible memory bloat.  While it is
> >   theoretically possible for arenas to shrink and return memory to the
> >   kernel, that rarely happens in practice.  Effectively every arena
> >   retains the biggest size it has ever had in history (or close to).
> >   Given many threads and dynamic behaviour of individual threads, a
> >   significant ratio of memory can be wasted here.
> 
> The bloat you're seeing is address space and not actually memory.
> That is, the commit charge more often than not is not affected that
> badly.  Also, you can control the number of arenas the allocator
> spawns off by using the MALLOC_ARENA_MAX environment variable (or the
> M_ARENA_MAX mallopt option).

I have seen bloat in physical memory usage.
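
For reference, the knob Siddhesh mentions can also be set from the
program itself; a minimal sketch (the cap of 2 is arbitrary):

#include <malloc.h>

int main(void)
{
    /* Cap the number of malloc arenas; same effect as setting
       MALLOC_ARENA_MAX=2 in the environment. */
    mallopt(M_ARENA_MAX, 2);

    /* ... create threads and allocate as usual ... */
    return 0;
}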

> > - mmap() failing once 65530 vmas are used by a process.  There
> >   is a kernel-bug that plays into this, but ptmalloc2 would hit this
> >   limit even without the kernel bug.  Given a large system, one can go
> >   OOM (malloc returning NULL) with hundreds of gigabytes free on the
> >   system.
> > - mmap_sem causing high latency for multithreaded processes.  Yes,
> >   this is a kernel-internal lock, but ptmalloc2 is the main reason for
> >   hammering the lock.
> 
> Then arenas is not the problem, it is the address space allocated
> using mmap.  Setting MALLOC_MMAP_THRESHOLD_ to a high enough value
> should bring those into one of the arenas, but you risk fragmentation
> by doing that.  Either way, you might find it useful to use the malloc
> systemtap probes[1] to characterize malloc usage in your program.

Again I have to disagree.  When running enough threads, practically all
memory comes from mmap.  Only one arena can use sbrk; all the others have
to grow via mmap, regardless of whether mmap is also used for large
allocations.
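
For completeness, the threshold Siddhesh refers to can be raised from
the program as well, with the fragmentation risk he mentions; a minimal
sketch (the 1MB value is arbitrary):

#include <malloc.h>

int main(void)
{
    /* Serve requests up to 1MB from the arenas instead of mmap();
       same effect as MALLOC_MMAP_THRESHOLD_=1048576 in the environment. */
    mallopt(M_MMAP_THRESHOLD, 1 << 20);

    /* ... allocate as usual ... */
    return 0;
}

It only affects individual large requests, though; the non-main arenas
themselves are still grown with mmap.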

> > Possible improvements found by source code inspection and via
> > testcases:
> > - Everything mentioned in
> >   https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919
> 
> That paper compares glibc 2.5 implementation with jemalloc and the
> former does not have per-thread malloc.  According to performance
> numbers some customers gave us (Red Hat), per-thread arenas give them
> anywhere between 20%-30% improvement in application speed at the
> expense of additional address space usage.

Clearly you get a performance benefit from using per-thread structures
for the hot path.  But arenas are not a good fit for those per-thread
structures.  They tend to stay near their high-watermark of memory
consumption, independent of how much memory is currently in use.  If a
thread peaks at, say, 1GB, then shrinks down to 1MB and stays low for
months, that delta is effectively lost to the system.

Having a smallish per-thread structure that can push memory back to a
common pool (as tcmalloc and jemalloc appear to do) would fix that
problem.  Staying lockless in the per-thread structure 99% of the time
gives you 99% of the locking benefit of per-thread arenas.  Add
cache-locality effects and the remaining 1% is lost in the noise.
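
To make that concrete, the shape I have in mind is roughly the
following.  This is a toy sketch, not tcmalloc's or jemalloc's actual
code; the sizes, the flush threshold and the single global pool are
placeholders:

#include <pthread.h>
#include <stddef.h>

#define CACHE_MAX 64              /* per-thread flush threshold, arbitrary */
#define POOL_MAX  (1 << 16)       /* shared pool capacity, arbitrary */

/* Shared pool; its lock is only taken when a thread cache overflows
   (or, not shown here, when one needs a refill). */
static struct {
    pthread_mutex_t lock;
    void *blocks[POOL_MAX];
    size_t count;
} pool = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* Lockless per-thread cache: the 99% case touches only these. */
static _Thread_local void *cache[CACHE_MAX];
static _Thread_local size_t cached;

void cache_free(void *p)
{
    if (cached < CACHE_MAX) {     /* common case: no lock, no syscall */
        cache[cached++] = p;
        return;
    }

    /* Cache full: push half of it back to the shared pool, where other
       threads can reuse it or it can be returned to the kernel. */
    pthread_mutex_lock(&pool.lock);
    while (cached > CACHE_MAX / 2) {
        void *q = cache[--cached];
        if (pool.count < POOL_MAX)
            pool.blocks[pool.count++] = q;
        /* else: a real allocator would hand q back to the kernel here */
    }
    pthread_mutex_unlock(&pool.lock);

    cache[cached++] = p;
}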

> > - Arenas are a bad choice for per-thread caches.
> 
> Bad is a strong word.  It works well for a lot of cases.  It doesn't
> so much for others.

For the workload I care about it resulted in catastrophic failure.  That
might be an acceptable tradeoff for an optional component, but it would
be nice to avoid a default allocator that is catastrophic for some
workloads.

> > - mprotect_size seems to be responsible for silly behaviour.  When
> >   extending the main arena with sbrk(), one could immediately
> >   mprotect() the entire extension and be done.  Instead mprotect() is
> 
> We don't use mprotect on the main arena.
> 
> >   often called in 4k-granularities.  Each call takes the mmap_sem
> >   writeable and potentially splits off new vmas.  Way too expensive to
> >   do in small granularities.
> >   It gets better when looking at the other arenas.  Memory is
> >   allocated via mmap(PROT_NONE), so every mprotect() will split off
> >   new vmas.  Potentially some of them can get merged later on.  But
> >   current Linux kernels contain at least one bug, so this doesn't
> >   always happen.
> >   If someone is arguing in favor of PROT_NONE as a debug- or
> >   security-measure, I wonder why we don't have the equivalent for the
> >   main arena.  Do we really want the worst of both worlds?
> 
> mprotect usage is not just a diagnostic or security measure, it is
> primarily there to reduce the commit charge of the process.  This is
> what keeps the actual memory usage low for processes despite having
> large address space usage.

Interesting.  I assume that "commit charge" would be memory that shows
up as VmRSS in /proc/$PID/status.  In that case, how does mprotect usage
reduce it?  mmap() returns virtual address space without any backing;
individual pages get faulted in on demand, so mprotect() shouldn't make
any difference.

Clearly there is something here I don't understand.  At least I hope so.
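
For concreteness, the pattern described above boils down to roughly the
following.  This is a standalone illustration, not glibc's arena code;
the 64MB size and the page-at-a-time loop are placeholders:

#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t heap = 64UL << 20;       /* reserve 64MB of address space */
    char *base = mmap(NULL, heap, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return 1;

    /* Make the reservation usable one page at a time.  Each mprotect()
       takes mmap_sem for writing and may split the original vma;
       whether the read-write pieces merge back into one vma is up to
       the kernel.  Watch /proc/self/maps while this runs. */
    for (size_t off = 0; off < heap; off += 4096)
        mprotect(base + off, 4096, PROT_READ | PROT_WRITE);

    getchar();                      /* pause so the maps can be inspected */
    return 0;
}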

> Granularity of mprotect can be evaluated (it should probably be
> max(request, M_TRIM_THRESHOLD)), but again, it should not split off a
> lot of vmas.  At any point, you ought to have only two splits of each
> arena heap - one that is PROT_READ|PROT_WRITE and the other that is
> PROT_NONE since adjacent vmas with the same protection should merge.
> The multiple vmas are either because of arena extensions with arena
> heaps (different concept from the process heap) or due to allocations
> that went directly to mmap.  The latter obviously has more potential
> to overrun the vma limit the way you describe.
> 
> > All of the above have convinced me to abandon ptmalloc2 and use a
> > different allocator for my work project.  But look at the facebook
> > post again and see the 2x performance improvement for their webserver
> > load.  That is not exactly a micro-benchmark for allocators, but
> > translates to significant hardware savings in the real world.  It
> > would be nice to get those savings out of the box.
> 
> It is perfectly fine if you decide to use a different allocator.
> You're obviously welcome to improve the glibc malloc (which definitely
> could use a lot of improvement) and even help build the framework to
> have multiple allocators in glibc to make it easier to choose an
> alternate allocator.  I don't think anybody is working on the latter
> yet.

Is there any prior art I could copy for the multi-allocator framework?
It would be much nicer to steal someone else's good ideas than to have
to come up with my own.
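
The closest thing I can think of is plain ELF symbol interposition:
build a shim as a shared object and LD_PRELOAD it in front of libc.  A
minimal sketch (the __libc_* forwarding is glibc-specific; a real
alternate allocator would implement these entry points itself):

#include <stddef.h>

/* glibc exports these internal entry points; forward to them here. */
extern void *__libc_malloc(size_t size);
extern void *__libc_calloc(size_t nmemb, size_t size);
extern void *__libc_realloc(void *ptr, size_t size);
extern void  __libc_free(void *ptr);

void *malloc(size_t size)               { return __libc_malloc(size); }
void *calloc(size_t nmemb, size_t size) { return __libc_calloc(nmemb, size); }
void *realloc(void *ptr, size_t size)   { return __libc_realloc(ptr, size); }
void  free(void *ptr)                   { __libc_free(ptr); }

/* Build and use, e.g.:
     gcc -shared -fPIC -o shim.so shim.c
     LD_PRELOAD=./shim.so ./app */

But that is a whole-process switch made at load time, not the per-process
or system-wide tunable you describe, so it is not quite the framework in
question.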

Jörn

--
Data dominates. If you've chosen the right data structures and organized
things well, the algorithms will almost always be self-evident. Data
structures, not algorithms, are central to programming.
-- Rob Pike

