This is the mail archive of the
mailing list for the glibc project.
Re: The direction of malloc?
- From: Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>
- To: libc-alpha at sourceware dot org
- Date: Tue, 10 Dec 2013 15:53:06 -0200
- Subject: Re: The direction of malloc?
- Authentication-results: sourceware.org; auth=none
- References: <52A6A0DA dot 1080109 at redhat dot com> <CANu=Dmi32gwk-hQ3dDbj0d4_gs3FWqt02+NmveXH1p03Vm+Mfg at mail dot gmail dot com> <20131210121622 dot GA5416 at domone dot podge>
On 10-12-2013 10:16, OndÅej BÃlka wrote:
> On Tue, Dec 10, 2013 at 10:42:54AM +0000, Will Newton wrote:
>> On 10 December 2013 05:04, Carlos O'Donell <email@example.com> wrote:
>>> For a long time we have refused patches to malloc because they would
>>> change the formatting of the code and make it difficult to ... I can
>>> only assume... synchronize with the rapid upstream development of
>>> It is my opinion that glibc should forge ahead on a new direction for
>>> malloc. We should accept any such patches that would:
>>> * Make it easier for us to maintain the malloc implementation.
>>> - Removing dead code.
>>> - Simplifying macros.
>>> - Removing features we don't use.
>>> * Add new and required features.
>>> - Add support for "have but don't need" kernel pages via vrange.
>>> * Fix bugs.
>>> We should accept these patches without the restrictions we previously
>>> placed on the malloc implementation. We should use GNU formatting to
>>> make the code match all of our other styles. In effect we should
>>> own the copy of ptmalloc2 we started with and change it into the
>>> implementation that we think our users need.
>> I think this is a sound approach. As far as I can tell the glibc
>> malloc has diverged significantly from ptmalloc2 to the point of
>> making changes hard to merge and we shouldn't let this stop us from
>> making incremental improvements to what we have. Also the glibc code
>> is by definition much more widely tested than any specific version of
>> ptmalloc or dlmalloc I think.
>> There is ptmalloc3 but I suspect the upheaval of merging that would be
>> equivalent to switching to any other 3rd party allocator and there are
>> much more advanced allocators available these days.
> There was proposal to use jemalloc, threading building blocks malloc,
> ptmalloc3, I add tcmalloc modified to return memory to system, others?
In fact latest version of tcmalloc do return memory to system, and it also has
tuning variables specific how much and in which frequency to release system
memory. And here at IBM we had good resulting using gperftool compare to
glibc memory allocator in both single and multithread environments.
> Also there is possibility that we do not need to use whole allocator but
> take ideas from these and make parts of allocator more effective.
I think first phase could be get ideas on how to improve the current memory allocators,
* Should we provide thread cache blocks to do provide some lockless allocation?
* Should we provide different allocators based on different algorithms (like AIX one)?
* Should we provide runtime tunables?
And then we can see if the idea fits on the current implementation.
>> One of the things I really need to get into shape is some kind of
>> benchmarking for malloc, including some simple micro-benchmarks and
>> some applications that can be used to test allocator performance. I
>> would not like to embark on integrating a new allocator without some
>> really solid benchmarks in place.
> Problem with microbenchmarks is that you would repeat errors of
> allocator research in 60's where allocators were written on
> microbenchmarks that did not match reality and real performance suffered.
> When reading papers about allocators you must check if they make sense,
> take this one, which measures fragmentation.
> This paper looks good in principle, they collect allocator/free data and
> replay it. Still there is a flaw that they eliminate allocator overhead by
> calling malloc (size - overhead).
> This makes allocator behave differently than in original program and
> could theoreticaly change result. A proper way would be adding a counter
> that tracks overhead of allocator and subtract that in calculations.
> For benchmarking malloc you need to measure a running time and total
> space usage of programs to get reliable results, microbenchmarks make
> sense when you are sure that your optimalization does not change factors
> that they cannot measure.
One idea is to use the tcmalloc defect tracker that contains some synthetic programs
that were used to fix some pattern compared to GLIBC. It may provide an initial
To measure the program space usage, should we focus on both virtual size or just RSS?
This will require analysis like peak usage and mean usage I think.
> The tricky part about malloc is that performance depends on caches, for
> example your application repeately allocates big array and does some
> computation on that:
> volatile char *x;
> int foo ()
> free (x);
> x = calloc (30000);
> Now if allocator can choose between 10 possible locations a best case is
> use LIFO order which keeps that memory hot. Worst case is FIFO best-fit
> where each allocation gives area that is not in L1 cache and for that we
> must evict data from previous allocation.
> I now play with that best-fit does not have to be best strategy, we
> first look in recently freed chunks for best fit and take that if its
> within 10% from best fit and similar heuristic.
> This contrary to ptmalloc documentation that we use best-fit FIFO it
> does not cause much harm. Small chunks are placed on stack and to
> first pick from unsorted array also helps.
> Then when you introduce thread this gets more complicated. You also
> could have a problem that cache layout is different across cores.
> For example you have two threads that allocate and frequently update
> 32-byte objects. When cache line contains objects from both threads
> there will be lot of unnecessary ping-pong to keep coherence.
> This does not matter much for larger allocations as L3 cache is
> typically shared.
> A current implementation also mostly avoids this by having separate
> arenas for threads that minimalize false sharing.