This is the mail archive of the mailing list for the glibc project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: The direction of malloc?

On Tue, Dec 10, 2013 at 03:53:06PM -0200, Adhemerval Zanella wrote:
> > On 10-12-2013 10:16, Ondřej Bílka wrote:
> > On Tue, Dec 10, 2013 at 10:42:54AM +0000, Will Newton wrote:
> >> On 10 December 2013 05:04, Carlos O'Donell <> wrote:
> >>> For a long time we have refused patches to malloc because they would
> >>> change the formatting of the code and make it difficult to ... I can
> >>> only assume... synchronize with the rapid upstream development of
> >>> ptmalloc.
> >>>
> >>> It is my opinion that glibc should forge ahead on a new direction for
> >>> malloc.  We should accept any such patches that would:
> >>>
> >>> * Make it easier for us to maintain the malloc implementation.
> >>>   - Removing dead code.
> >>>   - Simplifying macros.
> >>>   - Removing features we don't use.
> >>>
> >>> * Add new and required features.
> >>>   - Add support for "have but don't need" kernel pages via vrange.
> >>>
> >>> * Fix bugs.
> >>>
> >>> We should accept these patches without the restrictions we previously
> >>> placed on the malloc implementation. We should use GNU formatting to
> >>> make the code match all of our other styles. In effect we should
> >>> own the copy of ptmalloc2 we started with and change it into the
> >>> implementation that we think our users need.
> >> I think this is a sound approach. As far as I can tell the glibc
> >> malloc has diverged significantly from ptmalloc2 to the point of
> >> making changes hard to merge and we shouldn't let this stop us from
> >> making incremental improvements to what we have. Also the glibc code
> >> is by definition much more widely tested than any specific version of
> >> ptmalloc or dlmalloc I think.
> >>
> >> There is ptmalloc3 but I suspect the upheaval of merging that would be
> >> equivalent to switching to any other 3rd party allocator and there are
> >> much more advanced allocators available these days.
> >>
> > There were proposals to use jemalloc, the Threading Building Blocks malloc, and
> > ptmalloc3; I would add tcmalloc modified to return memory to the system. Others?
> In fact the latest version of tcmalloc does return memory to the system, and it also has
> tuning variables controlling how much memory to release and how frequently. And here at
> IBM we had good results using gperftools compared to the
> glibc memory allocator in both single- and multithreaded environments.
> >
> > Also, there is the possibility that we do not need to adopt a whole allocator, but
> > can take ideas from these and make parts of our allocator more effective.
> I think the first phase could be gathering ideas on how to improve the current memory
> allocator, like:
> * Should we provide thread cache blocks to allow some lockless allocation?

This is the most low-hanging fruit and the one I am aiming for. We already
use TLS to determine the arena, so this should not be an issue.

We have fastbins that partly do this, but with several problems:

1. They are not really lockless: malloc still needs a lock, and only
freeing will be lockless once bug 15073 gets fixed.

2. Fastbins are per-arena, not per-thread, which forces us to use
atomic operations. These are expensive (typically more than 50 cycles).

Moving these to per-thread bins mostly just requires refactoring the current
code into something that makes more sense.
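Such a per-thread cache could be sketched roughly as follows. This is only an illustration, not a proposed implementation: the names (tcache, tcache_put, tcache_get), the bin count, the per-bin cap, and the 16-byte size-class mapping are all assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical sketch: a per-thread cache of freed chunks, one
   singly-linked list per size class.  Because the cache lives in TLS,
   the fast path needs neither locks nor atomic operations.  */

#define TCACHE_BINS 8          /* illustrative bin count */
#define TCACHE_MAX_PER_BIN 16  /* cap memory held back per thread */

struct tcache_entry
{
  struct tcache_entry *next;
};

struct tcache
{
  struct tcache_entry *bins[TCACHE_BINS];
  unsigned counts[TCACHE_BINS];
};

static __thread struct tcache tcache;  /* per-thread, hence lockless */

/* Map a request size to a bin index; illustrative 16-byte classes.  */
static size_t
tcache_bin_index (size_t size)
{
  return (size - 1) / 16;
}

/* Try to stash a freed chunk in the calling thread's cache.
   Returns 1 on success, 0 if the chunk must go back to the arena.  */
static int
tcache_put (void *chunk, size_t size)
{
  size_t idx = tcache_bin_index (size);
  if (idx >= TCACHE_BINS || tcache.counts[idx] >= TCACHE_MAX_PER_BIN)
    return 0;
  struct tcache_entry *e = chunk;
  e->next = tcache.bins[idx];
  tcache.bins[idx] = e;
  tcache.counts[idx]++;
  return 1;
}

/* Try to satisfy an allocation from the thread's cache; NULL means
   fall through to the regular arena path.  */
static void *
tcache_get (size_t size)
{
  size_t idx = tcache_bin_index (size);
  if (idx >= TCACHE_BINS || tcache.bins[idx] == NULL)
    return NULL;
  struct tcache_entry *e = tcache.bins[idx];
  tcache.bins[idx] = e->next;
  tcache.counts[idx]--;
  return e;
}
```

The cap per bin matters: without it a thread could hoard arbitrary amounts of memory that other threads can never reuse.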

> * Should we provide different allocators based on different algorithms (like AIX one)?

Could you elaborate? It would help if we had three types of allocators,
as they need to solve different problems:
1. One for fast 64-bit systems.
2. An embedded one that mostly minimizes memory footprint.
3. One for 32-bit systems: there is an uncomfortable class of machines with
more than 512M of RAM that are still 32-bit. Virtual address space may start
to become scarce there, which cannot happen on 64-bit systems, and embedded
systems do not have that much memory.

> * Should we provide runtime tunables?
Possible. I would also like to add a function that passes hints to malloc.

If we know properties that the compiler knows (the size is constant, the
pointer is always freed at the end of this function so we could use stack
allocation, ...) or that we could get by profiling (transient versus
long-lived objects), then we could make malloc faster.

I could make a user-space malloc profile-feedback optimizer by
redefining malloc as the following macro:

#define malloc(x) ({ \
  static struct malloc_hint __h = GET_HINT (__FILE__, __LINE__); \
  malloc_hint (x, &__h); \
})

Then, in the profiling step, collect data and generate a header with a GET_HINT that looks like:

#define GET_HINT(x, y) { .transient = (x[0] == '/' && x[1] == 'h' && ... && y == 42) ? 1 : 0 }

This will disable the optimizations that gcc does with malloc (currently
only optimizing free (malloc (x)) away), so we would need to propose some
API for that.
> And then we can see whether the ideas fit the current implementation.
> >> One of the things I really need to get into shape is some kind of
> >> benchmarking for malloc, including some simple micro-benchmarks and
> >> some applications that can be used to test allocator performance. I
> >> would not like to embark on integrating a new allocator without some
> >> really solid benchmarks in place.
> >>
> > The problem with microbenchmarks is that you would repeat the errors of
> > allocator research in the 1960s, where allocators were written against
> > microbenchmarks that did not match reality, and real performance suffered.
> >
> > When reading papers about allocators you must check whether they make
> > sense; take this one, which measures fragmentation.
> >
> > This paper looks good in principle: they collect allocation/free data and
> > replay it. Still, there is a flaw in that they eliminate allocator overhead
> > by calling malloc (size - overhead).
> >
> > This makes the allocator behave differently than in the original program
> > and could theoretically change the result. A proper way would be adding a
> > counter that tracks the allocator's overhead and subtracting that in the
> > calculations.
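That counter idea could be sketched as below. The per-chunk overhead constant and all names are assumptions for illustration; a real harness would query the allocator for its actual metadata cost per chunk.

```c
#include <stddef.h>

/* Sketch of the proposed fix: instead of replaying
   malloc (size - overhead), replay the original request sizes and keep
   a running counter of allocator metadata overhead, subtracting it
   from the measured footprint afterwards.  CHUNK_HEADER is an assumed
   per-chunk bookkeeping cost, purely illustrative.  */

#define CHUNK_HEADER 16  /* assumed bytes of metadata per chunk */

static size_t total_overhead;

/* Called by the replay harness for every recorded allocation;
   returns the size to actually pass to malloc.  */
static size_t
record_alloc (size_t requested)
{
  total_overhead += CHUNK_HEADER;
  return requested;   /* the size the original program really asked for */
}

/* Fragmentation = measured footprint minus live data minus metadata.  */
static size_t
fragmentation (size_t heap_footprint, size_t live_bytes)
{
  return heap_footprint - live_bytes - total_overhead;
}
```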
> >
> >
> > For benchmarking malloc you need to measure the running time and total
> > space usage of programs to get reliable results; microbenchmarks make
> > sense only when you are sure that your optimization does not change
> > factors that they cannot measure.
> One idea is to use the tcmalloc defect tracker, which contains some synthetic
> programs that were used to fix some patterns compared to glibc. It may provide
> an initial benchmark set.
Also, collecting traces of gcc, bash and other programs would give better
coverage. I do not know yet how to collect and replay multithreaded traces.
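For the single-threaded case, trace replay could look like the sketch below. The event format and names are hypothetical; collection itself could be done with an LD_PRELOAD interposer writing one record per malloc/free call, and the multithreaded ordering problem mentioned above is left open here.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical trace format: each event is a malloc or free on a
   slot index that identifies the object across its lifetime.  */

enum trace_op { TRACE_MALLOC, TRACE_FREE };

struct trace_event
{
  enum trace_op op;
  size_t slot;   /* identifies the object between malloc and free */
  size_t size;   /* request size, used for TRACE_MALLOC */
};

/* Replay a trace against the allocator under test, returning the peak
   number of live requested bytes (a baseline to compare the measured
   heap footprint against).  slots/sizes must have room for every slot
   index used in the trace.  */
static size_t
replay (const struct trace_event *ev, size_t n, void **slots,
        size_t *sizes)
{
  size_t live = 0, peak = 0;
  for (size_t i = 0; i < n; i++)
    {
      if (ev[i].op == TRACE_MALLOC)
        {
          slots[ev[i].slot] = malloc (ev[i].size);
          sizes[ev[i].slot] = ev[i].size;
          live += ev[i].size;
          if (live > peak)
            peak = live;
        }
      else
        {
          free (slots[ev[i].slot]);
          live -= sizes[ev[i].slot];
        }
    }
  return peak;
}
```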

> To measure program space usage, should we focus on both virtual size and RSS,
> or just RSS? This will require analysis such as peak usage and mean usage,
> I think.

How much virtual size matters depends on how much we want to focus on 32-bit
systems with more than 512M of memory; otherwise a higher virtual size does
not cause many problems.
