This is the mail archive of the mailing list for the glibc project.

Re: [PATCH] [RFC] malloc: Reduce worst-case behaviour with madvise and refault overhead

On Tue, Feb 10, 2015 at 10:37:52AM -0500, Carlos O'Donell wrote:
> On 02/09/2015 05:49 PM, Mel Gorman wrote:
> > I would also welcome suggestions on how madvise could be throttled without
> > the use of counters. The counters are heap-local where I do not expect
> > there will be cache conflicts and the allocation-side counter is only
> > updated after a recent heap shrink to minimise updates.
> > 
> > Initially I worked around this in the kernel but any solution there
> > breaks the existing semantics of MADV_DONTNEED and was rejected. See
> > last paragraph of .
> The truth is that glibc doesn't want to use MADV_DONTNEED in malloc,
> but it's the only interface we have right now that has similar
> semantics to what we need.
> Similarly Kostya from google told me that ASAN also has this problem since 
> it has tons of statistics pages that it will touch soon, but doesn't care
> if they come back as zero or the original data.
> Two years ago I spoke with Motohiro-san and we talked about
> MADV_"Want but don't need", but no mainline solution was present at the
> time.
> The immediate work that comes to mind is the `vrange` work by Minchan Kim.

Yup, I was around for most of the volatile ranges discussions although I
tended to stay away from that particular one.

> I agree with Rik's comment in the above link that we really want MADV_FREE
> in these cases, and it gives a 200% speedup over MADV_DONTNEED (as reported
> by testers using jemalloc patches).

It's true that MADV_FREE will be faster than MADV_DONTNEED in this case
as the pages do not need to be reallocated. That does not make it a
cheap operation and using MADV_FREE thousands of times per second in the
worst-case is still going to hurt. MADV_FREE still incurs a page table
protection update, TLB flush and if it's reused then it's another minor
fault. In the absence of memory pressure, the main gain is that you do
not have to refault, realloc and zero the pages.
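
To make the MADV_DONTNEED cost concrete, here is a minimal sketch for Linux with
an anonymous private mapping. The function name is made up for illustration. It
dirties one page, drops it with MADV_DONTNEED, and reads it back; the read has
to fault in a freshly zeroed page, which is the refault-and-zero work referred
to above. It does not exercise MADV_FREE.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns the first byte of an anonymous private
   page after it was dirtied and then dropped with MADV_DONTNEED, or -1
   if mmap or madvise failed.  */
int
byte_after_dontneed (void)
{
  long pagesz = sysconf (_SC_PAGESIZE);
  assert (pagesz > 0);
  size_t len = (size_t) pagesz;

  unsigned char *p = mmap (NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED)
    return -1;

  /* Fault the page in and dirty it.  */
  memset (p, 0xaa, len);

  /* Hand the page back; the old contents are discarded.  */
  if (madvise (p, len, MADV_DONTNEED) != 0)
    {
      munmap (p, len);
      return -1;
    }

  /* This access faults again and the kernel supplies a zero page.  */
  int byte = p[0];
  munmap (p, len);
  return byte;
}
```

Doing this thousands of times per second on a heap that is about to be reused
is where the overhead comes from.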

> Thus, instead of adding more complex heuristics to glibc's malloc I think

Is it really that complex? It's a basic heuristic comparing two counters
that affects the alloc and free slow paths. The checks are negligible in
comparison to the necessary work if glibc calls into the kernel.
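
As a rough illustration of what a two-counter throttle can look like, the
sketch below keeps heap-local counters and only touches the allocation-side
counter after a recent shrink. It is not the code from the patch: the struct,
the function names and the "half of the shrinks were undone" rule are all
assumptions chosen to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-heap state; none of these names exist in glibc.  */
struct heap_throttle
{
  unsigned long shrinks;   /* Times the heap was handed back.  */
  unsigned long regrows;   /* Times it had to grow again right after.  */
  bool recently_shrunk;    /* Set by a shrink, cleared by the next grow.  */
};

/* Free slow path: allow a shrink only while at most half of the earlier
   shrinks were followed by a regrow (an assumed policy).  */
bool
throttle_should_shrink (const struct heap_throttle *t)
{
  assert (t != NULL);
  return t->regrows * 2 <= t->shrinks;
}

/* Free slow path: record that the heap was shrunk.  */
void
throttle_note_shrink (struct heap_throttle *t)
{
  t->shrinks++;
  t->recently_shrunk = true;
}

/* Allocation slow path: the counter is only updated if the heap was
   shrunk since the last time it grew.  */
void
throttle_note_grow (struct heap_throttle *t)
{
  if (t->recently_shrunk)
    {
      t->regrows++;
      t->recently_shrunk = false;
    }
}
```

A usable version would also need to age the counters so that a heap is not
throttled forever; that part is left out here.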

> we should be testing the addition of MADV_FREE in glibc's malloc and then
> supporting or adjusting the proposal for MADV_FREE or vrange dependent on
> the outcome.

They are orthogonal to each other and both can exist side by side. MADV_FREE
is cheaper than MADV_DONTNEED but avoiding unnecessary system calls is far
cheaper. As for MADV_FREE vs vrange, my expectation is that MADV_FREE will
be merged in the current merge window. The patches are sitting in Andrew
Morton's mmotm tree but he has not sent a merge request yet.

> In the meantime we can talk about mitigating the problems in glibc's
> allocator for systems with old kernels, but it should not be the primary
> solution. In glibc we would conditionalize the changes against the first
> kernel version that included MADV_FREE, and when the minimum supported
> kernel version is higher than that we would remove the code in question.
> My suggested next steps are:
> (a) Test using kernel+MADV_FREE with a hacked glibc malloc that uses
>     MADV_FREE, see how that performs, and inform upstream kernel.

The upstream kernel developers in this case won't care what the performance
is (changelog is already set) although Minchan would care if there were any
bug reports associated with its usage. MADV_FREE is likely to be supported
either way and it's up to the glibc folk whether they want to support it.

> (b) Continue discussion over rate-limiting MADV_DONTNEED as a temporary
>     measure. Before we go any further here, please increase M_TRIM_THRESHOLD
>     in ebizzy and see if that makes a difference? It should make a difference
>     by increasing the threshold at which we trim back to the OS, both sbrk,
>     and mmaps, and thus reduce the MADV_DONTNEED calls at the cost of increased
>     memory pressure. Changing the default though is not a trivial thing, since
>     it could lead to immediate OOM for existing applications that run close to
>     the limit of RAM. Discussion and analysis will be required.

Altering trim threshold does not appear to help but that's not really
surprising considering that it's only applied to the main arena and not the
per-thread heaps. At least that is how I'm interpreting this check:

      if (av == &main_arena) {
        if ((unsigned long)(chunksize(av->top)) >=
            (unsigned long)(mp_.trim_threshold))
          systrim(mp_.top_pad, av);
      } else {
        /* Always try heap_trim(), even if the top chunk is not
           large, because the corresponding heap might go away.  */
        heap_info *heap = heap_for_ptr(top(av));

        assert(heap->ar_ptr == av);
        heap_trim(heap, mp_.top_pad);

In the case of the per-thread heaps, the shrinking decision is based only on
the page size because of this check in heap_trim:

  extra = (top_size - pad - MINSIZE - 1) & ~(pagesz - 1);
  if (extra < (long) pagesz)
    return 0;

  /* Try to shrink. */
  if (shrink_heap (heap, extra) != 0)
    return 0;

You may have noticed that part of the patch takes mp_.mmap_threshold into
account in the shrinking decision but this was the wrong choice. I should
have at least used the trim threshold there.  I had considered just using
trim threshold but worried that the glibc developers would not like it
as it requires manual tuning by the user.
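
For reference, the manual tuning in question is a mallopt call (or the
MALLOC_TRIM_THRESHOLD_ environment variable). A minimal sketch, with a
hypothetical wrapper name:

```c
#include <assert.h>
#include <malloc.h>

/* Hypothetical wrapper: set the trim threshold in bytes.  Returns
   nonzero on success, as mallopt itself does.  */
int
set_trim_threshold (int bytes)
{
  assert (bytes >= 0);
  return mallopt (M_TRIM_THRESHOLD, bytes);
}
```

An application would call something like set_trim_threshold (1024 * 1024) early
in main. As the check quoted earlier shows, this currently only influences
trimming of the main arena.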

If trim threshold was to be used as a workaround then I'd think we'd need
at least this patch. Anyone using the ebizzy benchmark without knowing this
will still report a regression between distros with newer glibcs but at least
a google search might find this thread.

diff --git a/malloc/arena.c b/malloc/arena.c
index 886defb074a2..a78d4835a825 100644
--- a/malloc/arena.c
+++ b/malloc/arena.c
@@ -696,7 +696,7 @@ heap_trim (heap_info *heap, size_t pad)
   top_size = chunksize (top_chunk);
   extra = (top_size - pad - MINSIZE - 1) & ~(pagesz - 1);
-  if (extra < (long) pagesz)
+  if (extra < (long) mp_.trim_threshold)
     return 0;
   /* Try to shrink. */
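
To show what the one-line change does numerically, the sketch below reproduces
the extra calculation from heap_trim as a standalone function and compares the
two conditions. The helper names are made up, and the values used in the note
after the code (4 KiB pages, pad of 0, MINSIZE of 32, trim threshold of
128 KiB) are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same arithmetic as heap_trim: bytes above pad + MINSIZE in the top
   chunk, rounded down to a page multiple.  pagesz must be a power of
   two and top_size must exceed pad + minsize.  */
long
heap_trim_extra (size_t top_size, size_t pad, size_t minsize, size_t pagesz)
{
  assert (pagesz != 0 && (pagesz & (pagesz - 1)) == 0);
  return (long) ((top_size - pad - minsize - 1) & ~(pagesz - 1));
}

/* Original condition: shrink once at least one page can go.  */
bool
shrinks_with_pagesz_check (long extra, size_t pagesz)
{
  return extra >= (long) pagesz;
}

/* Patched condition: shrink only once trim_threshold bytes can go.  */
bool
shrinks_with_threshold_check (long extra, size_t trim_threshold)
{
  return extra >= (long) trim_threshold;
}
```

With a 64 KiB top chunk, extra works out to 61440 bytes (15 pages). The
original check shrinks the heap at that point; with a 128 KiB trim threshold
the patched check leaves it alone.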
