This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH] [RFC] malloc: Reduce worst-case behaviour with madvise and refault overhead
- From: Mel Gorman <mgorman at suse dot de>
- To: Carlos O'Donell <carlos at redhat dot com>
- Cc: Rik van Riel <riel at redhat dot com>, KOSAKI Motohiro <kosaki dot motohiro at gmail dot com>, Konstantin Serebryany <kcc at google dot com>, Minchan Kim <minchan dot kim at gmail dot com>, libc-alpha at sourceware dot org
- Date: Tue, 10 Feb 2015 20:24:48 +0000
- Subject: Re: [PATCH] [RFC] malloc: Reduce worst-case behaviour with madvise and refault overhead
- Authentication-results: sourceware.org; auth=none
- References: <20150209140608 dot GD2395 at suse dot de> <54D91E06 dot 7060603 at redhat dot com> <20150209224947 dot GA21275 at suse dot de> <54DA25D0 dot 3050501 at redhat dot com> <20150210165414 dot GF2395 at suse dot de> <54DA5599 dot 2070804 at redhat dot com>
On Tue, Feb 10, 2015 at 02:01:45PM -0500, Carlos O'Donell wrote:
> >> Thus, instead of adding more complex heuristics to glibc's malloc I think
> > Is it really that complex? It's a basic heuristic comparing two counters
> > that affects the alloc and free slow paths. The checks are negligible in
> > comparison to the necessary work if glibc calls into the kernel.
> It is complex. It's a time-based integral value that is going to
> complicate debugging, reproducers, and users will need to be educated
> as to how it works. We will likely also want a tunable for it such that
> if it negatively impacts workloads we can disable it. The tunable though
> adds runtime overhead, and that's all added complexity.
Ok, I cannot argue with that logic.
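To make the trade-off concrete, here is a minimal sketch of the kind of counter-based heuristic being debated: compare how often madvised regions get refaulted against how often they were advised away, and skip the syscall when it is mostly wasted work. All names are hypothetical illustrations, not glibc internals.

```c
#include <stddef.h>

/* Hypothetical per-arena counters: regions handed back to the kernel
   via madvise, and regions that were re-allocated (refaulted) shortly
   after being handed back.  */
struct arena_stats
{
  size_t frees_advised;
  size_t refaults;
};

/* Skip the madvise call when a majority of previously advised regions
   were refaulted, i.e. the syscall plus the later page faults cost
   more than they saved.  The "+ 1" keeps the cold-start case (no
   history yet) on the madvise path.  */
static int
should_madvise (const struct arena_stats *s)
{
  return s->refaults * 2 < s->frees_advised + 1;
}
```

The check itself is two loads and a compare on the free slow path, which is the "negligible in comparison" argument; the counter decay over time is where the debugging and tuning complexity that Carlos objects to would come in.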
> Please keep in mind that glibc is a general purpose allocator which aims
> to be good at the average case. Unfortunately to be honest we lack any data
> with which to measure "average" workloads, which is why glibc as a community
> is looking at whole system benchmarks to gather workload data and integrate
> it directly into our microbenchmark (see benchtests/) or a future whole system
> benchmark framework.
Understood. For what it's worth, the kernel memory management has had
numerous similar discussions in the past.
> So while the technical complexity of the fix is low, that's not the only
> factor I measure when looking at a fix.
> A high quality contribution for changes to malloc should IMO include a
> microbenchmark or two to flesh out the matching workload against which the
> change was tested (see benchtests/README).
Ok. In this case, a benchmark exists but I would be hesitant to include it
in benchtests without better evidence that a number of workloads actually
suffer this problem. mariadb hits this during database initialisation but
it's negligible in comparison to the overall cost of that operation.
> >> we should be testing the addition of MADV_FREE in glibc's malloc and then
> >> supporting or adjusting the proposal for MADV_FREE or vrange dependent on
> >> the outcome.
> > They are orthogonal to each other and both can exist side by side. MADV_FREE
> > is cheaper than MADV_DONTNEED but avoiding unnecessary system calls is far
> > cheaper. As for MADV_FREE vs vrange, my expectation is that MADV_FREE will
> > be merged in the current merge window. The patches are sitting in Andrew
> > Morton's mmotm tree but he has not sent a merge request yet.
> They are orthogonal in a technical sense, but they solve the same problem:
> Make glibc's malloc faster for workload X.
> My opinion is that MADV_FREE is a technically superior solution to rate
> limiting madvise() calls from malloc. Rate limiting or dynamic heuristics
> have been difficult to tune and support over the last decade, and I am
> very hesitant to accept more without serious discussion, justification,
> and data.
> >> In the meantime we can talk about mitigating the problems in glibc's
> >> allocator for systems with old kernels, but it should not be the primary
> >> solution. In glibc we would conditionalize the changes against the first
> >> kernel version that included MADV_FREE, and when the minimum supported
> >> kernel version is higher than that we would remove the code in question.
> >> My suggested next steps are:
> >> (a) Test using kernel+MADV_FREE with a hacked glibc malloc that uses
> >> MADV_FREE, see how that performs, and inform upstream kernel.
> > The upstream kernel developers in this case won't care what the performance
> > is (changelog is already set) although Minchan would care if there were any
> > bug reports associated with its usage. MADV_FREE is likely to be supported
> > either way and it's up to the glibc folk whether they want to support it.
> We do want to support it. I also believe that upstream does care.
If glibc hits bug reports when MADV_FREE is used, they will *definitely* care.
> >> (b) Continue discussion over rate-limiting MADV_DONTNEED as a temporary
> >> measure. Before we go any further here, please increase M_TRIM_THRESHOLD
> >> in ebizzy and see if that makes a difference? It should make a difference
> >> by increasing the threshold at which we trim back to the OS, both sbrk,
> >> and mmaps, and thus reduce the MADV_DONTNEED calls at the cost of increased
> >> memory pressure. Changing the default though is not a trivial thing, since
> >> it could lead to immediate OOM for existing applications that run close to
> >> the limit of RAM. Discussion and analysis will be required.
> > Altering trim threshold does not appear to help but that's not really
> > surprising considering that it's only applied to the main arena and not the
> > per-thread heaps. At least that is how I'm interpreting this check.
> I agree. It looks like a bug to me. The arenas should from first principles
> all behave relatively the same, regardless of the underlying storage e.g.
> brk vs. mmap.
Cool. By coincidence it happens to solve the problem for this
particular workload but the mileage will vary considerably. At least it
can be tuned around without disabling per-thread heaps entirely.
> > You may have noticed that part of the patch takes mp_.mmap_threshold into
> > account in the shrinking decision but this was the wrong choice. I should
> > have at least used the trim threshold there. I had considered just using
> > trim threshold but worried that the glibc developers would not like it
> > as it required manual tuning by the user.
> Please do not hesitate to ask an opinion on the list, like you are doing
> right now with your RFC. Just ask if there is a preference, choice, or
> rationale for using or not using trim_threshold.
> I see algorithm development going through various phases:
> (a) Implement algorithm, add tunables.
> (b) Evaluate algorithm against workloads and tweak tunables.
> (c) Develop dynamic tuning and test.
> (d) Replace tunables with dynamic tuning.
> With glibc's malloc we've been at (b) for years. I hope that with some
> of the tunables unification work we're doing that we can make (b),
> (c) and (d) easier for the core developers.
Thanks for the clarification.
> > If trim threshold was to be used as a workaround then I'd think we'd need
> > at least this patch. Anyone using the ebizzy benchmark without knowing this will
> > still report a regression between distros with newer glibcs but at least
> > a google search might find this thread.
> The trimming threshold is another solution.
> Limiting the trimming should reduce the number of MADV_DONTNEED calls,
> and make all of the arenas behave more uniformly.
> > diff --git a/malloc/arena.c b/malloc/arena.c
> > index 886defb074a2..a78d4835a825 100644
> > --- a/malloc/arena.c
> > +++ b/malloc/arena.c
> > @@ -696,7 +696,7 @@ heap_trim (heap_info *heap, size_t pad)
> > }
> > top_size = chunksize (top_chunk);
> > extra = (top_size - pad - MINSIZE - 1) & ~(pagesz - 1);
> > - if (extra < (long) pagesz)
> > + if (extra < (long) mp_.trim_threshold)
> > return 0;
> > /* Try to shrink. */
> This patch looks correct to me. I think it's a bug that non-main arenas
> don't use the trim threshold. All arenas should behave the same way,
> and thus make them easier to debug and use.
> How does the patch impact your testing?
ebizzy gets fixed by coincidence. Another test case that has a mix of
mallocs and frees shows a reduction in the number of calls to madvise by
default, and it's possible to eliminate them entirely.
> I might try this out in Fedora Rawhide for a few months.
> Would you post another patch with just this change and your test results?
No problem. It'll be posted within the next few minutes. The changelog
may be considered excessive but feel free to trim it (ba dum tisch, I'm
here all week).