This is the mail archive of the libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] Add malloc micro benchmark
- From: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>
- To: Carlos O'Donell <carlos at redhat dot com>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>
- Cc: nd <nd at arm dot com>
- Date: Mon, 18 Dec 2017 15:18:47 +0000
- Subject: Re: [PATCH] Add malloc micro benchmark
- References: <DB6PR0801MB205334E9F59632F30A17202683390@DB6PR0801MB2053.eurprd08.prod.outlook.com>,<firstname.lastname@example.org>
Carlos O'Donell wrote:
Thanks for the review!
> This test is a long time coming and is a great idea.
> My "big" question here is: What are we trying to model?
> Do we want to prove that the single threaded optimizations you
> added are helping a given size class of allocations?
Yes that is the main goal of the benchmark. It models the allocation
pattern of a few benchmarks which were reported as being slow
despite the new tcache (which didn't show any gains).
When the tcache was configured to be larger there was a major
speedup, suggesting that the tcache doesn't work on patterns with
a high number of (de)allocations of similar sized blocks. Since DJ
didn't seem keen on increasing the tcache size despite it showing
major gains across a wide range of benchmarks, I decided to fix
the performance for the single-threaded case at least. It's now 2.5x
faster on a few server benchmarks (of course the next question is
whether tcache is actually useful in its current form).
> You are currently modeling a workload that has increasing
> memory size requests and in some ways this is an odd workload
> that has high external fragmentation characteristics. For example
> after allocating lots of 256 byte blocks we move on to 1024 byte
> blocks, with the latter being unusable unless we coalesce.
I'm assuming coalescing works as expected. If it doesn't, it would
be a nasty bug.
> I *wish* we could test main_arena vs. threaded arena, since they
> have different code and behave differently e.g. sbrk vs. mmap'd
I'd have to check how easy it is to force it to use the thread arena.
The whole thing is just crazily weird, with too many different code
paths and possibilities. It seems much easier just to always use
thread arenas, and perhaps use sbrk only if there is some serious
advantage over mmap. Also it appears all the values are set to
what was perhaps reasonable 10-20 years ago, not today. When
a small server has 128GB, there is absolutely no reason to worry
about returning 128KB to the OS as quickly as possible...
> You need to make this robust against env vars changing malloc
> behaviour. You should use mallopt to change some parameters.
You mean setting the tcache size explicitly (maybe even switching off)?
>> Note something very bad happens for the larger allocations, there
>> is a 25x slowdown from 25 to 400 allocations of 4KB blocks...
> Keep in mind you are testing the performance of sbrk here. In a threaded
> arena, the non-main_arena mmap's a 64MiB heap (on 64-bit) and then
draws allocations from it. So in some ways main_arena is expensive,
> but both have to pay a page-touch cost...
> For each 4KiB block you touch the block to write the co-located metadata
> and that forces the kernel to give you a blank page, which you then do
> nothing with. Then you repeat the above again.
> For all other sizes you amortize the cost of the new page among
> several allocations.
> Do you have any other explanation?
Well that looks like a reasonable explanation, but it points at a serious
performance bug - I believe we use MADV_DONTNEED, which on Linux
immediately discards the pages, so they all have to be reallocated and
zero-filled again on the next access... This is the sort of case where you
need to be very careful to amortize the cost over many allocations or a
long elapsed time, if at all (many other allocators never give pages back).
> At some point you will hit the mmap threshold and the cost of the
> allocation will skyrocket as you have to call mmap.
That only happens on huge allocations (much larger than 4KB), or when
you run out of sbrk space (unlikely).
> In glibc we have:
> tcache -> fastbins -> smallbins -> largebins -> unsorted -> mmap
> If you proceed through from small allocations to larger allocations
> you will create chunks that cannot be used by future allocations.
> In many cases this is a worst case performance bottleneck. The
> heap will contain many 256 byte allocations but these cannot service
> the 1024 bytes, that is unless consolidation has been run. So this
> tests the consolidation as much as anything else, which might not
> trigger because of the free thresholds required.
If consolidation doesn't work, that's a serious bug. However, allocation
performance should not be affected either way - in a real application
those small blocks might still be allocated. As long as consolidation
runs quickly (it's generally a small percentage in profiles), it won't
affect the results.
> So what are we trying to model here?
> If we want to look at the cost of independent size class allocations
> then we need a clean process and allocate only a given size, and look
> at performance across the number of allocations.
That's certainly feasible if we keep the number of sizes small (fewer
than in the list below). It should be easy to reuse the bench-malloc-thread.c
makefile magic to run the same binary with multiple sizes.
> I would also have much finer grained allocations by powers of 2.
> 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096 etc. You want
> to see what happens for the allocations which are:
> Would this serve better to show that your single threaded malloc
> changes were helpful for a given size class?
Well I can easily add some of the above sizes, it's highly configurable.
I don't think there will be much difference with the existing sizes though.
> You need to use mallopt to make sure the user's environment
> did not set MALLOC_MMAP_THRESHOLD_ to a value lower than your
> maximum allocation size.
I don't think that is possible given the largest allocation size is 4KB.