This is the mail archive of the
mailing list for the glibc project.
Re: [PATCH v2] benchtests: Add malloc microbenchmark
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Siddhesh Poyarekar <siddhesh at redhat dot com>
- Cc: Will Newton <will dot newton at linaro dot org>, libc-alpha <libc-alpha at sourceware dot org>
- Date: Mon, 9 Jun 2014 22:33:26 +0200
- Subject: Re: [PATCH v2] benchtests: Add malloc microbenchmark
- References: <1397737835-15868-1-git-send-email-will dot newton at linaro dot org> <20140530094508 dot GQ12497 at spoyarek dot pnq dot redhat dot com> <CANu=Dmh=pdfj088kHQ9-eqaFmXKQN=bkQMjrC851EHjc_G3sPg at mail dot gmail dot com> <20140609163753 dot GI24899 at spoyarek dot pnq dot redhat dot com>
On Mon, Jun 09, 2014 at 10:07:53PM +0530, Siddhesh Poyarekar wrote:
> On Mon, Jun 09, 2014 at 04:14:35PM +0100, Will Newton wrote:
> > > A maximum of 32K only tests arena allocation performance. This is
> > > fine for now since malloc+mmap performance is as interesting. What is
> > There are at least two axes we are interested in - how performance
> > scales with the number of threads and how performance scales with the
> > allocation size. For thread performance (which this benchmark is
> > about) the larger allocations are not so interesting - typically their
> > locking overhead is in the kernel rather than userland and in terms of
> > real-world application performance it's just not as likely to be a
> > bottleneck as small allocations. We have to be pragmatic in which
> > choices we make as the full matrix of threads versus allocation sizes
> > would be pretty huge.
> Heh, I noticed my typo now - I meant to say that malloc+mmap
> performance is *not* as interesting :)
The problem is that this benchmark does not measure multithreaded
performance well. Just spawning many threads does not say much; my guess
is that locking will quickly cause convergence to a state where each
core runs a thread with its own arena. Also it does not measure the
hard case when you allocate memory in one thread.
I looked at the multithreaded benchmark and it has additional flaws:
Big variance: running time varies by around 10% across iterations,
depending on how the kernel schedules the threads. Running threads and
measuring time after you join them measures the slowest thread, so at
the end some cores
Bad units: when I run the benchmark with one thread, the mean is:
However, when we run 32 threads it looks as if malloc speeds up
around three times:
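
To illustrate the two flaws above (this is not the output of the
original run), here is a minimal sketch of how each thread could time
its own loop instead of relying only on the time measured after the
join. The names thread_result, run_threads, per_call_ns and
wall_per_call_ns are made up for this example, as are MAX_THREADS and
the size formula; none of them come from the patch under review.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

#define MAX_THREADS 64  /* arbitrary cap for this sketch */

struct thread_result {
  size_t iters;       /* malloc/free pairs this thread performs */
  double elapsed_ns;  /* time this thread spent in its own loop */
};

static double now_ns(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (double) ts.tv_sec * 1e9 + (double) ts.tv_nsec;
}

static void *worker(void *arg)
{
  struct thread_result *r = arg;
  double start = now_ns();
  for (size_t i = 0; i < r->iters; i++) {
    char *p = malloc(64 + i % 1024);
    if (p != NULL)
      *(volatile char *) p = 0;  /* touch the block so the pair is kept */
    free(p);
  }
  r->elapsed_ns = now_ns() - start;
  return NULL;
}

/* Run nthreads workers and fill res[0..nthreads-1].  Returns the wall
   time from before the first create to after the last join, which is
   bounded below by the slowest thread.  */
double run_threads(int nthreads, size_t iters, struct thread_result *res)
{
  pthread_t tid[MAX_THREADS];
  assert(nthreads > 0 && nthreads <= MAX_THREADS);
  double start = now_ns();
  for (int i = 0; i < nthreads; i++) {
    res[i].iters = iters;
    res[i].elapsed_ns = 0.0;
    int rc = pthread_create(&tid[i], NULL, worker, &res[i]);
    assert(rc == 0);
    (void) rc;
  }
  for (int i = 0; i < nthreads; i++)
    pthread_join(tid[i], NULL);
  return now_ns() - start;
}

/* Mean cost of one malloc/free pair as seen by a single thread.  */
double per_call_ns(const struct thread_result *r)
{
  return r->elapsed_ns / (double) r->iters;
}

/* Wall time divided by all calls from all threads.  This is a
   throughput figure: it shrinks as threads are added even when each
   individual call does not get any faster.  */
double wall_per_call_ns(double wall_ns, int nthreads, size_t iters)
{
  return wall_ns / ((double) nthreads * (double) iters);
}
```

Reporting the minimum, maximum and mean of per_call_ns over the threads
shows the spread that a single post-join number hides, and keeping
wall_per_call_ns as a separate throughput figure avoids reading it as
"malloc got faster".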
> > So I guess I should probably also write a benchmark for allocation
> > size for glibc as well...
> Yes, it would be a separate benchmark and probably would need some
> specific allocation patterns rather than random sizes. Of course
> choosing allocation patterns is not going to be easy.
No, that was a benchmark that I posted which measured exactly what
happens at given sizes.
> > > I don't know how useful max_rss would be since we're only doing a
> > > malloc and never really writing anything to the allocated memory.
> > > Smaller sizes may probably result in actual page allocation since we
> > > write to the chunk headers, but probably not so for larger sizes.
> > Yes, it is slightly problematic. What you probably want to do is
> > zero all the memory and measure RSS at that point but it would slow
> > down the benchmark and spend lots of time in memset instead. At the
> > moment it tells you how many pages are taken up by book-keeping but
> > not how many of those pages your application would touch anyway.
> Oh I didn't mean to imply that we zero pages and try to get a more
> accurate RSS value. My point was that we could probably just do away
> with it completely because it doesn't really tell us much - I can't
> see how pages taken up by book-keeping would be useful.
> However if you do want to show resource usage, then address space
> usage (VSZ) might show scary numbers due to the per-thread arenas, but
> they would be much more representative. Also, it might be useful to
> see how address space usage scales with threads, especially for
Still, this would be worse than useless, as it would vary wildly from
real behaviour (for example, it is typical that when allocations happen
in quick succession they will likely also be deallocated in quick
succession), and that would cause us to implement something that
actually increases memory usage. It happened in the 70's so do not repeat this
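
To make the RSS point from the quoted discussion concrete, here is a
small sketch that compares how much the peak RSS grows when a large
block is only allocated versus when every page of it is written. It
assumes getrusage() reports ru_maxrss (kilobytes on Linux); the helper
names peak_rss and peak_rss_growth are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

/* Peak resident set size of this process, in the units that
   getrusage() uses (kilobytes on Linux).  */
long peak_rss(void)
{
  struct rusage ru;
  int rc = getrusage(RUSAGE_SELF, &ru);
  assert(rc == 0);
  (void) rc;
  return ru.ru_maxrss;
}

/* Allocate size bytes, optionally write one byte per page, and return
   how much the peak RSS grew while the block was live.  */
long peak_rss_growth(size_t size, int touch)
{
  long page = sysconf(_SC_PAGESIZE);
  assert(page > 0);
  long before = peak_rss();
  volatile char *p = malloc(size);
  assert(p != NULL);
  if (touch)
    for (size_t off = 0; off < size; off += (size_t) page)
      p[off] = 1;  /* volatile store, so the write is not elided */
  long after = peak_rss();
  free((void *) p);
  return after - before;
}
```

Calling peak_rss_growth first with touch set to 0 and then with touch
set to 1 for the same size would typically show little growth in the
first case and growth close to the block size in the second, which is
why max_rss from a malloc-only loop says little about real usage.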
> > No I haven't looked into that, so far I have been treating malloc as a
> > black box and I'm hoping not to tailor the benchmark too far to one
> > implementation or another.
> I agree that the benchmark should not be tailored to the current
> implementation, but then this behaviour would essentially be another
> set of inputs. Simply increasing the maximum size from 32K to about
> 128K (that's the initial threshold for mmap anyway) might result in
> that behaviour being triggered more frequently.
For malloc, benchmarks need to satisfy some conditions to be
meaningful. When you compare different implementations, one could use a
different memory allocation pattern. That could cause additional cache
misses that dominate performance, but you do not measure them in the
benchmark. Treating malloc as a black box kind of defeats the purpose.
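
One possible way to let such cache effects show up, sketched here under
my own assumptions rather than taken from any posted benchmark, is to
time the use of the memory together with the malloc and free calls.
The function name alloc_and_use and its parameters are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Allocate n blocks of size bytes, fill block i with the byte value
   (i & 0xff), read every byte back, then free the blocks.  Returns the
   sum of all bytes read and stores the time for the whole
   allocate/use/free cycle in *elapsed_ns.  */
unsigned long alloc_and_use(size_t n, size_t size, double *elapsed_ns)
{
  struct timespec t0, t1;
  unsigned long sum = 0;
  assert(n > 0 && size > 0);
  unsigned char **blocks = malloc(n * sizeof *blocks);
  assert(blocks != NULL);

  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (size_t i = 0; i < n; i++) {
    blocks[i] = malloc(size);
    assert(blocks[i] != NULL);
    memset(blocks[i], (int) (i & 0xff), size);
  }
  /* Reading the blocks back is where the placement chosen by the
     allocator affects the number of cache misses.  */
  for (size_t i = 0; i < n; i++)
    for (size_t j = 0; j < size; j++)
      sum += blocks[i][j];
  for (size_t i = 0; i < n; i++)
    free(blocks[i]);
  clock_gettime(CLOCK_MONOTONIC, &t1);

  free(blocks);
  *elapsed_ns = (double) (t1.tv_sec - t0.tv_sec) * 1e9
                + (double) (t1.tv_nsec - t0.tv_nsec);
  return sum;
}
```

Two allocators that spend the same time inside malloc can still differ
in elapsed_ns here, because the read loop pays for whatever layout each
of them produced.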