This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [PATCH v2] benchtests: Add malloc microbenchmark
- From: Will Newton <will dot newton at linaro dot org>
- To: Ondřej Bílka <neleai at seznam dot cz>
- Cc: Siddhesh Poyarekar <siddhesh at redhat dot com>, libc-alpha <libc-alpha at sourceware dot org>
- Date: Tue, 10 Jun 2014 08:47:36 +0100
- Subject: Re: [PATCH v2] benchtests: Add malloc microbenchmark
- Authentication-results: sourceware.org; auth=none
- References: <1397737835-15868-1-git-send-email-will dot newton at linaro dot org> <20140530094508 dot GQ12497 at spoyarek dot pnq dot redhat dot com> <CANu=Dmh=pdfj088kHQ9-eqaFmXKQN=bkQMjrC851EHjc_G3sPg at mail dot gmail dot com> <20140609163753 dot GI24899 at spoyarek dot pnq dot redhat dot com> <20140609203326 dot GA5396 at domone dot podge>
On 9 June 2014 21:33, Ondřej Bílka <email@example.com> wrote:
> On Mon, Jun 09, 2014 at 10:07:53PM +0530, Siddhesh Poyarekar wrote:
>> On Mon, Jun 09, 2014 at 04:14:35PM +0100, Will Newton wrote:
>> > > A maximum of 32K only tests arena allocation performance. This is
>> > > fine for now since malloc+mmap performance is as interesting. What is
>> > There are at least two axes we are interested in - how performance
>> > scales with the number of threads and how performance scales with the
>> > allocation size. For thread performance (which this benchmark is
>> > about) the larger allocations are not so interesting - typically their
>> > locking overhead is in the kernel rather than userland and in terms of
>> > real world application performance it's just not as likely to be a
>> > bottleneck as small allocations. We have to be pragmatic in which
>> > choices we make as the full matrix of threads versus allocation sizes
>> > would be pretty huge.
>> Heh, I noticed my typo now - I meant to say that malloc+mmap
>> performance is *not* as interesting :)
> Problem is that this benchmark does not measure multithreaded
> performance well. Just spawning many threads does not say much; my guess
> is that locking will quickly cause convergence to a state where each
> core runs a thread with a separate arena. Also it does not measure the
> hard case where you allocate memory in one thread and free it in another.
> I looked on multithread benchmark and it has additional flaws:
> Big variance: running time varies by around 10% across iterations,
> depending on how the kernel schedules the threads. Running threads and
> measuring time after you join them measures the slowest thread, so at
> the end some cores are idle.
Thanks for the suggestion, I will look into this.
> Bad units: when I run the benchmark with one thread the mean is:
> "mean": 91.605,
> However when we run 32 threads then it looks like it speeds malloc
> up around three times:
> "mean": 28.5883,
What is wrong with that? I assume you have a multi-core system, would
you not expect more threads to have higher throughput?
>> > So I guess I should probably also write a benchmark for allocation
>> > size for glibc as well...
>> Yes, it would be a separate benchmark and probably would need some
>> specific allocation patterns rather than random sizes. Of course
>> choosing allocation patterns is not going to be easy.
> No, that was a benchmark that I posted which measured exactly what
> happens at given sizes.
>> > > I don't know how useful max_rss would be since we're only doing a
>> > > malloc and never really writing anything to the allocated memory.
>> > > Smaller sizes may probably result in actual page allocation since we
>> > > write to the chunk headers, but probably not so for larger sizes.
>> > Yes, it is slightly problematic. What you probably want to do is
>> > zero all the memory and measure RSS at that point but it would slow
>> > down the benchmark and spend lots of time in memset instead. At the
>> > moment it tells you how many pages are taken up by book-keeping but
>> > not how many of those pages your application would touch anyway.
>> Oh I didn't mean to imply that we zero pages and try to get a more
>> accurate RSS value. My point was that we could probably just do away
>> with it completely because it doesn't really tell us much - I can't
>> see how pages taken up by book-keeping would be useful.
>> However if you do want to show resource usage, then address space
>> usage (VSZ) might show scary numbers due to the per-thread arenas, but
>> they would be much more representative. Also, it might be useful to
>> see how address space usage scales with threads, especially for
> Still this would be worse than useless, as it would vary wildly from real
> behaviour (for example, it is typical that allocations made in quick
> succession will likely also be deallocated in quick succession), and that
> could cause us to implement something that actually increases memory
> usage. It happened in the 70's, so do not repeat this mistake.
>> > No I haven't looked into that, so far I have been treating malloc as a
>> > black box and I'm hoping not to tailor the benchmark too far to one
>> > implementation or another.
>> I agree that the benchmark should not be tailored to the current
>> implementation, but then this behaviour would essentially be another
>> set of inputs. Simply increasing the maximum size from 32K to about
>> 128K (that's the initial threshold for mmap anyway) might result in
>> that behaviour being triggered more frequently.
> For malloc you need benchmarks to satisfy some conditions to be
> meaningful. When you compare different implementations, one could use a
> different memory allocation pattern. That could cause additional cache
> misses that dominate performance, but you do not measure that in the
> benchmark. Treating malloc as a black box kinda defeats the purpose.
--
Will Newton
Toolchain Working Group, Linaro