This is the mail archive of the libc-alpha mailing list for the glibc project.
Re: [PATCH v3] benchtests: Add malloc microbenchmark
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Will Newton <will dot newton at linaro dot org>
- Cc: libc-alpha <libc-alpha at sourceware dot org>
- Date: Wed, 25 Jun 2014 18:06:09 +0200
- Subject: Re: [PATCH v3] benchtests: Add malloc microbenchmark
- Authentication-results: sourceware.org; auth=none
- References: <1403196368-26785-1-git-send-email-will dot newton at linaro dot org> <20140625092926 dot GA28367 at domone dot podge> <CANu=DmgY1ZZODXSMhnM4ajNQzv3YJSOH_6EgCbcXtnoymPRt7g at mail dot gmail dot com>
On Wed, Jun 25, 2014 at 10:39:24AM +0100, Will Newton wrote:
> On 25 June 2014 10:29, Ondřej Bílka <firstname.lastname@example.org> wrote:
> > On Thu, Jun 19, 2014 at 05:46:08PM +0100, Will Newton wrote:
> >> Add a microbenchmark for measuring malloc and free performance with
> >> varying numbers of threads. The benchmark allocates and frees buffers
> >> of random sizes in a random order and measures the overall execution
> >> time and RSS. Variants of the benchmark are run with 1, 4, 8 and
> >> 16 threads.
> >> The random block sizes used follow an inverse square distribution
> >> which is intended to mimic the behaviour of real applications which
> >> tend to allocate many more small blocks than large ones.
> >> ChangeLog:
> >> 2014-06-19 Will Newton <email@example.com>
> >> * benchtests/Makefile (bench-malloc): Add malloc thread
> >> scalability benchmark.
> >> * benchtests/bench-malloc-threads.c: New file.
> >> ---
> >> benchtests/Makefile | 20 ++-
> >> benchtests/bench-malloc-thread.c | 299 +++++++++++++++++++++++++++++++++++++++
> >> 2 files changed, 316 insertions(+), 3 deletions(-)
> >> create mode 100644 benchtests/bench-malloc-thread.c
> >> Changes in v3:
> >> - Single executable that takes a parameter for thread count
> >> - Run for a fixed duration rather than a fixed number of loops
> >> - Other fixes in response to review suggestions
> >> Example of a plot of the results versus tcmalloc and jemalloc on
> >> a 4 core i5:
> >> http://people.linaro.org/~will.newton/bench-malloc-threads.png
> > That graph looks interesting. It is a little weird that with glibc the
> > 2 and 3 thread cases take nearly the same time but the 4 thread case
> > does not. For the other allocators the dependency is linear. How do you
> > explain that?
> I expected to potentially see two inflection points in the curve. One
> due to the single thread optimization in glibc that will make the
> single threaded case disproportionally faster. I also expected to see
> some kind of indication that I had run out of free CPU cores (and thus
> context switch overhead increases). I ran the test on a 4 core i5
> (hyper-threaded). I believe that's what is visible here:
> 1. Single threaded disproportionally faster
> 2. Curve gradient is lower from 1 -> number of cores (and this seems
> to be visible in at least tcmalloc as well)
> 3. Curve gradient increases and remains roughly constant above number of cores
Still, that does not explain the 2 thread case; what is your explanation?
Also, the single thread case is fast because you have special-cased the one
thread scenario. If you used a created thread instead of the main thread, you
would get the same performance as in the multithreaded scenario. Likewise, if
you decided to use the main thread as one of the benchmarked threads, you
would get a different performance graph. In my opinion, if you are after the
multithreaded characteristics, it is easier to just omit that distinction.
I still think that this patch is not useful: if you instead did a fork
followed by thread creation so as to use a non-main arena, you would get the
same performance characteristics. And measuring a single-threaded program
that switches threads would give us exactly the same information.
As far as the graph is concerned, it will be complicated more by hardware
quirks than by problems in the implementation.
One culprit here is hyperthreading, which causes some threads to share a
core and both to run slower, changing the shape of the curve between 4 and 8
threads.
There could be a second problem: if you decided to use a working set larger
than the L2 cache size, that would also skew the results, but it does not
seem to be