This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: glibc benchmarks' results can be unreliable for short runtimes (on Aarch64)
- From: Anton Youdkevitch <anton dot youdkevitch at bell-sw dot com>
- To: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>, "libc-alpha at sourceware dot org" <libc-alpha at sourceware dot org>
- Cc: nd <nd at arm dot com>
- Date: Mon, 24 Jun 2019 09:52:53 +0200
- Subject: Re: glibc benchmarks' results can be unreliable for short runtimes (on Aarch64)
- References: <VI1PR0801MB2127DC882459BC63DA318B3983E70@VI1PR0801MB2127.eurprd08.prod.outlook.com>
Wilco,
On 6/21/2019 2:01 PM, Wilco Dijkstra wrote:
> Hi Anton,
>
>> Recently I was doing an optimized implementation of memcpy/memmove for
>> TX2. While running internal microbenchmarks I noticed that for the
>> "fast" benchmarks (~10ms runtime) the results vary quite
>> significantly across runs (5%-20%). It is possible to find two runs
>> that show my implementation actually significantly worsened the
>> performance. Also, there are (quite common) cases where the "baseline"
>> implementation gets worse and the "tested" implementation gets better
>> (or vice versa) across runs.
>
> Yes, this is certainly possible for any short-running benchmark, which
> is why I recently increased the minimum iteration count 128 times. I
> ran it on a fixed-frequency server and got quite stable results.
> However, if your CPU does frequency scaling, then 10ms is likely too
> short for consistent results.
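One common way to get the fixed-frequency behaviour Wilco describes is to pin the cpufreq governor before benchmarking. A minimal sketch, assuming a Linux machine with the cpupower tool installed (governor names and availability vary by kernel, platform, and firmware):

```shell
# Show the current governor and the frequency range the kernel reports.
cpupower frequency-info

# Pin all CPUs to the "performance" governor so the clock does not
# scale down between benchmark iterations (requires root).
sudo cpupower frequency-set -g performance
```

On systems without cpupower, the same effect can usually be had by writing "performance" to the per-CPU scaling_governor files under /sys/devices/system/cpu/.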
I think we can assume frequency throttling is the general rule
these days.
>> The first solution to this that comes to mind is to increase the
>> runtime of the "fast" benchmarks. If I increase the bench-memcpy
>> runtime 32x (the actual runtime on TX2 would be ~2s), the results for
>> a particular implementation are always within a 5% range. The effect
>> where one benchmark gains and another loses across different runs,
>> while less pronounced, still remains. So, are there any reasons not
>> to bump up the runtime of the "fast" benchmarks to 1s-2s?
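The "within 5%" criterion above can be checked mechanically by comparing the spread of per-run results against their mean. A small sketch, using hypothetical timings rather than actual benchmark output:

```python
import statistics

def spread_percent(times):
    """Return the max-to-min spread of run times as a percentage of the mean."""
    mean = statistics.mean(times)
    return (max(times) - min(times)) / mean * 100.0

# Hypothetical per-run timings (seconds) for a short (~10ms) benchmark...
short_runs = [0.0102, 0.0111, 0.0098, 0.0119]
# ...and for the same benchmark with 32x the iteration count.
long_runs = [0.335, 0.341, 0.338, 0.339]

print(f"short: {spread_percent(short_runs):.1f}% spread")
print(f"long:  {spread_percent(long_runs):.1f}% spread")
```

With numbers like these, the short runs show a spread well above 5% while the longer runs stay comfortably inside it, which is the pattern being reported in the thread.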
> 1 second per benchmark sounds reasonable; however, if you just increase
> INNER_LOOP_ITERS a lot, then various benchmarks will become way too
> slow. So you may need to move them to INNER_LOOP_ITERS_MEDIUM or
> something similar. If you use "time $(run-bench)" in the benchtests
> makefile, it prints out the time for each benchmark.
OK, I understand this, thanks. I will use INNER_LOOP_ITERS_MEDIUM then.
--
Thanks,
Anton