Design, goals, and non-goals of GLIBC benchmarking support.
At the moment this page only covers wrapping existing tests into benchmarks by extending the test-skeleton framework. Dedicated sub-system and micro-benchmarks are also valuable and are welcome.
Design
- Benchmarks will run for a fixed amount of time, say 30 seconds. This will be controlled by a timer in test-skeleton.c similar to the one used to handle tests that time out; we cannot afford to significantly extend the time it takes the GLIBC testsuite to run. The benchmark body runs in a "while (run_benchmark) {}" loop with an alarm set to BENCHMARK_TIME seconds (30-120 seconds). When the alarm goes off, the signal handler sets "run_benchmark = false", which stops the benchmark but allows it to finish the current iteration (see the first sketch after this list). For most benchmarks a 30-120 second run gives reasonably precise results: runs below 10 seconds carry too much startup/warmup error, and runs above 120 seconds just waste CPU time (assuming the benchmark body executes within 1-5 seconds, so that the results are well averaged).
- We need to collect performance data automatically and allow users to easily submit data from their test runs. One possibility is a dedicated git repository for benchmark (and, possibly, other test) results. A testsuite run would check out the performance data, append to it after the run (or replace it and rely on git history for historic results), and push the updated data out.
- Test-skeleton.c will execute the do_test() function in a loop, and the benchmark's score will be the number of executions divided by the time the executions took (see the first sketch after this list).
- There is no goal of making benchmark scores comparable between different systems; scores will be meaningful only within a single system/setup. The goal is to provide supporting evidence for patch submissions and for regression tracking. For the second time in two years I am looking at what in GLIBC caused a 50% slowdown on a customer benchmark -- that is a 50% slowdown of the overall benchmark, which means something in GLIBC got 5+ times slower. Surely we can do better.
- For tests that can handle dummy implementations of GLIBC functions (i.e., a function that always returns the same value or is otherwise very fast) there will be a "training" mode to measure the cost of the testing code, loops, etc. that is _not_ part of the functions being benchmarked. After running the benchmark in training mode for, say, 5 seconds, the functions will be switched from the dummy to the real implementations and benchmarked. This approach allows the performance of just the GLIBC functions of interest to be estimated precisely, removing the overhead of the testing/benchmarking harness from the overall benchmark score (see the second sketch after this list).
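
To make the loop and scoring concrete, here is a minimal sketch in C. The names run_benchmark, BENCHMARK_TIME and do_test come from this page; the stop_benchmark handler, the clock_gettime-based timing and the standalone main wrapper are illustrative assumptions, not the actual test-skeleton.c code:

    #include <signal.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define BENCHMARK_TIME 30	/* Seconds; the design allows 30-120.  */

    /* Cleared by the SIGALRM handler; the current iteration of the
       benchmark body is always allowed to finish.  */
    static volatile sig_atomic_t run_benchmark = true;

    static void
    stop_benchmark (int sig)
    {
      run_benchmark = false;
    }

    /* Stand-in for the test's actual do_test function.  */
    static int
    do_test (void)
    {
      return 0;
    }

    int
    main (void)
    {
      struct timespec start, end;
      unsigned long iterations = 0;

      signal (SIGALRM, stop_benchmark);
      alarm (BENCHMARK_TIME);

      clock_gettime (CLOCK_MONOTONIC, &start);
      while (run_benchmark)
        {
          do_test ();
          ++iterations;
        }
      clock_gettime (CLOCK_MONOTONIC, &end);

      double seconds = (end.tv_sec - start.tv_sec)
                       + (end.tv_nsec - start.tv_nsec) / 1e9;

      /* The score: number of executions divided by the time they took.  */
      printf ("score: %.2f iterations/second\n", iterations / seconds);
      return 0;
    }

Driving the loop from SIGALRM rather than reading the clock on every pass keeps the per-iteration overhead down to a single flag check.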
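And a sketch of the training mode, under the same assumptions; TRAIN_TIME, dummy_strlen, strlen_impl and the function-pointer switch are hypothetical illustrations, not an existing interface:

    #include <signal.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define TRAIN_TIME 5		/* Hypothetical training interval.  */
    #define BENCHMARK_TIME 30

    static volatile sig_atomic_t run_benchmark;
    static volatile size_t sink;	/* Keeps calls from being optimized out.  */

    static void
    stop_benchmark (int sig)
    {
      run_benchmark = false;
    }

    /* Dummy stand-in with the same signature as the real function but
       trivially fast, so a training run measures only harness overhead.  */
    static size_t
    dummy_strlen (const char *s)
    {
      return 0;
    }

    /* The benchmark body calls through this pointer, letting the harness
       switch from the dummy to the real implementation.  */
    static size_t (*strlen_impl) (const char *) = dummy_strlen;

    /* Run the benchmark body for SECONDS; return the elapsed time and
       store the iteration count in *ITERATIONS.  */
    static double
    timed_run (int seconds, unsigned long *iterations)
    {
      struct timespec start, end;

      run_benchmark = true;
      *iterations = 0;
      signal (SIGALRM, stop_benchmark);
      alarm (seconds);

      clock_gettime (CLOCK_MONOTONIC, &start);
      while (run_benchmark)
        {
          sink = strlen_impl ("some benchmark input");
          ++*iterations;
        }
      clock_gettime (CLOCK_MONOTONIC, &end);

      return (end.tv_sec - start.tv_sec)
             + (end.tv_nsec - start.tv_nsec) / 1e9;
    }

    int
    main (void)
    {
      unsigned long iters;

      /* Training run: the dummy is in place, so the per-iteration cost is
         pure harness overhead.  */
      double overhead = timed_run (TRAIN_TIME, &iters) / iters;

      /* Real run: switch to the function actually being benchmarked.  */
      strlen_impl = strlen;
      double per_call = timed_run (BENCHMARK_TIME, &iters) / iters;

      /* Subtracting the trained overhead isolates the function's cost.  */
      printf ("estimated cost per call: %g seconds\n", per_call - overhead);
      return 0;
    }

Because both runs call through the same function pointer, the harness overhead is identical in the training and real runs, so subtracting the trained cost estimates the cost of just the function under test.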