These higher order statistics can be calculated on-line, with a few more state values. See e.g. http://cpansearch.perl.org/src/NIDS/Statistics-OnLine-0.02/lib/Statistics/OnLine.pm
Created attachment 9311 [details] proposed patch The variance of N data points is V = S / (N - 1) where S is the sum of squares of the deviations from the mean. Here is an attempt to implement @variance() operator using Knuth's algorithm [1]: ======= def online_variance(data): n = 0 mean = 0.0 M2 = 0.0 for x in data: n += 1 delta = x - mean mean += delta/n M2 += delta*(x - mean) if n < 2: return float('nan') else: return M2 / (n - 1) ======= This patch is based on current systemtap implementation of the aggregation operators, which first pre-aggregates the data per each CPU (__stp_stat_add()), and then, when the aggregations are actually being read via e.g. @sum (or @variance), they are aggregated again, this time across all the CPUs (_stp_stat_get()) and outputted. This approach saves shared resources at the collection time. So, in this patch, per cpu variances are being collected first and then they are being aggregated again across all the CPUs to give the resulting @variance. The N is assumed to be N >> 1 and so the resulting @variance() is being computed as a simple mean of per-cpu variances. Integer arithmetic is being used. With this patch, we get something relatively small for data points closely spread along the mean, and something relatively big for data points widely spread along the mean. So it passes a rough sanity test: ======= # stap -e 'global a probe oneshot { for(i=0; i<1000; i++) { a<<<42 } } probe end { printdln(", ", @count(a), @max(a), @variance(a)) }' 1000, 42, 1 # stap -e 'global a probe oneshot { for(i=0; i<1000; i++) { a<<<42 } for(i=0; i<20; i++) { a<<<99 } } probe end { printdln(", ", @count(a), @max(a), @variance(a)) }' 1020, 99, 65 # stap -e 'global a probe oneshot { for(i=0; i<1000; i++) { a<<<i } } probe end { printdln(" ", @count(a), @max(a), @variance(a)) }' 1000 999 332833 # ======= ------------------------------------------- [1] https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm
Just came across this: https://gist.githubusercontent.com/kastnerkyle/f5ddbb9b647ba21577e5/raw/62cb0de74e4a8dff2732277e7e547e7715c18a1a/streaming_variance.py and this: http://web.cse.ohio-state.edu/~kamatn/variance.pdf I'll do some more digging; maybe we have some even better algorithms available
Created attachment 9369 [details] working version of a patch Attached patch computes the individual per CPU variances using the Knuth's algorithm from Comment #1. Based on that, the aggregated variance over all the CPUs is being computed using the "Total Variance" formula from the above paper. This gives somewhat reasonable results for a "few" "small" integers with normal distribution, but almost any other set of values makes it behave crazily because of the integer arithmetic being used for the dividing. Below I am going to attach a little python script that helps comparing this stap variance implementation with python's statistics.variance(). At the first glance, the floating point arithmetic inside the linux kernel doesn't look like something usual or straightforward. But an attempt to implement it might be an interesting one. Not sure about this though.
Created attachment 9370 [details] a little testing script
Excellent progress! > This gives somewhat reasonable results for a "few" "small" integers with > normal distribution, but almost any other set of values makes it behave > crazily because of the integer arithmetic being used for the dividing. I suspect the loss of precision is occurring in the per-cpu sd->variance = stp_div64() ... code rather than the cross-cpu agg->variance one. Your test case (probe oneshot) runs only on one CPU, so a lot of the cross-cpu terms (S1) should be zero. > At the first glance, the floating point arithmetic inside the linux kernel > doesn't look like something usual or straightforward. But an attempt to > implement it might be an interesting one. Not sure about this though. We'll eventually get -some- FP capabilities for stap code, but let's see how much farther we can get without. If you have any more time/interest in this problem, I'd suggest investigating whether we could track the sd->avg / sd->_M2 / sd->variance values in a scaled form. We continuously track max/min, so have a good idea about the dynamic range of the <<<'d values. The code could track a "shift" parameter that scales those numbers (via << shift) during the online update phase so as to preserve as much precision as possible. Another thing we should consider afterwards is bug #10234. For a stap script not interested in @variance, this code should not be run at all.
Created attachment 9413 [details] updated working version of a patch Introduced the "shift" parameter (and fixed some errors). Now the resulting values correspond to those of python's statistics.variance() for all my tests (more than one CPU core is effectively involved). Wheee! Small shifts like 1, 2, or 3 significantly help. Higher shifts bring slightly more precision, but increase the risk of overflow. Now the "shift" could either be updated on the fly, or it could be passed to stap as a -DPARAM. Given that small shifts help so nicely, ITSM the latter might be a good trade-off, but not sure.
Created attachment 9414 [details] improved test program Attached test program distributes the computations across available CPU cores.
Comment on attachment 9413 [details] updated working version of a patch considered making the shift value a parameter for the @variance() operator, kind of like @hist_linear()'s? i.e., @variance(var,5) to suggest a shift of 2<<5 to the runtime for that variable?
Created attachment 9421 [details] updated working version of the patch Introduced optional parameter for specifying the bit shift, so that both @variance(var) and @variance(var, N) are valid. Additionally -DSTAT_BIT_SHIFT can be used, and has actual effect when N is not specified or is zero. Now I plan to listen to the possible code review, clean the code up, fix a bug residing in the bit shift being shared among all the stat operators (rewriting one another ;-) Write a dejagnu testcase, and do more testing.
Created attachment 9490 [details] The attached patch is just being tested.
The @variance() part pushed as commit 63ead7fa4ed58e12efdff1b1ab59e6056c277371 .
Let's call it a day at @variance.