Bug 11308 - aggregate operations for @variance, @skew, @kurtosis
Summary: aggregate operations for @variance, @skew, @kurtosis
Status: RESOLVED FIXED
Alias: None
Product: systemtap
Classification: Unclassified
Component: runtime
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Unassigned
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-02-22 16:48 UTC by Frank Ch. Eigler
Modified: 2016-09-08 13:18 UTC
CC List: 1 user

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
proposed patch (1.50 KB, patch)
2016-06-03 11:41 UTC, Martin Cermak
Details | Diff
working version of a patch (2.06 KB, patch)
2016-06-28 11:57 UTC, Martin Cermak
Details | Diff
a little testing script (311 bytes, text/x-python)
2016-06-28 11:58 UTC, Martin Cermak
Details
updated working version of a patch (1.98 KB, patch)
2016-08-01 14:37 UTC, Martin Cermak
Details | Diff
improved test program (755 bytes, text/x-python)
2016-08-01 14:40 UTC, Martin Cermak
Details
updated working version of the patch (3.54 KB, patch)
2016-08-03 17:16 UTC, Martin Cermak
Details | Diff
The attached patch is just being tested. (44.59 KB, application/x-xz)
2016-09-07 17:54 UTC, Martin Cermak
Details

Description Frank Ch. Eigler 2010-02-22 16:48:32 UTC
These higher order statistics can be calculated on-line,
with a few more state values.  See e.g.

http://cpansearch.perl.org/src/NIDS/Statistics-OnLine-0.02/lib/Statistics/OnLine.pm
Comment 1 Martin Cermak 2016-06-03 11:41:28 UTC
Created attachment 9311 [details]
proposed patch

The variance of N data points is V = S / (N - 1), where S is the sum of squares of the deviations from the mean.  Here is an attempt to implement the @variance() operator using Knuth's online algorithm [1]:

=======
def online_variance(data):
    n = 0
    mean = 0.0
    M2 = 0.0
     
    for x in data:
        n += 1
        delta = x - mean
        mean += delta/n
        M2 += delta*(x - mean)

    if n < 2:
        return float('nan')
    else:
        return M2 / (n - 1)
=======

This patch is based on the current systemtap implementation of the aggregation operators, which first pre-aggregates the data per CPU (__stp_stat_add()) and then, when the aggregates are actually read via e.g. @sum (or @variance), aggregates them again across all CPUs (_stp_stat_get()) before printing.  This approach saves shared resources at collection time.  So, in this patch, per-CPU variances are collected first and then aggregated again across all CPUs to give the resulting @variance.  N is assumed to satisfy N >> 1, so the resulting @variance() is computed as a simple mean of the per-CPU variances (a sketch of this combination follows the session below).  Integer arithmetic is used.  With this patch, we get something relatively small for data points spread closely around the mean, and something relatively large for data points spread widely around the mean.  So it passes a rough sanity test:

=======
# stap -e 'global a probe oneshot { for(i=0; i<1000; i++) { a<<<42 } }  probe end { printdln(", ", @count(a), @max(a), @variance(a)) }'
1000, 42, 1
# stap -e 'global a probe oneshot { for(i=0; i<1000; i++) { a<<<42 } for(i=0; i<20; i++) { a<<<99 } }  probe end { printdln(", ", @count(a), @max(a), @variance(a)) }'
1020, 99, 65
# stap -e 'global a probe oneshot { for(i=0; i<1000; i++) { a<<<i } }  probe end { printdln(" ", @count(a), @max(a), @variance(a)) }'
1000 999 332833
# 
=======
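
For illustration, here is a minimal Python sketch of the combination step described above: per-CPU variances computed with the online algorithm, then merged as a simple mean.  Names and structure are illustrative only; the actual runtime code is C and lives in the attached patch.

=======
# Illustrative sketch only (not the runtime C code): per-CPU variances via the
# online algorithm above, then combined as a simple mean, assuming N >> 1.
# Integer arithmetic mirrors what the runtime has to work with.
def combined_variance(per_cpu_samples):
    per_cpu_var = []
    for samples in per_cpu_samples:
        n, mean, m2 = 0, 0, 0
        for x in samples:
            n += 1
            delta = x - mean
            mean += delta // n           # integer division, as in the runtime
            m2 += delta * (x - mean)
        if n >= 2:
            per_cpu_var.append(m2 // (n - 1))
    return sum(per_cpu_var) // len(per_cpu_var) if per_cpu_var else 0
=======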


-------------------------------------------
[1] https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm
Comment 2 Frank Ch. Eigler 2016-06-17 23:44:13 UTC
Just came across this:

https://gist.githubusercontent.com/kastnerkyle/f5ddbb9b647ba21577e5/raw/62cb0de74e4a8dff2732277e7e547e7715c18a1a/streaming_variance.py

and this:

http://web.cse.ohio-state.edu/~kamatn/variance.pdf

I'll do some more digging; maybe we have some even better algorithms available.
Comment 3 Martin Cermak 2016-06-28 11:57:16 UTC
Created attachment 9369 [details]
working version of a patch

The attached patch computes the individual per-CPU variances using Knuth's algorithm from Comment #1. Based on those, the aggregated variance over all the CPUs is computed using the "Total Variance" formula from the paper above.
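
For readers following along, here is a minimal floating-point sketch of that kind of per-group combination, using the standard group-combination identity for sums of squared deviations.  The function name and the floating-point arithmetic are illustrative only; the patch itself has to work in integer arithmetic, and its exact code is in the attachment.

=======
# Illustrative sketch: merge per-CPU (count, mean, M2) triples into one
# sample variance using the group-combination identity.  Floating point is
# used here for clarity; the runtime must do this with integers.
def total_variance(groups):              # groups: iterable of (n, mean, M2)
    groups = [g for g in groups if g[0] > 0]
    n = sum(g[0] for g in groups)
    if n < 2:
        return float('nan')
    grand_mean = sum(g[0] * g[1] for g in groups) / n
    m2 = sum(g[2] + g[0] * (g[1] - grand_mean) ** 2 for g in groups)
    return m2 / (n - 1)
=======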

This gives reasonable results for a "few" "small" integers with a normal distribution, but almost any other set of values makes it behave erratically because of the integer arithmetic used for the division.  Below I am going to attach a little Python script that helps compare this stap variance implementation with Python's statistics.variance().

At first glance, floating point arithmetic inside the Linux kernel does not look like something usual or straightforward, but attempting it might be interesting.  Not sure about this, though.
Comment 4 Martin Cermak 2016-06-28 11:58:08 UTC
Created attachment 9370 [details]
a little testing script
Comment 5 Frank Ch. Eigler 2016-06-28 15:13:44 UTC
Excellent progress!

> This gives somewhat reasonable results for a "few" "small" integers with
> normal distribution, but almost any other set of values makes it behave
> crazily because of the integer arithmetic being used for the dividing. 

I suspect the loss of precision is occurring in the per-cpu sd->variance = stp_div64() ... code rather than the cross-cpu agg->variance one.  Your test case (probe oneshot) runs only on one CPU, so a lot of the cross-cpu terms (S1) should be zero.


> At the first glance, the floating point arithmetic inside the linux kernel
> doesn't look like something usual or straightforward.  But an attempt to
> implement it might be an interesting one.  Not sure about this though.

We'll eventually get -some- FP capabilities for stap code, but let's see how much farther we can get without.

If you have any more time/interest in this problem, I'd suggest investigating whether we could track the sd->avg / sd->_M2 / sd->variance values in a scaled form.  We continuously track max/min, so we have a good idea about the dynamic range of the <<<'d values.  The code could track a "shift" parameter that scales those numbers (via << shift) during the online update phase so as to preserve as much precision as possible.
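
Roughly, I mean something along these lines (a Python sketch with an arbitrary fixed shift; the names, the rescaling points, and how the runtime would actually choose or adjust the shift are exactly the open questions):

=======
# Rough sketch of the scaled-update idea (illustrative only): keep the
# running mean and M2 scaled by 2**shift so the integer divisions lose fewer
# low-order bits.
def online_variance_scaled(data, shift=3):
    n, mean_s, m2_s = 0, 0, 0            # mean_s and m2_s are scaled by 2**shift
    for x in data:
        n += 1
        x_s = x << shift
        delta_s = x_s - mean_s
        mean_s += delta_s // n
        m2_s += (delta_s * (x_s - mean_s)) >> shift
    if n < 2:
        return None
    return (m2_s // (n - 1)) >> shift    # rescale back to the data's units
=======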


Another thing we should consider afterwards is bug #10234.  For a stap script not interested in @variance, this code should not be run at all.
Comment 6 Martin Cermak 2016-08-01 14:37:50 UTC
Created attachment 9413 [details]
updated working version of a patch

Introduced the "shift" parameter (and fixed some errors).  Now the resulting values correspond to those of python's statistics.variance() for all my tests (more than one CPU core is effectively involved). Wheee!

Small shifts like 1, 2, or 3 significantly help.  Higher shifts bring slightly more precision, but increase the risk of overflow.

Now the "shift" could either be updated on the fly, or it could be passed to stap as a -DPARAM.  Given that small shifts help so nicely, ITSM the latter might be a good trade-off, but not sure.
Comment 7 Martin Cermak 2016-08-01 14:40:10 UTC
Created attachment 9414 [details]
improved test program

The attached test program distributes the computations across the available CPU cores.
Comment 8 Frank Ch. Eigler 2016-08-01 15:36:50 UTC
Comment on attachment 9413 [details]
updated working version of a patch

Have you considered making the shift value a parameter of the @variance() operator, kind of like @hist_linear()'s?  i.e.,  @variance(var, 5) to suggest a shift of 2<<5 to the runtime for that variable?
Comment 9 Martin Cermak 2016-08-03 17:16:19 UTC
Created attachment 9421 [details]
updated working version of the patch

Introduced an optional parameter for specifying the bit shift, so that both @variance(var) and @variance(var, N) are valid.  Additionally, -DSTAT_BIT_SHIFT can be used; it takes effect when N is not specified or is zero.

Now I plan to wait for possible code review, clean the code up, fix a bug where the bit shift is shared among all the stat operators (so they rewrite one another ;-), write a dejagnu testcase, and do more testing.
Comment 10 Martin Cermak 2016-09-07 17:54:47 UTC
Created attachment 9490 [details]
The attached patch is just being tested.
Comment 11 Martin Cermak 2016-09-08 13:10:47 UTC
The @variance() part was pushed as commit 63ead7fa4ed58e12efdff1b1ab59e6056c277371.
Comment 12 Frank Ch. Eigler 2016-09-08 13:18:15 UTC
Let's call it a day at @variance.