On IRC, soren reported many skipped probes, which we determined to be caused by the collection probes waiting on a lock held by a slow reporting probe. This simplified script reproduces the long delays:

    global vfs_write_bytes
    probe timer.s(1) {
      for (i=0; i<400; ++i)
        vfs_write_bytes[i, sprint(i), sprint(i*i)] <<< i
    }
    probe timer.s(1) {
      foreach ([uid,fsname,devname] in vfs_write_bytes) {
        printf("vfs_write_bytes[uid=%d,fsname=%s,devname=%s] = %d\n",
               uid, fsname, devname, @sum(vfs_write_bytes[uid,fsname,devname]))
        printf("vfs_write_calls[uid=%d,fsname=%s,devname=%s] = %d\n",
               uid, fsname, devname, @count(vfs_write_bytes[uid,fsname,devname]))
      }
      delete vfs_write_bytes
    }

"stap -t" shows the reporting loop taking well over 1ms for just 400 entries (thus 800 output lines). By selectively commenting out the printfs, I estimate the overhead breaks down as roughly 20% in the foreach aggregation and 40% in each printf. Changing the key tuple from (long,string,string) to (long,long,long) had no measurable effect.

Bumping TRYLOCKDELAY to 20000 (20us) fixed the problem for soren, such that the collection probes are now willing to wait long enough for the reporter to finish. In the general case I think this is too long, but at least our tunables were able to make it work.

We bounced around many ideas on how we should improve, which I'll try to summarize:

= Make the stats array (pmap) lockless, so the aggregator doesn't block writers.
  - How would this work if multiple foreach loops access the same array? Do we lose atomicity?

= Consider bumping TRYLOCKDELAY and/or MAXTRYLOCK to higher defaults, or even tuning them dynamically within overhead bounds.

= Consider making MAXSKIPPED resettable from the script, or maybe just:

    function reset_skipped() %{ atomic_set(&skipped_count, 0); %}

= Consider making MAXSKIPPED a rate rather than an absolute count, so long-running scripts can choose to tolerate infrequent skips.
= And of course, figure out why we're taking so long in the first place. 1ms / 800 printfs == 1.25us/printf ... can we do better?
See commit bdc8227 (PR11473: "Optimize references to iteration values"). With this change we get rid of redundant map lookups, and I see a 12-15% timing improvement in the example above.