Bug 11472 - many skipped probes due to lengthy printing delays
Summary: many skipped probes due to lengthy printing delays
Status: NEW
Alias: None
Product: systemtap
Classification: Unclassified
Component: runtime
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Unassigned
URL:
Keywords:
Depends on: 3217 11528
Blocks: 11179
Reported: 2010-04-06 19:59 UTC by Josh Stone
Modified: 2010-05-07 22:20 UTC

See Also:
Host:
Target:
Build:
Last reconfirmed:


Description Josh Stone 2010-04-06 19:59:40 UTC
On IRC, soren reported many skipped probes, which we determined were happening
because the collection probes were waiting on a lock held by a slow reporting
probe.  This simplified script demonstrates the long delays:


global vfs_write_bytes

probe timer.s(1) {
  for (i=0; i<400; ++i)
    vfs_write_bytes[i, sprint(i), sprint(i*i)] <<< i
}

probe timer.s(1) {
  foreach ([uid,fsname,devname] in vfs_write_bytes) {
    printf("vfs_write_bytes[uid=%d,fsname=%s,devname=%s] = %d\n",
           uid, fsname, devname, @sum(vfs_write_bytes[uid,fsname,devname]))
    printf("vfs_write_calls[uid=%d,fsname=%s,devname=%s] = %d\n",
           uid, fsname, devname, @count(vfs_write_bytes[uid,fsname,devname]))
  }
  delete vfs_write_bytes
}


"stap -t" shows the reporting loop to take well over 1ms for just 400 entries
(thus 800 lines).
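
For anyone reproducing this, a run along these lines shows the per-probe
timing (the file name is just a placeholder):

# -t makes stap report per-probe timing statistics when the session ends
stap -t skipped-probes.stp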

I experimented with commenting out printfs, and I estimate that the overhead is
about 20% in the foreach aggregation and 40% each for the printfs.

Changing the key tuple from (long,string,string) to (long,long,long) didn't seem
to have any measurable effect.
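
For concreteness, that variant just replaces the collection line with
something like the following (and changes the %s format specifiers in the
reporting probe to %d to match the all-integer keys):

vfs_write_bytes[i, i, i*i] <<< i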

Bumping TRYLOCKDELAY to 20000 (20us) fixed the problem for soren, such that the
collection probes are now willing to wait long enough for the reporter to
finish.  I think in the general case this is too long, but at least our tunables
were able to make it work.
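
For the record, these runtime tunables can be overridden per-run with -D on the
stap command line; the value here is just the one soren ended up using, and the
script name is a placeholder:

stap -DTRYLOCKDELAY=20000 skipped-probes.stp

(MAXTRYLOCK and MAXSKIPPED can be overridden the same way.)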

We bounced around many ideas on how we should improve, which I'll try to summarize:

= Make the stats array (pmap) lockless so the aggregator doesn't block writers.
- how would this work if multiple foreach loops access the same array? do we
lose atomicity?

= Consider bumping TRYLOCKDELAY and/or MAXTRYLOCK to higher defaults, or even
dynamically tune them within overhead bounds.

= Consider making MAXSKIPPED resettable in the script (see the sketch after
this list), or maybe just:
  function reset_skipped() %{ atomic_set(&skipped_count, 0); %}

= Consider making MAXSKIPPED a rate rather than an absolute count, so
long-running scripts can choose to tolerate infrequent skips.

= And of course, figure out why we're taking so long in the first place.
  1ms / 800 printfs == 1.25us/printf ... can we do better?
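
As a rough sketch of the reset_skipped() idea above (assuming the runtime's
skipped_count symbol as in the snippet, and an arbitrary 10s period), a script
could do something like this; the embedded C means it needs guru mode (stap -g):

function reset_skipped() %{ atomic_set(&skipped_count, 0); %}

probe timer.s(10) {
  reset_skipped()
}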
Comment 1 Josh Stone 2010-04-20 03:07:20 UTC
See commit bdc8227, PR11473: Optimize references to iteration values

With this change we get rid of redundant map lookups, and I see 12-15% timing
improvement in the example above.
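
As I understand the change, each @sum/@count reference to
vfs_write_bytes[uid,fsname,devname] inside the foreach used to redo a map
lookup that the iteration itself had already done, and the translator now
reuses the iteration's value instead.  For a plain (non-statistics) array the
same redundancy can be avoided by hand, e.g. (hypothetical example, array name
and values made up):

global arr

probe begin {
  for (i = 0; i < 10; i++)
    arr[i] = i * i
}

probe timer.s(1) {
  foreach (k in arr) {
    v = arr[k]   # one lookup, reused by both printfs below
    printf("arr[%d] = %d\n", k, v)
    printf("arr[%d] doubled = %d\n", k, 2 * v)
  }
  exit()
}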