On IRC, soren reported many skipped probes, which we determined to be caused by the collection probes waiting on a lock held by a slow reporting probe. This simplified script reproduces the long delays:

    global vfs_write_bytes
    probe timer.s(1) {
      for (i=0; i<400; ++i)
        vfs_write_bytes[i, sprint(i), sprint(i*i)] <<< i
    }
    probe timer.s(1) {
      foreach ([uid,fsname,devname] in vfs_write_bytes) {
        printf("vfs_write_bytes[uid=%d,fsname=%s,devname=%s] = %d\n",
               uid, fsname, devname, @sum(vfs_write_bytes[uid,fsname,devname]))
        printf("vfs_write_calls[uid=%d,fsname=%s,devname=%s] = %d\n",
               uid, fsname, devname, @count(vfs_write_bytes[uid,fsname,devname]))
      }
      delete vfs_write_bytes
    }

"stap -t" shows the reporting loop taking well over 1ms for just 400 entries (thus 800 output lines). By selectively commenting out the printfs, I estimate the overhead breaks down as roughly 20% in the foreach aggregation and 40% in each printf. Changing the key tuple from (long,string,string) to (long,long,long) had no measurable effect.

Bumping TRYLOCKDELAY to 20000 (20us) fixed the problem for soren, such that the collection probes are now willing to wait long enough for the reporter to finish. In the general case I think this is too long, but at least our tunables were able to make it work.

We bounced around many ideas on how we should improve, which I'll try to summarize:

= Make the stats array (pmap) lockless, so the aggregator doesn't block writers.
  - How would this work if multiple foreach loops access the same array? Do we lose atomicity?

= Consider bumping TRYLOCKDELAY and/or MAXTRYLOCK to higher defaults, or even tuning them dynamically within overhead bounds.

= Consider making MAXSKIPPED resettable from the script, or maybe just:

    function reset_skipped() %{ atomic_set(&skipped_count, 0); %}

= Consider making MAXSKIPPED a rate rather than an absolute count, so long-running scripts can choose to tolerate infrequent skips.
= And of course, figure out why we're taking so long in the first place. 1ms / 800 printfs == 1.25us/printf ... can we do better?
See commit bdc8227 (PR11473: "Optimize references to iteration values"). With this change we get rid of redundant map lookups, and I see a 12-15% timing improvement in the example above.