Translating existing DTrace scripts into SystemTap scripts

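If you are familiar with DTrace and have existing DTrace scripts for diagnosing performance problems, it is not difficult to translate them into equivalent SystemTap scripts. The outline of the process is:

1. Use the stap command and options
2. Match up the DTrace providers and SystemTap probe points
3. Map the DTrace built-in variables into SystemTap context variables and functions
4. Convert DTrace predicates into SystemTap conditional statements
5. Convert thread-local variables into associative arrays
6. Modify the DTrace printout code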

These steps are described in greater detail below through the conversion of some very simple DTrace examples from:

http://www.brendangregg.com/DTrace/dtrace_oneliners.txt

First Example: Successful Signal Send Details

One example in the DTrace one-liners prints out detailed information on signals:

dtrace -n 'proc:::signal-send /pid/ { printf("%s -%d %d",execname,args[2],args[1]->pr_pid); }'

This command-line DTrace script prints the executable name, the signal number, and the PID of the receiving process each time a user process sends a signal.

Use the stap command and options

The first step is to use the proper command and options to execute a SystemTap script from the command line ("stap -e"):

stap -e 'proc:::signal-send /pid/ { printf("%s -%d %d",execname,args[2],args[1]->pr_pid); }'

Look at "man stap" for more details on the available options for the stap command.

Match up the DTrace providers and SystemTap probe points

There is not a one-to-one correspondence between DTrace providers and SystemTap probe points, but in most cases a match can be found. To understand what a particular DTrace provider supplies, look it up in the DTrace documentation.

SystemTap has similar information describing its probe points and supporting functions in the SystemTap documentation.

For this particular example we find that the SystemTap signal.send probe point is a good match for proc:::signal-send and the script is now written as:

stap -e 'probe signal.send /pid/ { printf("%s -%d %d",execname,args[2],args[1]->pr_pid); }'
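When hunting for a matching probe point, the stap command itself can help: the -l option lists the probe points that match a pattern, and -L also lists the variables each matching probe point makes available. A brief sketch follows. The exact output depends on the installed SystemTap version and kernel, so none is shown here.

```shell
# List every probe point matching the pattern that this system can resolve.
stap -l 'signal.*'

# List signal.send together with the variables it exposes to the handler.
stap -L 'signal.send'
```

The variable names reported by -L are the ones to use in place of the DTrace provider arguments.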

Map the DTrace built-in variables into SystemTap context variables and functions

SystemTap probe points and supporting functions are implemented as tapsets. These tapsets provide the equivalent of the DTrace built-in variables and provider arguments. The DTrace example uses pid and execname; these map to the pid() and execname() functions respectively. For the DTrace proc:::signal-send probe, args[2] is the signal number and args[1]->pr_pid is the PID of the process receiving the signal. As described in the SystemTap documentation, the signal.send probe point provides similar variables: sig and sig_pid. Thus, the script is now:

stap -e 'probe signal.send /pid()/ { printf("%s -%d %d",execname(), sig, sig_pid); }'

Convert DTrace predicates into SystemTap conditional statements

DTrace has a more restrictive execution model for probe handlers than SystemTap; as a result, most DTrace scripts use predicates. SystemTap is a bit more flexible and allows conditional code inside the probe handler. The direct translation of a predicate is to negate it and use the next statement to skip the rest of the SystemTap probe handler:

stap -e 'probe signal.send { if (!pid()) next; printf("%s -%d %d",execname(), sig, sig_pid); }'

In this case it would be clearer to simply write the code as:

stap -e 'probe signal.send { if (pid()) printf("%s -%d %d",execname(), sig, sig_pid); }'

Convert thread-local variables into associative arrays

In this particular case the example doesn't use any thread-local storage, so nothing needs to be done for this step.

Modify the DTrace printout code

There are many differences between DTrace and SystemTap output. DTrace has more default rules for printing data without explicit code in the script. In addition, DTrace in its default (non-quiet) mode starts a new output line each time a probe fires, whereas SystemTap prints exactly what the format string specifies. To avoid having all the output of this example on a single line, you need to add a "\n" to the printf format. The command line below is the completely translated script suitable for use with SystemTap:

stap -e 'probe signal.send { if (pid()) printf("%s -%d %d\n",execname(), sig, sig_pid); }'

Second Example: Write size distribution by Executable name

Another of the DTrace one-liners prints the distribution of the size of data written by each executable:

dtrace -n 'sysinfo:::writech { @dist[execname] = quantize(arg0); }'

Use the stap command and options

You need to change "dtrace -n" into "stap -e", yielding:

stap -e 'sysinfo:::writech { @dist[execname] = quantize(arg0); }'

Match up DTrace providers to SystemTap probe points

The DTrace sysinfo:::writech provider instruments the write, writev and pwrite syscalls. The same syscalls exist in Linux. The script becomes:

stap -e 'probe syscall.write.return, syscall.writev.return, syscall.pwrite.return { @dist[execname] = quantize(arg0); }'

SystemTap allows multiple probe events to share the same probe handler. The probe events can be specified with wildcards or enumerated and separated by commas. For this particular example we must determine how much data was actually written and whether the write was successful, so the probes are on syscall.write.return, syscall.writev.return, and syscall.pwrite.return rather than on syscall.write, syscall.writev, and syscall.pwrite.

Map the DTrace built-in variables into SystemTap context variables and functions

The DTrace execname is equivalent to the SystemTap execname() function. Each *.return probe event includes a $return context variable, which is the return value for the probe point. In this case that is the number of bytes actually written.

Like DTrace, SystemTap provides associative arrays and aggregates. However, SystemTap requires associative arrays to be declared as global variables. You need to add "global dist" for the associative array that stores the information. Indexing of associative arrays is similar in SystemTap. SystemTap has the statistical operator "<<<" to add a sample to an aggregate. The data can later be printed as histograms or used to provide averages, counts, minimums, and maximums.

After modifying the script we now have:

stap -e 'global dist; probe syscall.write.return, syscall.writev.return, syscall.pwrite.return
{ dist[execname()] <<< $return; }
'

Convert DTrace predicates into SystemTap conditional statements

All of these probe events fire whether the write was successful or not. You need to put a test of the $return value to ensure that negative error values are not included in the data.

stap -e 'global dist; probe syscall.write.return, syscall.writev.return, syscall.pwrite.return
  { if ($return >=0) dist[execname()] <<< $return; }
'

Convert thread-local variables into associative arrays

There are no thread local variables in this example, so nothing needs to be done for this step.

Modify the DTrace printout code

DTrace and SystemTap differ significantly in how they produce output. DTrace automatically selects the format of the output when the script exits. SystemTap needs a "probe end" event to print the data in the desired format. In this case you want to print the @hist_log of each entry in the associative array. This is implemented with a "foreach" statement. You also want to label each histogram with its execname, so a printf precedes the printing of the histogram. The final SystemTap script is:

stap -e 'global dist; probe syscall.write.return, syscall.writev.return, syscall.pwrite.return
  { if ($return >= 0) dist[execname()] <<< $return }
probe end
  { foreach (e in dist) { printf("%s\n", e); print(@hist_log(dist[e])) } }
'

This script prints the histograms when it exits, for example when it is interrupted with Ctrl-C.
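The same aggregate can feed the other extractors mentioned earlier (counts, averages, minimums, and maximums) instead of a histogram. The following is a minimal sketch under a few assumptions: the file name write_stats.stp is hypothetical, only syscall.write.return is probed to keep it short, and $return is assumed to be available on the running kernel as in the scripts above.

```shell
# Save a variant of the script that prints summary statistics per executable.
# write_stats.stp is a hypothetical file name chosen for this sketch.
cat > write_stats.stp <<'EOF'
global dist

probe syscall.write.return {
  # Record only successful writes; negative values are error codes.
  if ($return >= 0)
    dist[execname()] <<< $return
}

probe end {
  # The @count, @avg, @min and @max extractors summarize each aggregate.
  foreach (e in dist)
    printf("%s: count=%d avg=%d min=%d max=%d\n", e,
           @count(dist[e]), @avg(dist[e]), @min(dist[e]), @max(dist[e]))
}
EOF
```

Run it with "stap write_stats.stp" and stop it with Ctrl-C to print one summary line per executable.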

Third Example: Translating scripts with thread-local variables

This example is from:

http://www.tablespace.net/quicksheet/dtrace-quickstart.html

Let's assume that the example is called read_time.d and contains:

syscall::read:entry {
  self->stime = timestamp;
}

syscall::read:return /self->stime != 0/ {
  printf("%s read() %d nsecs\n",
  execname,
  timestamp - self->stime);
}

It prints the executable name followed by the wall-clock time in nanoseconds for each read syscall.

Use the stap command and options

Rename the script to read_time.stp so that it has the ".stp" extension. It can then be run with "stap read_time.stp".

Match up DTrace providers to SystemTap probe points

The DTrace providers used in this example directly match SystemTap syscall.read and syscall.read.return. The current script is:

probe syscall.read {
  self->stime = timestamp;
}

probe syscall.read.return /self->stime != 0/ {
  printf("%s read() %d nsecs\n",
  execname,
  timestamp - self->stime);
}

Map the DTrace built-in variables into SystemTap context variables and functions

SystemTap does not implement thread-local variables in the same manner as DTrace; instead, you use a global array indexed by the thread ID (tid()) to hold each thread's value. When a thread-local value is no longer needed, it should be deleted to avoid filling the associative array with dead values. In this example the global array stime holds the thread-local values.

The DTrace timestamp and execname variables map to the SystemTap gettimeofday_ns() and execname() functions. This yields the following intermediate version of the script:

global stime

probe syscall.read {
  stime[tid()] = gettimeofday_ns();
}

probe syscall.read.return /self->stime != 0/ {
  printf("%s read() %d nsecs\n",
  execname(), gettimeofday_ns() - stime[tid()]);
  delete stime[tid()];
}

Convert DTrace predicates into SystemTap conditional statements

In the original DTrace script the predicate limited execution of the syscall::read:return handler to events that had a matching syscall::read:entry timestamp. The SystemTap version of the script needs to do the same. By default, if there is no entry in the associative array for an index value, the value is assumed to be 0. Subtracting zero from the current time would give a very large and incorrect value. The predicate is therefore implemented with the "in" operator, which checks whether the current tid() has an entry in the associative array:

global stime

probe syscall.read {
  stime[tid()] = gettimeofday_ns();
}

probe syscall.read.return {
  if (tid() in stime) {
    printf("%s read() %d nsecs\n",
    execname(), gettimeofday_ns() - stime[tid()]);
    delete stime[tid()];
  }
}

Eliminate stapio read syscalls from the output

The SystemTap script instruments all read syscalls, including those made by SystemTap's own stapio process. Those can be filtered out with a conditional statement in the syscall.read event handler. This yields the following script:

global stime

probe syscall.read {
  if (pid() != stp_pid())
    stime[tid()] = gettimeofday_ns();
}

probe syscall.read.return {
  if (tid() in stime) {
    printf("%s read() %d nsecs\n",
    execname(), gettimeofday_ns() - stime[tid()]);
    delete stime[tid()];
  }
}
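
Printing a line for every read can produce a great deal of output. As a sketch that combines this example with the aggregates from the second example, the elapsed times can be collected per executable and printed as histograms on exit. The file name read_hist.stp and the array name latency are hypothetical choices made for this illustration.

```shell
# Save a histogram-producing variant of the read timing script.
# read_hist.stp is a hypothetical file name chosen for this sketch.
cat > read_hist.stp <<'EOF'
global stime, latency

probe syscall.read {
  # Skip reads issued by SystemTap's own stapio process.
  if (pid() != stp_pid())
    stime[tid()] = gettimeofday_ns()
}

probe syscall.read.return {
  if (tid() in stime) {
    # Add the elapsed nanoseconds to this executable's aggregate.
    latency[execname()] <<< (gettimeofday_ns() - stime[tid()])
    delete stime[tid()]
  }
}

probe end {
  # Print one labeled power-of-two histogram per executable.
  foreach (e in latency) {
    printf("%s\n", e)
    print(@hist_log(latency[e]))
  }
}
EOF
```

Run it with "stap read_hist.stp" and interrupt it with Ctrl-C to see the histograms.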

None: PortingDTracetoSystemTap (last edited 2012-11-08 19:11:35 by WilliamCohen)