Using Markers

What are markers?

Here is some text taken from the kernel documentation that describes markers:

A marker placed in code provides a hook to call a function (probe) that you can provide at runtime. A marker can be "on" (a probe is connected to it) or "off" (no probe is attached). When a marker is "off" it has no effect, except for adding a tiny time penalty (checking a condition for a branch) and space penalty (adding a few bytes for the function call at the end of the instrumented function and adds a data structure in a separate section). When a marker is "on", the function you provide is called each time the marker is executed, in the execution context of the caller. When the function provided ends its execution, it returns to the caller (continuing from the marker site).
You can put markers at important locations in the code. Markers are lightweight hooks that can pass an arbitrary number of parameters, described in a printk-like format string, to the attached probe function.
They can be used for tracing and performance accounting.
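
The on/off behaviour described above can be pictured with a small user-space sketch. This is not the kernel's marker implementation: the names sketch_marker, sketch_trace_mark, sketch_marker_connect, sketch_marker_disconnect, hit_log and log_probe are all hypothetical, and a real marker site uses the trace_mark() macro. The sketch only illustrates the idea that an "off" marker costs a single branch, while an "on" marker calls the attached probe with the arguments described by the format string.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical probe signature: private data, format string, arguments. */
typedef void (*probe_fn)(void *private_data, const char *fmt, va_list args);

/* Hypothetical stand-in for a marker; NOT the kernel's data structure. */
struct sketch_marker {
        const char *name;       /* e.g. "kfunc_entry" */
        const char *format;     /* printk-like description of the arguments */
        probe_fn probe;         /* NULL means the marker is "off" */
        void *private_data;     /* handed back to the probe on every hit */
};

/* Turn the marker "on" by attaching a probe. */
static void sketch_marker_connect(struct sketch_marker *m, probe_fn probe,
                                  void *private_data)
{
        assert(m != NULL && probe != NULL);
        m->probe = probe;
        m->private_data = private_data;
}

/* Turn the marker "off" again. */
static void sketch_marker_disconnect(struct sketch_marker *m)
{
        assert(m != NULL);
        m->probe = NULL;
        m->private_data = NULL;
}

/* The marker site: when "off", the only cost is this one branch. */
static void sketch_trace_mark(struct sketch_marker *m, ...)
{
        va_list args;

        if (m->probe == NULL)
                return;
        va_start(args, m);
        m->probe(m->private_data, m->format, args);  /* runs in the caller's context */
        va_end(args);
}

/* Example probe: counts hits and keeps the last formatted message. */
struct hit_log {
        int hits;
        char last[128];
};

static void log_probe(void *private_data, const char *fmt, va_list args)
{
        struct hit_log *log = private_data;

        log->hits++;
        vsnprintf(log->last, sizeof(log->last), fmt, args);
}
```

Calling sketch_trace_mark(&m, 7) on a marker with no probe attached does nothing. After sketch_marker_connect(&m, log_probe, &log), the same call increments log.hits and stores the formatted text in log.last.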

What do markers in kernel code look like?

#include <linux/marker.h>
//...
int kfunc(struct inode *i, int op)
{
        int rc = 0;        // return code
        trace_mark(kfunc_entry, "inode %p op %d", i, op);

        //... bulk of kfunc() here...

        trace_mark(kfunc_exit, "rc %d", rc);
        return(rc);
}

This mythical function (named 'kfunc'), has 2 markers present in it. The first one has a subsystem_event of "kfunc_entry" and the second marker has a subsystem_event of "kfunc_exit". The kernel documentation suggests treating the first argument of trace_mark as having 2 parts: a 'subsystem' and an 'event'. In our example, the name of the function is used as the subsystem, and 'entry' and 'exit' are used as the events. As you can see, the second argument to trace_mark() is a format string (similar to one used by printk()), and the rest of the arguments depend on the format string.

How do I turn on marker support in my kernel?

The marker subsystem and the kernel markers themselves must be compiled into your kernel. If your kernel version is 2.6.25 or higher, you should have all the functionality you need. Initial marker support was present in kernel version 2.6.24, but for systemtap's use you must also add 3 patches to get full marker functionality:

linux-kernel-markers-support-multiple-probes.patch: adds support for multiple probes per marker.

linux-kernel-markers-support-multiple-probes-update.patch: updates the previous patch.

linux-kernel-markers-create-modpost-file.patch: adds support for creating a file called Module.markers which lists all markers present in a kernel and its modules (similar to the Module.symvers file).

After those patches have been applied to your kernel source (if needed), when running "make menuconfig", besides the normal options needed by systemtap, you'll also need to enable markers.

Instrumentation Support  --->
    [*] Activate markers

Selecting this option turns on the CONFIG_MARKERS define.

How do I use markers in systemtap?

To hook up a systemtap probe to a kernel marker, your systemtap script would use the 'kernel.mark("NAME")' facility. Using the example from above:

probe kernel.mark("kfunc_entry") { printf("kfunc_entry marker hit\n") }
probe kernel.mark("kfunc_exit") { printf("kfunc_exit marker hit\n") }

When the kernel calls kfunc(), both systemtap probes are hit and you see the corresponding output.

How do I access the marker format string?

The handler associated with a marker-based probe may read the format string specified at the marker call site. The format string is named $format.

Using the example from above:

probe kernel.mark("kfunc_entry") { printf("kfunc_entry marker hit: %s, %p, %d\n", $format, $arg1, $arg2) }

How do I access marker arguments?

The handler associated with a marker-based probe may read the optional parameters specified at the marker call site. These are named $arg1 through $argNN, where 'NN' is the number of parameters supplied by the marker. Number and string parameters are handled in a type-safe manner.

Using the example from above:

probe kernel.mark("kfunc_entry") { printf("kfunc_entry marker hit: %p, %d\n", $arg1, $arg2) }
probe kernel.mark("kfunc_exit") { printf("kfunc_exit marker hit: %d\n", $arg1) }
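
To picture why the format string matters for type safety: it is the only type information a marker carries, so each conversion in it has to be treated as either a number or a string. The following is a minimal sketch of that idea, not systemtap's actual parser. The names sketch_arg_type and sketch_classify_args are hypothetical, and the rule "%s is a string, everything else is a number" is a simplifying assumption.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical classification of a marker argument. */
enum sketch_arg_type { SKETCH_ARG_NUMBER, SKETCH_ARG_STRING };

/*
 * Walk a printk-like format string and classify each conversion.
 * Assumption: "%s" is a string and every other conversion is a number.
 * Returns the number of conversions found; only the first 'max' are stored.
 */
static size_t sketch_classify_args(const char *fmt,
                                   enum sketch_arg_type *types, size_t max)
{
        size_t n = 0;
        const char *p;

        for (p = fmt; *p != '\0'; p++) {
                if (*p != '%')
                        continue;
                p++;
                if (*p == '%')          /* "%%" is a literal percent sign */
                        continue;
                /* Skip flags, width, precision and length modifiers. */
                while (*p != '\0' && strchr("-+ #0123456789.hlLzjt", *p) != NULL)
                        p++;
                if (*p == '\0')         /* truncated conversion at end of string */
                        break;
                if (n < max)
                        types[n] = (*p == 's') ? SKETCH_ARG_STRING
                                               : SKETCH_ARG_NUMBER;
                n++;
        }
        return n;
}
```

With the kfunc_entry format "inode %p op %d", both arguments come out as numbers. A format such as "msg %s rc %d" yields a string followed by a number.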

How are marker arguments handled that are structure pointers?

Not well. Because systemtap has no DWARF information for marker arguments, it really doesn't know what type they are. In a DWARF-based probe, you could write something like this:

probe kernel.function("kfunc") { printf("inode number: %d\n", $i->i_ino) }

However, if you try something similar with a marker-based probe, you'll get an error because systemtap doesn't know two things:

that $arg1 is a pointer
the type of what $arg1 points to

So, to work around this problem, you'll have to write an access function and use the '-g' (guru mode) systemtap option. The script to access the i_ino field out of struct inode would look like this. This embedded-C routine uses a systemtap library routine in order to tolerate even invalid incoming inode pointers.

function inode_get_i_ino:long (i:long) %{ /* pure */
        struct inode *inode = (struct inode *)(long)THIS->i;
        THIS->__retvalue = kread(&(inode->i_ino));
        CATCH_DEREF_FAULT();
%}
probe kernel.mark("kfunc_entry") { printf("inode number: %d\n", inode_get_i_ino($arg1)) }

You can also try using a @cast, so that systemtap will generate the access functions for you. You don't need to run in guru mode with this method, but it does require you to have the kernel debuginfo installed.

probe kernel.mark("kfunc_entry") { printf("inode number: %d\n", @cast($arg1, "inode")->i_ino) }

A simpler solution to this problem would be to go ahead and put the interesting structure fields in the marker itself. If we were only interested in the i_ino field, this would change the marker code to look like:

#include <linux/marker.h>
//...
int kfunc(struct inode *i, int op)
{
        int rc = 0;        // return code
        trace_mark(kfunc_entry, "inode %p i_ino %lu op %d", i, i->i_ino, op);

        //... bulk of kfunc() here...

        trace_mark(kfunc_exit, "rc %d", rc);
        return(rc);
}

Then the corresponding systemtap script would be:

probe kernel.mark("kfunc_entry") { printf("inode number: %d\n", $arg2) }

How does systemtap handle markers that have the same name but different format strings?

It is possible (although not recommended) for two (or more) markers to have the same name but different format strings. Assume the following kernel function:

#include <linux/marker.h>
//...
int kfunc2(struct inode *i, int op)
{
        int rc = 0;        // return code
        char msg[100];

        trace_mark(kfunc2, "inode %p op %d", i, op);

        //... bulk of kfunc2() here...

        trace_mark(kfunc2, "msg %s rc %d", msg, rc);
        return(rc);
}

There are two instances of a marker named 'kfunc2', each with a different format string. If you wrote a probe that accessed '$arg1' as a number, it would fail for the second instance of 'kfunc2', because '$arg1' is a string there.

To uniquely specify either marker, use the optional marker probe '.format(FORMAT)' specifier.

global x = 0
probe kernel.mark("kfunc2").format("inode*") { x += $arg1 }
probe kernel.mark("kfunc2").format("msg*") { printf("msg %s\n", $arg1) }