Thoughts on discovering bpf raw tracepoint arguments

Tue Jul 9 16:51:00 GMT 2019

The bpf raw tracepoints are much closer in functionality to the
traditional systemtap linux kernel module tracepoints than the regular
bpf tracepoints are.  The regular bpf tracepoints present only a
subset of fields extracted from the tracepoint data.  I am working on
this to address pr23866.

I have been looking at implementing the bpf raw tracepoints in
systemtap and have patches for systemtap to generate bpf code and
insert that code for bpf raw tracepoints.  However, the code lacks
access to the bpf raw tracepoint arguments.  This note outlines
thoughts on how to discover the bpf raw tracepoint arguments and
generate the bpf code to access them.

The existing bpf tracepoint discovery mechanism in systemtap works by
creating an equivalent struct that describes the target variables
available.  The struct comes from a set of generated c files with
special macros that use the kernel's trace event header files.  For
example, the kernel.trace("sched_switch") tracepoint ends up with the
following data struct generated after all the macros are processed:

struct stapprobe_sched_switch {
  unsigned long long pad;
  char prev_comm[16];
  pid_t prev_pid;
  int prev_prio;
  long prev_state;
  char next_comm[16];
  pid_t next_pid;
  int next_prio;
} stapprobe_sched_switch;

The initial field pad is unused padding and the build_args_for_bpf()
method skips over it.  The data struct is used to compute the offsets
from argument register R1, which points to the beginning of the data
struct when the tracepoint triggers.  Note that the bpf tracepoint
arguments for this call are different from the original tracepoint
args described in the TP_PROTO in linux/trace/events/sched.h:

	TP_PROTO(bool preempt,
		 struct task_struct *prev,
		 struct task_struct *next),
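
As a rough illustration of the offset computation for the regular bpf
tracepoint, the sketch below copies the generated struct and uses
offsetof() to get the byte offsets that would be added to R1.  The
OFF_* names are made up for this note and are not systemtap
identifiers; the real build_args_for_bpf() gets the offsets from debug
information rather than from offsetof().

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Layout copied from the generated struct shown earlier in this note. */
struct stapprobe_sched_switch {
  unsigned long long pad;
  char prev_comm[16];
  pid_t prev_pid;
  int prev_prio;
  long prev_state;
  char next_comm[16];
  pid_t next_pid;
  int next_prio;
};

/* Hypothetical names: byte offsets from R1 for each target variable. */
enum {
  OFF_PREV_COMM  = offsetof(struct stapprobe_sched_switch, prev_comm),
  OFF_PREV_PID   = offsetof(struct stapprobe_sched_switch, prev_pid),
  OFF_PREV_PRIO  = offsetof(struct stapprobe_sched_switch, prev_prio),
  OFF_PREV_STATE = offsetof(struct stapprobe_sched_switch, prev_state),
  OFF_NEXT_COMM  = offsetof(struct stapprobe_sched_switch, next_comm),
  OFF_NEXT_PID   = offsetof(struct stapprobe_sched_switch, next_pid),
  OFF_NEXT_PRIO  = offsetof(struct stapprobe_sched_switch, next_prio)
};

/* The first real field starts right after the skipped pad field. */
static_assert(OFF_PREV_COMM == sizeof(unsigned long long),
              "prev_comm should directly follow pad");
```

None of these offsets correspond to preempt, prev or next from the
TP_PROTO; they only describe the fields copied out by the trace event.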

The bpf raw tracepoints would instead have the TP_PROTO values stored
in an array of u64 values, a different layout than the regular bpf
tracepoints.  There is no leading padding and each element is just
stuffed into its own 64-bit location.  However, it looks like with
some heavy macro magic (similar to
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/trace/bpf_probe.h#n30)
changing the DECLARE_TRACE macro it would be possible to generate:

struct stapprobe_sched_switch {
       bool preempt __attribute__ ((aligned (8)));
       struct task_struct *prev __attribute__ ((aligned (8)));
       struct task_struct *next __attribute__ ((aligned (8)));
} stapprobe_sched_switch;
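
A small sketch to check that this struct really lines up with an array
of u64 values: with every member aligned to 8 bytes, the member offset
divided by 8 is the index of the argument.  RAW_ARG_INDEX is a name
made up for this note, task_struct is left opaque, and the aligned
attribute assumes a gcc-compatible compiler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct task_struct;   /* opaque here; only pointers to it are stored */

/* Proposed raw tracepoint view: one 8-byte slot per TP_PROTO argument. */
struct stapprobe_sched_switch {
  bool preempt __attribute__ ((aligned (8)));
  struct task_struct *prev __attribute__ ((aligned (8)));
  struct task_struct *next __attribute__ ((aligned (8)));
};

/* Hypothetical helper: which u64 slot holds the given member. */
#define RAW_ARG_INDEX(field) \
  (offsetof(struct stapprobe_sched_switch, field) / sizeof(uint64_t))

static_assert(RAW_ARG_INDEX(preempt) == 0, "preempt is argument 0");
static_assert(RAW_ARG_INDEX(prev) == 1, "prev is argument 1");
static_assert(RAW_ARG_INDEX(next) == 2, "next is argument 2");
static_assert(sizeof(struct stapprobe_sched_switch) == 3 * sizeof(uint64_t),
              "struct covers exactly three u64 slots");
```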

The next problem would be to make build_args_for_bpf() handle either
version of the data struct.  I am thinking that the actual data could
be wrapped in an inner struct so it can be found in either version
regardless of the padding, so the regular struct becomes:

struct stapprobe_sched_switch {
  unsigned long long pad;
  struct {
    char prev_comm[16];
    pid_t prev_pid;
    int prev_prio;
    long prev_state;
    char next_comm[16];
    pid_t next_pid;
    int next_prio;
  } data;
} stapprobe_sched_switch;

And the bpf raw tracepoint one becomes:

struct stapprobe_sched_switch {
  struct {
    bool preempt __attribute__ ((aligned (8)));
    struct task_struct *prev __attribute__ ((aligned (8)));
    struct task_struct *next __attribute__ ((aligned (8)));
  } data;
} stapprobe_sched_switch;
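
To make the idea concrete, here is a sketch with both wrapped variants
side by side.  The two structs get different names (the _tp and _raw
suffixes are made up for this note) only so they can live in one file.
The point is that the offset of the data member absorbs the pad
difference, so the same "find data, then walk its members" logic would
cover both cases.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

struct task_struct;   /* opaque here; only pointers to it are stored */

/* Regular bpf tracepoint: pad first, then the wrapped fields. */
struct stapprobe_sched_switch_tp {
  unsigned long long pad;
  struct {
    char prev_comm[16];
    pid_t prev_pid;
    int prev_prio;
    long prev_state;
    char next_comm[16];
    pid_t next_pid;
    int next_prio;
  } data;
};

/* bpf raw tracepoint: no pad, wrapped TP_PROTO arguments. */
struct stapprobe_sched_switch_raw {
  struct {
    bool preempt __attribute__ ((aligned (8)));
    struct task_struct *prev __attribute__ ((aligned (8)));
    struct task_struct *next __attribute__ ((aligned (8)));
  } data;
};

/* The base offset of data differs; member offsets are relative to it. */
static_assert(offsetof(struct stapprobe_sched_switch_tp, data) == 8,
              "regular variant: data starts after the pad");
static_assert(offsetof(struct stapprobe_sched_switch_raw, data) == 0,
              "raw variant: data starts at the beginning");
```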

So build_args_for_bpf() would need to look for the struct inside.
I tried to get this working for the existing bpf tracepoint, but
didn't get it working.  libdwarf is not so simple to understand and
use, and I don't have the code recognizing the struct in a struct.

Another question is how to identify whether bpf raw tracepoints are
supported by the kernel.  There is some mechanism for this when
compiling the resulting stap kernel modules.  However, the
determination of whether the bpf raw tracepoints are supported is
needed much earlier.
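
One crude possibility, sketched below, is an early check of the kernel
release string.  This assumes the 4.17 threshold, which is the
upstream release where bpf raw tracepoints first appeared; distribution
kernels with backports could make the answer wrong in either
direction.  The function names are made up for this note and this is
not what systemtap does today.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Returns 1 if the release string names a kernel >= 4.17, else 0.
   Assumption: upstream version numbers; backports are not detected. */
int release_has_raw_tracepoints(const char *release)
{
  unsigned major = 0, minor = 0;

  assert(release != NULL);
  if (sscanf(release, "%u.%u", &major, &minor) != 2)
    return 0;   /* unparsable release: treat as unsupported */
  return major > 4 || (major == 4 && minor >= 17);
}

/* Same check applied to the running kernel via uname(2). */
int running_kernel_has_raw_tracepoints(void)
{
  struct utsname u;

  if (uname(&u) != 0)
    return 0;
  return release_has_raw_tracepoints(u.release);
}
```

A probe of the actual kernel interface would be more reliable than a
version comparison, but I have not worked out what that would look
like.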