Bug 909 - perf counter events, perfmon? kernel API
Summary: perf counter events, perfmon? kernel API
Status: RESOLVED FIXED
Alias: None
Product: systemtap
Classification: Unclassified
Component: tapsets
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Josh Stone
URL:
Keywords:
Duplicates: 5632
Depends on:
Blocks:
 
Reported: 2005-04-29 20:15 UTC by E. Zannoni
Modified: 2015-11-20 18:29 UTC

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
simple module to exercise perfmon2 kabi on AMD64 (1.63 KB, application/x-tar)
2006-06-14 17:07 UTC, William Cohen
patch with code locating data, setup, and shutdown (3.05 KB, patch)
2006-08-02 15:42 UTC, William Cohen
patch for perfmon tapset (4.32 KB, patch)
2006-08-08 18:39 UTC, William Cohen
latest (non-work) revision of the perfmon2 systemtap support (7.11 KB, patch)
2006-08-25 13:16 UTC, William Cohen
Patch that has $counter probe variable working with perfmon (6.97 KB, patch)
2006-08-25 15:19 UTC, William Cohen
A simple processor (x86_64) specific example that exercises the perfmon hw (171 bytes, text/plain)
2006-08-25 15:22 UTC, William Cohen
incorporating generic cycles and instructions (7.01 KB, patch)
2006-08-25 21:21 UTC, William Cohen
Perfmon probes with documentation and test (8.30 KB, patch)
2006-08-27 18:01 UTC, William Cohen
systemtap perfmon support working with current snapshot 20060911 (9.28 KB, patch)
2006-09-12 18:07 UTC, William Cohen
Simple C code to exercise the perf event kernel-api (1.01 KB, application/x-tar)
2010-02-05 20:14 UTC, William Cohen
Proto runtime functions to access kernel perf api (926 bytes, text/plain)
2010-02-12 17:14 UTC, William Cohen
Header file describing the perf data structures for systemtap runtime (456 bytes, text/plain)
2010-02-12 17:15 UTC, William Cohen
Very simple test code to check operation of the perf runtime (339 bytes, text/plain)
2010-02-12 17:20 UTC, William Cohen
Revised runtime functions to access kernel perf api (879 bytes, text/plain)
2010-02-12 21:49 UTC, William Cohen
A systemtap script to check that perf runtime sets up sampling (532 bytes, text/plain)
2010-02-12 21:51 UTC, William Cohen
quick dump of what is in the local tree (7.63 KB, patch)
2010-03-16 20:30 UTC, William Cohen
initial runtime patch (2.41 KB, patch)
2010-03-17 15:18 UTC, William Cohen
runtime change to make it easier to funnel perf interrupts through same function (1.29 KB, patch)
2010-03-17 15:20 UTC, William Cohen
strip out the old perfmon2 based code in the translator (4.94 KB, patch)
2010-03-17 15:20 UTC, William Cohen
provides a perf.cycles(num) probe tapset entry in translator (2.15 KB, patch)
2010-03-17 15:22 UTC, William Cohen

Description E. Zannoni 2005-04-29 20:15:21 UTC
 
Comment 1 Frank Ch. Eigler 2005-05-01 00:47:34 UTC
need to consider whether Mikael Pettersson's perfctr patches should be adopted
as a prerequisite
Comment 2 Frank Ch. Eigler 2006-01-30 19:35:39 UTC
Will kindly volunteered to shepherd this process.
Comment 3 William Cohen 2006-01-30 19:43:29 UTC
Looking at the Perfmon2 patches as a consistent way to manage performance monitoring
hardware. Perfctr is only for x86 processors. Perfmon is already upstream for
the ia64 kernel. Perfmon2 expands the interface to work with other processor
architectures. Both of these performance monitoring hardware interfaces have a
user-space API (system calls); there isn't an internal kernel ABI to
setup/start/stop the performance monitoring hardware.
Comment 4 Frank Ch. Eigler 2006-01-30 20:46:05 UTC
(In reply to comment #3)
> [...] there isn't an internal kernel ABI to
> setup/start/stop the performance monitoring hardware.

Bringing this into existence is what the task is all about.
Comment 5 William Cohen 2006-01-30 20:51:34 UTC
Yes, of course the important part is to implement access for the performance
monitoring hardware for SystemTap. The comment on perfmon2 and perfctr was to
describe what is currently missing.
Comment 6 William Cohen 2006-06-14 17:07:27 UTC
Created attachment 1087 [details]
simple module to exercise perfmon2 kabi on AMD64

The simple example sets up the perfmon hw using the perfmon2 kabi. The counting
starts when the module is loaded into the kernel. The counting stops when the
module is unloaded. The resulting count is printed into the /var/log/messages
via a printk.
Comment 7 William Cohen 2006-08-02 15:42:26 UTC
Created attachment 1192 [details]
patch with code locating data, setup, and shutdown

The patch is missing:
-mechanism for graceful build of systemtap without perfmon on the machine or in
kernel
-parsing of the event specification
-generation of the data bits to put into the performance monitoring hardware
-code to link perfmon register to handle describing event
Comment 8 Frank Ch. Eigler 2006-08-02 16:18:06 UTC
(In reply to comment #7)
> Created an attachment (id=1192)
> patch with code locating data, setup, and shutdown

The runtime portion looks okay.

> The patch is missing:
> -mechanism for graceful build of systemtap without perfmon on the machine or in
> kernel

Something like the analogous code for kretprobes would be fine (emit #ifdefs).

> -parsing of the event specification
> -generation of the data bits to put into the performance monitoring hardware

These are the core of the work.

> -code to link perfmon register to handle describing event

There is a perfmon2 api for that, I assume.

In any case, posting in-development snapshots as attachments here is not
really necessary.  If you don't feel that your code can go into CVS mainline
(protected by some #ifdef for example), then consider creating a development
branch.
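
The "emit #ifdefs" suggestion above could look roughly like the following sketch. It is only an illustration in plain C: HAVE_PERFMON, stp_perfmon_setup() and register_perfmon_probes() are hypothetical names, not identifiers from the systemtap tree or the perfmon2 patches.

```c
#include <errno.h>
#include <stdio.h>

/* HAVE_PERFMON is a hypothetical feature macro; a configure check or the
 * translator would define it only when perfmon2 support is available. */
#ifdef HAVE_PERFMON
int stp_perfmon_setup(void)
{
	/* Real setup through the perfmon2 kernel interface would go here. */
	return 0;
}
#else
/* Stub used when perfmon2 is absent: the code still builds, and any
 * attempt to use a perfmon probe fails with a clear error at startup. */
int stp_perfmon_setup(void)
{
	return -ENOSYS;
}
#endif

/* Returns 0 when perfmon probes were set up, a negative errno otherwise. */
int register_perfmon_probes(void)
{
	int rc = stp_perfmon_setup();

	if (rc < 0)
		fprintf(stderr, "perfmon probes unavailable (error %d)\n", rc);
	return rc;
}
```

Built without -DHAVE_PERFMON, register_perfmon_probes() returns -ENOSYS instead of breaking the build, which is the graceful behaviour the patch was missing.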
Comment 9 William Cohen 2006-08-08 18:39:36 UTC
Created attachment 1214 [details]
patch for perfmon tapset

Revision to the patch:

-configure options to control building code that uses perfmon2 in translator
-localize the perfmon code generations to tapsets.cxx
-uses libpfm to map the event names and configuration to magic bits

TODO:

-eliminate requirement for perfmon2 includes to build stap kernel modules
-code to allow script to read perfmon counts
-move probes_{allocated|registers|deregistered} and event_list into
  perfmon_derived_probes
Comment 10 William Cohen 2006-08-25 13:16:57 UTC
Created attachment 1247 [details]
latest (non-work) revision of the perfmon2 systemtap support

The attached patch compiles. It has a "$counter" target variable that provides a
handle to identify which counter is being used to store that event. However, the
translator can't find the new variable and segfaults on the following example:


global handle
global cycles_start, cycles_end

probe perfmon.counter("CPU_CLK_UNHALTED") {handle=$counter}

probe begin { cycles_start = read_counter(handle) }
probe end
{
  cycles_end = read_counter(handle);
  elapsed = cycles_end - cycles_start;
  printf("%d cycles\n", elapsed);
}


(gdb) run -p4 -k ../elapsed.stp
Starting program:
/home/wcohen/research/profiling/systemtap_perfmon/install/bin/stap -p4 -k
../elapsed.stp
while searching for arity 0 function:
semantic error: unresolved function call: identifier '$counter' at
../elapsed.stp:4:51

Program received signal SIGSEGV, Segmentation fault.
0x000000000042e60f in symresolution_info::visit_symbol (this=0x7fffd7223960,
    e=0xb34790)
    at
/usr/lib/gcc/x86_64-redhat-linux/4.0.2/../../../../include/c++/4.0.2/ext/mt_allocator.h:585

585	      { ::new(__p) _Tp(__val); }
(gdb) where
#0  0x000000000042e60f in symresolution_info::visit_symbol (
    this=0x7fffd7223960, e=0xb34790)
    at
/usr/lib/gcc/x86_64-redhat-linux/4.0.2/../../../../include/c++/4.0.2/ext/mt_allocator.h:585

#1  0x000000000041d221 in traversing_visitor::visit_assignment (
    this=0x7fffd7223960, e=0xb34510) at ../src/staptree.cxx:1434
#2  0x000000000041f5af in assignment::visit (this=0xb34510, u=0x7fffd7223960)
    at ../src/staptree.cxx:1070
#3  0x00000000004292dd in symresolution_info::visit_block (
    this=0x7fffd7223960, e=0xb34370) at ../src/elaborate.cxx:973
#4  0x000000000042d8e4 in semantic_pass_symbols (s=@0x7fffd7223d50)
    at ../src/elaborate.cxx:891
#5  0x00000000004307e1 in semantic_pass (s=@0x7fffd7223d50)
    at ../src/elaborate.cxx:916
#6  0x000000000040830e in main (argc=Variable "argc" is not available.
) at ../src/main.cxx:468
Comment 11 William Cohen 2006-08-25 15:19:44 UTC
Created attachment 1248 [details]
Patch that has $counter probe variable working with perfmon

A pass by reference was missing in the code. This patch allows the probe to use
the $counter to get a handle to uniquely identify which register is being used.
Comment 12 William Cohen 2006-08-25 15:22:58 UTC
Created attachment 1249 [details]
A simple processor (x86_64) specific example that exercises the perfmon hw

elapsed.stp is a very simple example that shows how the $counter variable is
used. For this to work you will need libpfm from perfmon2.sourceforge.net
installed, a kernel with the matching perfmon patches, and configure stap with:


 --enable-perfmon

-Will
Comment 13 Frank Ch. Eigler 2006-08-25 17:35:33 UTC
BTW "--enable-perfmon" could be auto-detected by an autoconf test looking for
libpfm.
Comment 14 William Cohen 2006-08-25 21:21:05 UTC
Created attachment 1252 [details]
incorporating generic cycles and instructions
Comment 15 William Cohen 2006-08-27 18:01:12 UTC
Created attachment 1255 [details]
Perfmon probes with documentation and test

The new patch has a simple check to exercise the perfmon probes code in the
translator and some documentation in stapfuncs and stapprobes describing
how the code works. For the time being the perfmon01.stp test is KFAIL because
most stock machines are not going to be set up with the proper kernel to run
this.

There are some caveats with the current code:
-need to have a kernel with perfmon2 patches applied
-need to explicitly load the appropriate perfmon kernel module,
 e.g. perfmon_amd64, before running a systemtap perfmon probe;
 a workaround is to run pfmon on something to load the module

I ran "make installcheck"; the code doesn't seem to introduce new
regressions.
Comment 16 William Cohen 2006-08-27 18:12:45 UTC
Proposed ChangeLog entry.

2006-08-27  William Cohen  <wcohen@redhat.com>

	* configure.ac: Flag for enabling perfmon support.
	* configure: Regenerated.

	* main.cxx:
	* session.h:
	* tapsets.cxx:
	* translate.cxx:
	* runtime/perf.c
	* runtime/perf.h
	* runtime/runtime.h
	* tapset/perfmon.stp: Support for perfmon hardware probes.

	* stapfuncs.5.in:
	* stapprobes.5.in: Documentation on perfmon probes

	* testsuite/buildok/perfmon01.stp:
	* testsuite/systemtap.pass1-4/buildok.exp: Test for perfmon probes.

Comment 17 William Cohen 2006-09-12 18:07:47 UTC
Created attachment 1294 [details]
systemtap perfmon  support working with current snapshot 20060911

TODOS on patch:

-make sure perfmon module installed before using kernel perfmon api
	how does pfmon make sure that the module is installed?
-enable smp
-perfmon sampling
-cross system use (deferring)
Comment 18 Josh Stone 2009-06-11 22:36:38 UTC
FYI: The perf_counter API has been pulled in the 2.6.31 merge window.  See:
http://marc.info/?l=linux-kernel&m=124473633222515
Comment 19 Frank Ch. Eigler 2010-01-18 20:19:17 UTC
(Assuming 2.6.33's changes enable this.)
Comment 20 William Cohen 2010-02-05 20:14:52 UTC
Created attachment 4575 [details]
Simple C code to exercise the perf event kernel-api

The perf_period.tar.gz is a tarball containing a couple of very simple examples
showing how to use the perf event kernel-api.

Once the code is unpacked, perf_period.ko and perf_period_smp.ko are built
with:


make -C "/lib/modules/`uname -r`/build" M=`pwd` ARCH="x86_64" modules V=1

The perf_period.c only samples on cpu0. The perf_period_smp.c sets up
sampling on all the processors on the machine. The perf_period.ko generates
output like the following in /var/log/messages:

...
Feb  5 15:12:19 dhcp231-201 kernel: sample_event_handler cpu0 event
ffff8801ba7f0448 count 615
Feb  5 15:12:19 dhcp231-201 kernel: sample_event_handler cpu0 event
ffff8801ba7f0448 count 616
Feb  5 15:12:19 dhcp231-201 kernel: sample_event_handler cpu0 event
ffff8801ba7f0448 count 617
Feb  5 15:12:19 dhcp231-201 kernel: event sampling shutdown


The perf_period_smp.ko module just tallies the number of times that each
processor has an event overflow:

Feb  5 15:13:44 dhcp231-201 kernel: event sampling setup
Feb  5 15:13:51 dhcp231-201 kernel: sample_event_handler cpu0 event
ffff8801b2dfeb08 count 22
Feb  5 15:13:51 dhcp231-201 kernel: sample_event_handler cpu1 event
ffff8801b2dfd9e8 count 18
Feb  5 15:13:51 dhcp231-201 kernel: sample_event_handler cpu2 event
ffff8801b2dfaad0 count 0
Feb  5 15:13:51 dhcp231-201 kernel: sample_event_handler cpu3 event
ffff8801b2df9df8 count 7
Feb  5 15:13:51 dhcp231-201 kernel: sample_event_handler cpu4 event
ffff8801b2dfc8c8 count 10
Feb  5 15:13:51 dhcp231-201 kernel: sample_event_handler cpu5 event
ffff8801b2df8890 count 13
Feb  5 15:13:51 dhcp231-201 kernel: sample_event_handler cpu6 event
ffff8801b2dfe6c0 count 0
Feb  5 15:13:51 dhcp231-201 kernel: sample_event_handler cpu7 event
ffff8801b2dfe278 count 0
Feb  5 15:13:51 dhcp231-201 kernel: event sampling shutdown
Comment 21 William Cohen 2010-02-10 14:56:01 UTC
The 2.6.33 kernel provides a kernel-api for accessing the performance
event. The interface consists of three functions:

extern struct perf_event *
perf_event_create_kernel_counter(struct perf_event_attr *attr,
				int cpu,
				pid_t pid,
				perf_overflow_handler_t callback);

extern u64 perf_event_read_value(struct perf_event *event,
				 u64 *enabled, u64 *running);

extern int perf_event_release_kernel(struct perf_event *event);

perf_event_create_kernel_counter() sets up a performance
event, perf_event_read_value() reads the current value of the
performance event, and perf_event_release_kernel() frees the
performance event when it is no longer used.

There isn't really a concept of a global count. A performance event
can be associated with a cpu or a process (but not both). A
performance event can also be set up to trigger a callback when the
count is exceeded.

The code for perf_event_read_value() uses spinlocks and
inter-processor interrupts, so it isn't clear that this function will
work in all the situations that might occur in a systemtap
probe handler. However, the environment for the callback function has
restrictions similar to those of systemtap probe handlers. Thus, to
make things manageable, the performance event will be limited to
sampling. The syntax would look similar to the timer probe:

probe perf.EVENT(N) {
/* */
}
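
To make the sampling model concrete, here is a small user-space simulation of a counter that fires a callback every N events and tallies the overflows per cpu. It is only a sketch of the idea: struct sim_counter, sim_counter_add(), tally_sample() and NR_CPUS_SIM are made-up names, and nothing here calls the kernel perf API.

```c
#include <stdint.h>

#define NR_CPUS_SIM 4	/* hypothetical cpu count for the simulation */

/* Callback type; stands in for the kernel's perf_overflow_handler_t. */
typedef void (*overflow_cb)(int cpu);

struct sim_counter {
	uint64_t period;	/* N: number of events between samples */
	uint64_t left;		/* events remaining until the next overflow */
	overflow_cb cb;		/* called once per overflow */
};

static uint64_t samples[NR_CPUS_SIM];	/* per-cpu tally of overflows */

/* Example handler: count how often each cpu overflowed. */
static void tally_sample(int cpu)
{
	samples[cpu]++;
}

static void sim_counter_init(struct sim_counter *c, uint64_t period,
			     overflow_cb cb)
{
	c->period = period ? period : 1;	/* a zero period would never terminate */
	c->left = c->period;
	c->cb = cb;
}

/* Account for nevents events on cpu, firing the callback once per period. */
static void sim_counter_add(struct sim_counter *c, int cpu, uint64_t nevents)
{
	while (nevents >= c->left) {
		nevents -= c->left;
		c->left = c->period;
		c->cb(cpu);
	}
	c->left -= nevents;
}
```

With a period of 1000, feeding 2500 events on one cpu fires the handler twice and leaves 500 events until the next sample, which is the behaviour a perf.EVENT(1000) probe would be expected to show.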

EVENT would be one of the following (might need to replace '-' with '_'):

  cpu-cycles OR cycles                       [Hardware event]
  instructions                               [Hardware event]
  cache-references                           [Hardware event]
  cache-misses                               [Hardware event]
  branch-instructions OR branches            [Hardware event]
  branch-misses                              [Hardware event]
  bus-cycles                                 [Hardware event]

N would be the interval between samples
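
A possible way to map these names to event ids, accepting both the '-' and '_' spellings, is sketched below. The enum is a local stand-in that, to the best of my knowledge, follows the order of the PERF_COUNT_HW_* values in linux/perf_event.h; lookup_hw_event() and normalize_event_name() are hypothetical helpers, not part of any patch attached here.

```c
#include <stddef.h>
#include <string.h>

/* Local mirror of the generic hardware event ids (assumed to match the
 * order of PERF_COUNT_HW_* in linux/perf_event.h). */
enum hw_event {
	HW_CPU_CYCLES = 0,
	HW_INSTRUCTIONS = 1,
	HW_CACHE_REFERENCES = 2,
	HW_CACHE_MISSES = 3,
	HW_BRANCH_INSTRUCTIONS = 4,
	HW_BRANCH_MISSES = 5,
	HW_BUS_CYCLES = 6
};

struct event_name {
	const char *name;
	enum hw_event id;
};

static const struct event_name event_names[] = {
	{ "cpu-cycles", HW_CPU_CYCLES },
	{ "cycles", HW_CPU_CYCLES },
	{ "instructions", HW_INSTRUCTIONS },
	{ "cache-references", HW_CACHE_REFERENCES },
	{ "cache-misses", HW_CACHE_MISSES },
	{ "branch-instructions", HW_BRANCH_INSTRUCTIONS },
	{ "branches", HW_BRANCH_INSTRUCTIONS },
	{ "branch-misses", HW_BRANCH_MISSES },
	{ "bus-cycles", HW_BUS_CYCLES },
};

/* Copy src into dst, replacing '-' with '_' so that the name can be used
 * as an identifier in a probe point. Always NUL-terminates dst. */
static void normalize_event_name(const char *src, char *dst, size_t dstlen)
{
	size_t i;

	if (dstlen == 0)
		return;
	for (i = 0; i + 1 < dstlen && src[i] != '\0'; i++)
		dst[i] = (src[i] == '-') ? '_' : src[i];
	dst[i] = '\0';
}

/* Return the event id for name, or -1 if the name is unknown. */
static int lookup_hw_event(const char *name)
{
	char want[32], have[32];
	size_t i;

	normalize_event_name(name, want, sizeof(want));
	for (i = 0; i < sizeof(event_names) / sizeof(event_names[0]); i++) {
		normalize_event_name(event_names[i].name, have, sizeof(have));
		if (strcmp(want, have) == 0)
			return (int)event_names[i].id;
	}
	return -1;
}
```

Both lookup_hw_event("cache-misses") and lookup_hw_event("cache_misses") resolve to the same id, so the script-level spelling can use underscores while the table keeps the names as listed above.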

To aid implementation there would be a supporting struct and two
runtime functions:

struct _Perf {
	/* per-cpu data. allocated with _stp_alloc_percpu() */
	stat *pd;
};

typedef struct _Perf *Perf;

static Perf _stp_perf_init (struct perf_event_attr *attr,
				perf_overflow_handler_t callback);

static void _stp_perf_del (Perf pe);


Need to have a test to check whether the kernel supports the
kernel perf api.
Comment 22 William Cohen 2010-02-12 17:14:13 UTC
Created attachment 4590 [details]
Proto runtime functions to access kernel perf api

This is a first pass at a systemtap/runtime/perf.c that contains functions to
add perf support to systemtap.
Comment 23 William Cohen 2010-02-12 17:15:35 UTC
Created attachment 4591 [details]
Header file describing the perf data structures  for systemtap runtime 

This is the prototype header file describing the perf data structures.
Comment 24 William Cohen 2010-02-12 17:20:54 UTC
Created attachment 4592 [details]
Very simple test code to check operation of the perf runtime

The perf_test1.stp is mainly to test that the perf runtime compiles and runs on
a kernel that has the perf kernel abi (2.6.33-rc*). The test is run with:

/usr/local/bin/stap -k -g -p4 -mperf_test1 -vvv /home/wcohen/perf_test1.stp 
/usr/local/bin/staprun perf_test1.ko

Unfortunately, perf_event_create_kernel_counter() returns an error and
the module then oopses in the _stp_kfree(pe->pd) after the exit2 label.
Comment 25 William Cohen 2010-02-12 21:49:20 UTC
Created attachment 4593 [details]
Revised runtime functions to access kernel perf api

This perf.c goes in the runtime directory. It corrects the freeing of the
percpu object.
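
The failure mode from comment 24 (creation fails part way through, then the cleanup path frees the wrong thing) can be illustrated with a user-space sketch of the init/del pair. Everything here is hypothetical: fake_create() and fake_release() stand in for the kernel create/release calls, a NULL return replaces the kernel's error-pointer convention, and sim_perf_init()/sim_perf_del() only mimic the shape of _stp_perf_init()/_stp_perf_del().

```c
#include <stdlib.h>

#define NCPUS_SIM 4	/* hypothetical cpu count */

struct fake_event {
	int cpu;
};

struct sim_perf {
	struct fake_event **events;	/* one slot per cpu, NULL until created */
};

static int fail_on_cpu = -1;	/* test knob: cpu whose creation fails */
static int live_events;		/* number of events currently allocated */

/* Stand-in for event creation; returns NULL on failure. */
static struct fake_event *fake_create(int cpu)
{
	struct fake_event *e;

	if (cpu == fail_on_cpu)
		return NULL;
	e = malloc(sizeof(*e));
	if (e) {
		e->cpu = cpu;
		live_events++;
	}
	return e;
}

/* Stand-in for event release. */
static void fake_release(struct fake_event *e)
{
	free(e);
	live_events--;
}

/* Safe on NULL and on partially initialised objects. */
static void sim_perf_del(struct sim_perf *pe)
{
	int cpu;

	if (!pe)
		return;
	if (pe->events) {
		for (cpu = 0; cpu < NCPUS_SIM; cpu++)
			if (pe->events[cpu])
				fake_release(pe->events[cpu]);
		free(pe->events);
	}
	free(pe);
}

/* Create one event per cpu; on any failure undo everything done so far. */
static struct sim_perf *sim_perf_init(void)
{
	struct sim_perf *pe = calloc(1, sizeof(*pe));
	int cpu;

	if (!pe)
		return NULL;
	pe->events = calloc(NCPUS_SIM, sizeof(*pe->events));
	if (!pe->events)
		goto fail;
	for (cpu = 0; cpu < NCPUS_SIM; cpu++) {
		pe->events[cpu] = fake_create(cpu);
		if (!pe->events[cpu])
			goto fail;
	}
	return pe;
fail:
	sim_perf_del(pe);
	return NULL;
}
```

The point of the design is that a single teardown function handles both the normal and the error path, so a failed create on one cpu releases exactly the events that were already set up and frees each allocation once.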

One needs to be root for this runtime code to operate, because the kernel has
the following code in kernel/perf_event.c, which limits the per-cpu setup to
root:

static struct perf_event_context *find_get_context(pid_t pid, int cpu)
{
...
	if (pid == -1 && cpu != -1) {
		/* Must be root to operate on a CPU event: */
		if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
			return ERR_PTR(-EACCES);

	if (!attr.exclude_kernel) {
		if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
			return -EACCES;
	}
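
The two checks quoted above can be modelled in user space as a pure function. This is a sketch under assumptions: the thresholds in paranoid_cpu() and paranoid_kernel() reflect my reading of the kernel's perf_paranoid_cpu() and perf_paranoid_kernel() helpers (level > 0 and level > 1 respectively) and should be verified against the kernel source in use; check_perf_access() is a hypothetical name.

```c
#include <errno.h>
#include <stdbool.h>

/* Assumed thresholds; verify against the perf_paranoid_*() helpers in the
 * kernel being targeted. */
static bool paranoid_cpu(int level)
{
	return level > 0;
}

static bool paranoid_kernel(int level)
{
	return level > 1;
}

/*
 * Model of the two permission checks quoted above.
 *   paranoid_level: value of the perf paranoid sysctl
 *   is_admin:       whether the caller has CAP_SYS_ADMIN
 *   per_cpu:        event is bound to a cpu rather than a process
 *   exclude_kernel: attr.exclude_kernel is set
 * Returns 0 if the event may be created, -EACCES otherwise.
 */
static int check_perf_access(int paranoid_level, bool is_admin,
			     bool per_cpu, bool exclude_kernel)
{
	if (per_cpu && paranoid_cpu(paranoid_level) && !is_admin)
		return -EACCES;
	if (!exclude_kernel && paranoid_kernel(paranoid_level) && !is_admin)
		return -EACCES;
	return 0;
}
```

Under these assumptions a per-cpu event requested without CAP_SYS_ADMIN at paranoid level 1 is rejected, which matches the observation that the runtime code only works as root.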
Comment 26 William Cohen 2010-02-12 21:51:48 UTC
Created attachment 4594 [details]
A systemtap script to check that perf runtime sets up sampling

The perf_test2.stp is a really stupid example that is compiled and run with:

/usr/local/bin/stap -k -g -p4 -mperf_test2 -vvv /home/wcohen/perf_test2.stp 

As root:

/usr/local/bin/staprun /home/wcohen/perf_test2.ko


In the /var/log/messages will see something like the following:

Feb 12 16:42:01 dhcp231-201 kernel: sample_event_handler cpu0 count 4
Feb 12 16:42:01 dhcp231-201 kernel: sample_event_handler cpu1 count 3
Feb 12 16:42:01 dhcp231-201 kernel: sample_event_handler cpu2 count 1
Feb 12 16:42:01 dhcp231-201 kernel: sample_event_handler cpu3 count 1
Feb 12 16:42:01 dhcp231-201 kernel: sample_event_handler cpu4 count 0
Feb 12 16:42:01 dhcp231-201 kernel: sample_event_handler cpu5 count 7
Feb 12 16:42:01 dhcp231-201 kernel: sample_event_handler cpu6 count 0
Feb 12 16:42:01 dhcp231-201 kernel: sample_event_handler cpu7 count 0
Comment 27 William Cohen 2010-03-16 20:30:16 UTC
Created attachment 4664 [details]
quick dump of what is in the local tree

This is a quick dump of what is in the local tree.
Comment 28 William Cohen 2010-03-17 15:18:29 UTC
Created attachment 4666 [details]
initial runtime patch
Comment 29 William Cohen 2010-03-17 15:20:00 UTC
Created attachment 4667 [details]
runtime change to make it easier to funnel perf interrupts through same function
Comment 30 William Cohen 2010-03-17 15:20:48 UTC
Created attachment 4668 [details]
strip out the old perfmon2 based code in the translator
Comment 31 William Cohen 2010-03-17 15:22:00 UTC
Created attachment 4669 [details]
provides a perf.cycles(num) probe tapset entry in translator
Comment 32 Josh Stone 2010-03-17 19:47:30 UTC
I'm working from wcohen's patches to see if we can get this in before the next
release...
Comment 33 Josh Stone 2010-03-18 01:15:05 UTC
I pushed my changes with Will's out to git now, so testing and feedback
would be appreciated.  A few TODOs and notes:

- Wildcards.  While I don't think it's that useful to probe too many
  events at once, we still want to do things like:
    $ stap -l 'perf.events("*")'

- Context variables.  Probably need $name at least, others?

- Replan the runtime/perf data structures.  At the moment there are some
  redundant/unused variables.  I think we can also do away with the
  separate entry handler for each probe by scanning for a matching
  perf_event handle in our saved list.

- Version check.  I'd like a better way to determine availability
  besides CONFIG_PERF_EVENTS, as that predates the actual kernel API
  that we need.  I think we might need to just dig in System.map for
  perf_event_create_kernel_counter.

- Hard-coded events.  This is not ideal, but the perf tool also does
  this.  At least it's supposed to be ABI-stable.  We might also consider
  using libpfm4 to get event info dynamically.

- Other sampling/counting modes.  Right now, we're just attaching events
  to every cpu and sampling the overflows.  We may later want to attach
  to specific processes instead, and perhaps expose the event counters
  somehow to other probe points.
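
For the version-check item in the list above, looking up a symbol in a System.map-style file could be sketched as follows. It assumes the usual "address type name" line layout; map_has_symbol() is a hypothetical helper and not how the translator actually performs the check.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if symbol appears as a symbol name in a System.map-style
 * stream (one "address type name" triple per line), 0 otherwise.
 */
static int map_has_symbol(FILE *map, const char *symbol)
{
	char line[512];
	char name[256];

	while (fgets(line, sizeof(line), map)) {
		/* Skip the address and type fields, keep the symbol name. */
		if (sscanf(line, "%*s %*s %255s", name) == 1 &&
		    strcmp(name, symbol) == 0)
			return 1;
	}
	return 0;
}
```

A caller would open the map file for the target kernel and test for perf_event_create_kernel_counter; an exact name comparison is used so that a prefix of a longer symbol does not count as a match.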
Comment 34 Frank Ch. Eigler 2010-03-19 16:42:20 UTC
*** Bug 5632 has been marked as a duplicate of this bug. ***
Comment 35 Frank Ch. Eigler 2010-03-19 18:59:56 UTC
Let's consider the basic functionality done; new functions to be tracked separately.