Sources Bugzilla – Bug 909
perf counter events, perfmon? kernel API
Last modified: 2010-03-23 21:16:53 UTC
need to consider whether Mikael Pettersson's perfctr patches should be adopted as a prerequisite
Will kindly volunteered to shepherd this process.
Looking at the Perfmon2 patches as a consistent way to manage performance monitoring hardware. Perfctr is only for x86 processors. Perfmon is already upstream for the ia64 kernel. Perfmon2 expands the interface to work with other processor architectures. Both of these performance monitoring hardware interfaces have a user-space API (system calls); there isn't an internal kernel ABI to setup/start/stop the performance monitoring hardware.
(In reply to comment #3)
> [...] there isn't an internal kernel ABI to
> setup/start/stop the performance monitoring hardware.

Bringing this into existence is what the task is all about.
Yes, of course the important part is to implement access for the performance monitoring hardware for SystemTap. The comment on perfmon2 and perfctr was to describe what is currently missing.
Created attachment 1087 [details] simple module to exercise perfmon2 kabi on AMD64 The simple example sets up the perfmon hw using the perfmon2 kabi. The counting starts when the module is loaded into the kernel. The counting stops when the module is unloaded. The resulting count is printed into the /var/log/messages via a printk.
Created attachment 1192 [details] patch with code locating data, setup, and shutdown

The patch is missing:
-mechanism for graceful build of systemtap without perfmon on the machine or in kernel
-parsing of the event specification
-generation of the data bits to put into the performance monitoring hardware
-code to link perfmon register to handle describing event
(In reply to comment #7)
> Created an attachment (id=1192)
> patch with code locating data, setup, and shutdown

The runtime portion looks okay.

> The patch is missing:
> -mechanism for graceful build of systemtap without perfmon on the machine or in
> kernel

Something like the analogous code for kretprobes would be fine (emit #ifdefs).

> -parsing of the event specification
> -generation of the data bits to put into the performance monitoring hardware

These are the core of the work.

> -code to link perfmon register to handle describing event

There is a perfmon2 api for that, I assume.

In any case, posting in-development snapshots as attachments here is not really necessary. If you don't feel that your code can go into CVS mainline (protected by some #ifdef for example), then consider creating a development branch.
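For illustration, the #ifdef approach suggested above might look roughly like the following in the emitted C. This is only a sketch: the macro name STP_PERFMON and both function names are hypothetical and are not taken from the patch or from the kretprobes code.

```c
#include <stdio.h>

/*
 * STP_PERFMON is a hypothetical feature macro; the idea is that the
 * build defines it only when perfmon2 headers and kernel support exist.
 */
#ifdef STP_PERFMON
static int stp_perfmon_available(void) { return 1; }
#else
static int stp_perfmon_available(void) { return 0; }
#endif

/*
 * Hypothetical registration hook: fails cleanly instead of breaking
 * the build when perfmon support was not compiled in.
 */
static int register_perfmon_probes(void)
{
    if (!stp_perfmon_available()) {
        fprintf(stderr, "perfmon probes not compiled in\n");
        return -1;
    }
    /* Real setup of the performance monitoring hardware would go here. */
    return 0;
}
```

With this shape, a module built without the macro still compiles and simply reports that perfmon probes are unavailable.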
Created attachment 1214 [details] patch for perfmon tapset

Revision to the patch:
-configure options to control building code that uses perfmon2 in translator
-localize the perfmon code generations to tapsets.cxx
-uses libpfm to map the event names and configuration to magic bits

TODO:
-eliminate requirement for perfmon2 includes to build stap kernel modules
-code to allow script to read perfmon counts
-move probes_{allocated|registers|deregistered} and event_list into perfmon_derived_probes
Created attachment 1247 [details] latest (non-working) revision of the perfmon2 systemtap support

The attached patch compiles. It has a "$counter" target variable that provides a handle to identify which counter is being used to store that event. However, the translator can't find the new variable and segfaults on the following example:

  global handle
  global cycles_start, cycles_end

  probe perfmon.counter("CPU_CLK_UNHALTED") {handle=$counter}
  probe begin { cycles_start = read_counter(handle) }
  probe end {
    cycles_end = read_counter(handle);
    elapsed = cycles_end - cycles_start;
    printf("%d cycles\n", elapsed);
  }

  (gdb) run -p4 -k ../elapsed.stp
  Starting program: /home/wcohen/research/profiling/systemtap_perfmon/install/bin/stap -p4 -k ../elapsed.stp
  while searching for arity 0 function:
  semantic error: unresolved function call: identifier '$counter' at ../elapsed.stp:4:51

  Program received signal SIGSEGV, Segmentation fault.
  0x000000000042e60f in symresolution_info::visit_symbol (this=0x7fffd7223960, e=0xb34790)
      at /usr/lib/gcc/x86_64-redhat-linux/4.0.2/../../../../include/c++/4.0.2/ext/mt_allocator.h:585
  585       { ::new(__p) _Tp(__val); }
  (gdb) where
  #0  0x000000000042e60f in symresolution_info::visit_symbol (this=0x7fffd7223960, e=0xb34790)
      at /usr/lib/gcc/x86_64-redhat-linux/4.0.2/../../../../include/c++/4.0.2/ext/mt_allocator.h:585
  #1  0x000000000041d221 in traversing_visitor::visit_assignment (this=0x7fffd7223960, e=0xb34510)
      at ../src/staptree.cxx:1434
  #2  0x000000000041f5af in assignment::visit (this=0xb34510, u=0x7fffd7223960)
      at ../src/staptree.cxx:1070
  #3  0x00000000004292dd in symresolution_info::visit_block (this=0x7fffd7223960, e=0xb34370)
      at ../src/elaborate.cxx:973
  #4  0x000000000042d8e4 in semantic_pass_symbols (s=@0x7fffd7223d50)
      at ../src/elaborate.cxx:891
  #5  0x00000000004307e1 in semantic_pass (s=@0x7fffd7223d50)
      at ../src/elaborate.cxx:916
  #6  0x000000000040830e in main (argc=Variable "argc" is not available.
  ) at ../src/main.cxx:468
Created attachment 1248 [details] Patch that has $counter probe variable working with perfmon A pass by reference was missing in the code. This patch allows the probe to use the $counter to get a handle to uniquely identify which register is being used.
Created attachment 1249 [details] A simple processor (x86_64) specific example that exercises the perfmon hw

elapsed.stp is a very simple example that shows how the $counter variable is used. For this to work you will need libpfm from perfmon2.sourceforge.net installed, a kernel with the matching perfmon patches, and stap configured with:

  --enable-perfmon

-Will
BTW "--enable-perfmon" could be auto-detected by an autoconf test looking for libpfm.
Created attachment 1252 [details] incorporating generic cycles and instructions
Created attachment 1255 [details] Perfmon probes with documentation and test

The new patch has a simple check to exercise the perfmon probes code in the translator and some documentation in stapfuncs and stapprobes describing how the code works. For the time being the perfmon01.stp test is KFAIL because most stock machines are not going to be set up with the proper kernel to run this.

There are some caveats with the current code:
-need to have a kernel with perfmon2 patches applied
-need to explicitly load the appropriate perfmon kernel module, e.g. perfmon_amd64, before running a systemtap perfmon probe; a workaround is to run pfmon on something to load the module

I ran "make installcheck"; the code doesn't seem to introduce new regressions.
Proposed ChangeLog entry.

2006-08-27  William Cohen  <wcohen@redhat.com>

	* configure: Flag for enabling perfmon support.
	* configure.ac: Regenerated.
	* main.cxx:
	* session.h:
	* tapsets.cxx:
	* translate.cxx:
	* runtime/perf.c
	* runtime/perf.h
	* runtime/runtime.h
	* tapset/perfmon.stp: Support for perfmon hardware probes.
	* stapfuncs.5.in:
	* stapprobes.5.in: Documentation on perfmon probes
	* testsuite/buildok/perfmon01.stp:
	* testsuite/systemtap.pass1-4/buildok.exp: Test for perfmon probes.
Created attachment 1294 [details] systemtap perfmon support working with current snapshot 20060911

TODOs on patch:
-make sure perfmon module is installed before using the kernel perfmon api (how does pfmon make sure that the module is installed?)
-enable smp
-perfmon sampling
-cross system use (deferring)
FYI: The perf_counter API has been pulled in the 2.6.31 merge window. See: http://marc.info/?l=linux-kernel&m=124473633222515
(Assuming 2.6.33's changes enable this.)
Created attachment 4575 [details] Simple C code to exercise the perf event kernel-api

The perf_period.tar.gz is a tarball containing a couple of very simple examples to see how to use the perf event kernel-api. Once the code is unpacked, perf_period.ko and perf_period_smp.ko are built with:

  make -C "/lib/modules/`uname -r`/build" M=`pwd` ARCH="x86_64" modules V=1

perf_period.c only samples on cpu0. perf_period_smp.c sets up sampling on all the processors on the machine. The perf_period.ko module generates output like the following in /var/log/messages:

  ...
  Feb 5 15:12:19 dhcp231-201 kernel: sample_event_handler cpu0 event ffff8801ba7f0448 count 615
  Feb 5 15:12:19 dhcp231-201 kernel: sample_event_handler cpu0 event ffff8801ba7f0448 count 616
  Feb 5 15:12:19 dhcp231-201 kernel: sample_event_handler cpu0 event ffff8801ba7f0448 count 617
  Feb 5 15:12:19 dhcp231-201 kernel: event sampling shutdown

The perf_period_smp.ko module just tallies the number of times that each processor has an event overflow:

  Feb 5 15:13:44 dhcp231-201 kernel: event sampling setup
  Feb 5 15:13:51 dhcp231-201 kernel: sample_event_handler cpu0 event ffff8801b2dfeb08 count 22
  Feb 5 15:13:51 dhcp231-201 kernel: sample_event_handler cpu1 event ffff8801b2dfd9e8 count 18
  Feb 5 15:13:51 dhcp231-201 kernel: sample_event_handler cpu2 event ffff8801b2dfaad0 count 0
  Feb 5 15:13:51 dhcp231-201 kernel: sample_event_handler cpu3 event ffff8801b2df9df8 count 7
  Feb 5 15:13:51 dhcp231-201 kernel: sample_event_handler cpu4 event ffff8801b2dfc8c8 count 10
  Feb 5 15:13:51 dhcp231-201 kernel: sample_event_handler cpu5 event ffff8801b2df8890 count 13
  Feb 5 15:13:51 dhcp231-201 kernel: sample_event_handler cpu6 event ffff8801b2dfe6c0 count 0
  Feb 5 15:13:51 dhcp231-201 kernel: sample_event_handler cpu7 event ffff8801b2dfe278 count 0
  Feb 5 15:13:51 dhcp231-201 kernel: event sampling shutdown
The 2.6.33 kernel provides a kernel API for accessing performance events. The interface consists of three functions:

  extern struct perf_event *
  perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu, pid_t pid,
                                   perf_overflow_handler_t callback);
  extern u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running);
  extern int perf_event_release_kernel(struct perf_event *event);

perf_event_create_kernel_counter() sets up a performance event, perf_event_read_value() reads the current value of the performance event, and perf_event_release_kernel() frees the performance event when it is no longer used. There isn't really a concept of a global count. A performance event can be associated with a cpu or a process (but not both). A performance event can also be set up to trigger a callback when the count is exceeded.

The code for perf_event_read_value() uses spinlocks and inter-processor interrupts, so it isn't clear that this function will work in all the situations that might occur in a systemtap probe handler. However, the environment for the callback function has restrictions similar to those of systemtap probe handlers. Thus, to make things manageable, the performance events will be limited to sampling. The syntax would look similar to the timer probe:

  probe perf.EVENT(N) { /* */ }

EVENT would be one of the following (might need to replace '-' with '_'):

  cpu-cycles OR cycles               [Hardware event]
  instructions                       [Hardware event]
  cache-references                   [Hardware event]
  cache-misses                       [Hardware event]
  branch-instructions OR branches    [Hardware event]
  branch-misses                      [Hardware event]
  bus-cycles                         [Hardware event]

N would be the interval between samples.

To aid implementation there would be a supporting struct and two runtime functions:

  struct _Perf {
    /* per-cpu data. allocated with _stp_alloc_percpu() */
    stat *pd;
  };
  typedef struct _Perf *Perf;

  static Perf _stp_perf_init (struct perf_event_attr *attr, perf_overflow_handler_t callback);
  static void _stp_perf_del (Perf pe);

A test is also needed to check whether the kernel supports the perf kernel API.
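A minimal sketch of how the EVENT names above could be turned into a struct perf_event_attr for sampling every N events. It uses only the type and constants from <linux/perf_event.h>; the table hw_events and the function perf_attr_from_name are hypothetical names made up for this illustration, not part of the proposed runtime, and the '-' to '_' normalization simply follows the note above.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table mapping probe event names to hardware event ids. */
struct hw_event_name {
    const char *name;
    unsigned long long config;
};

static const struct hw_event_name hw_events[] = {
    { "cpu_cycles",          PERF_COUNT_HW_CPU_CYCLES },
    { "cycles",              PERF_COUNT_HW_CPU_CYCLES },
    { "instructions",        PERF_COUNT_HW_INSTRUCTIONS },
    { "cache_references",    PERF_COUNT_HW_CACHE_REFERENCES },
    { "cache_misses",        PERF_COUNT_HW_CACHE_MISSES },
    { "branch_instructions", PERF_COUNT_HW_BRANCH_INSTRUCTIONS },
    { "branches",            PERF_COUNT_HW_BRANCH_INSTRUCTIONS },
    { "branch_misses",       PERF_COUNT_HW_BRANCH_MISSES },
    { "bus_cycles",          PERF_COUNT_HW_BUS_CYCLES },
};

/*
 * Fill *attr for a hardware event sampled every `period` events.
 * Accepts either '-' or '_' in the name. Returns 0 on success and
 * -1 if the name is unknown or too long.
 */
static int perf_attr_from_name(const char *name, unsigned long long period,
                               struct perf_event_attr *attr)
{
    char norm[32];
    size_t i, len = strlen(name);

    if (len >= sizeof(norm))
        return -1;
    /* Copy including the terminating NUL, replacing '-' with '_'. */
    for (i = 0; i <= len; i++)
        norm[i] = (name[i] == '-') ? '_' : name[i];

    for (i = 0; i < sizeof(hw_events) / sizeof(hw_events[0]); i++) {
        if (strcmp(norm, hw_events[i].name) == 0) {
            memset(attr, 0, sizeof(*attr));
            attr->type = PERF_TYPE_HARDWARE;
            attr->size = sizeof(*attr);
            attr->config = hw_events[i].config;
            attr->sample_period = period;
            return 0;
        }
    }
    return -1;
}
```

The resulting attr is the kind of value that would be handed to perf_event_create_kernel_counter() together with the overflow callback.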
Created attachment 4590 [details] Proto runtime functions to access kernel perf api

This is a first pass at a systemtap/runtime/perf.c that contains functions to add perf support to systemtap.
Created attachment 4591 [details] Header file describing the perf data structures for systemtap runtime This is the prototype header file describing the perf data structures.
Created attachment 4592 [details] Very simple test code to check operation of the perf runtime

The perf_test1.stp is mainly to test that the perf runtime compiles and runs on a kernel that has the perf kernel abi (2.6.33-rc*). The test is run with:

  /usr/local/bin/stap -k -g -p4 -mperf_test1 -vvv /home/wcohen/perf_test1.stp
  /usr/local/bin/staprun perf_test1.ko

Unfortunately, perf_event_create_kernel_counter() returns an error and then the module oopses in the _stp_kfree(pe->pd) after the exit2 label.
Created attachment 4593 [details] Revised runtime functions to access kernel perf api

This perf.c goes in the runtime directory. It corrects the freeing of the percpu object. One needs to be root for this runtime code to operate, because the kernel has the following code in kernel/perf_event.c, which limits per-cpu setup to root:

  static struct perf_event_context *find_get_context(pid_t pid, int cpu)
  {
  ...
      if (pid == -1 && cpu != -1) {
          /* Must be root to operate on a CPU event: */
          if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
              return ERR_PTR(-EACCES);

      if (!attr.exclude_kernel) {
          if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
              return -EACCES;
      }
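To make the effect of those two checks easier to see, here is a small user-space model of the decision. It is a sketch under assumptions: the thresholds (perf_paranoid_cpu() true when the paranoid sysctl is above 0, perf_paranoid_kernel() true when it is above 1) are my reading of the kernel helpers and should be checked against the kernel source, and the function names below are hypothetical.

```c
#include <stdbool.h>

/* Assumed behaviour of the kernel helpers, keyed on the paranoid sysctl. */
static bool model_paranoid_cpu(int level)    { return level > 0; }
static bool model_paranoid_kernel(int level) { return level > 1; }

/*
 * Returns true when a per-cpu event (pid == -1, cpu != -1) would be
 * refused with -EACCES under the checks quoted above.
 */
static bool cpu_event_denied(int paranoid_level, bool has_cap_sys_admin,
                             bool exclude_kernel)
{
    /* Per-cpu events need CAP_SYS_ADMIN once the cpu check is active. */
    if (model_paranoid_cpu(paranoid_level) && !has_cap_sys_admin)
        return true;
    /* Counting kernel-mode events has its own, stricter threshold. */
    if (!exclude_kernel && model_paranoid_kernel(paranoid_level) &&
        !has_cap_sys_admin)
        return true;
    return false;
}
```

Under this model an unprivileged caller at paranoid level 1 cannot set up the per-cpu events that the runtime uses, which matches the need to run staprun as root.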
Created attachment 4594 [details] A systemtap script to check that perf runtime sets up sampling

The perf_test2.stp is a really stupid example that is compiled and run with:

  /usr/local/bin/stap -k -g -p4 -mperf_test2 -vvv /home/wcohen/perf_test2.stp

As root:

  /usr/local/bin/staprun /home/wcohen/perf_test2.ko

In /var/log/messages you will see something like the following:

  Feb 12 16:42:01 dhcp231-201 kernel: sample_event_handler cpu0 count 4
  Feb 12 16:42:01 dhcp231-201 kernel: sample_event_handler cpu1 count 3
  Feb 12 16:42:01 dhcp231-201 kernel: sample_event_handler cpu2 count 1
  Feb 12 16:42:01 dhcp231-201 kernel: sample_event_handler cpu3 count 1
  Feb 12 16:42:01 dhcp231-201 kernel: sample_event_handler cpu4 count 0
  Feb 12 16:42:01 dhcp231-201 kernel: sample_event_handler cpu5 count 7
  Feb 12 16:42:01 dhcp231-201 kernel: sample_event_handler cpu6 count 0
  Feb 12 16:42:01 dhcp231-201 kernel: sample_event_handler cpu7 count 0
Created attachment 4664 [details] quick dump of what is in the local tree This is a quick dump of what is in the local tree.
Created attachment 4666 [details] initial runtime patch
Created attachment 4667 [details] runtime change to make it easier to funnel perf interrupts through same function
Created attachment 4668 [details] strip out the old perfmon2 based code in the translator
Created attachment 4669 [details] provides a perf.cycles(num) probe tapset entry in translator
I'm working from wcohen's patches to see if we can get this in before the next release...
I pushed my changes with Will's out to git now, so testing and feedback would be appreciated.

A few TODOs and notes:

- Wildcards. While I don't think it's that useful to probe too many events at once, we still want to do things like:
    $ stap -l 'perf.events("*")'

- Context variables. Probably need $name at least, others?

- Replan the runtime/perf data structures. At the moment there are some redundant/unused variables. I think we can also do away with the separate entry handler for each probe by scanning for a matching perf_event handle in our saved list.

- Version check. I'd like a better way to determine availability besides CONFIG_PERF_EVENTS, as that predates the actual kernel API that we need. I think we might need to just dig in System.map for perf_event_create_kernel_counter.

- Hard-coded events. This is not ideal, but the perf tool also does this. At least it's supposed to be ABI-stable. We might also consider using libpfm4 to get event info dynamically.

- Other sampling/counting modes. Right now, we're just attaching events to every cpu and sampling the overflows. We may later want to attach to specific processes instead, and perhaps expose the event counters somehow to other probe points.
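The idea in the data-structure item above, replacing per-probe entry handlers with one handler that scans a saved list for the matching event handle, could be sketched like this. The struct and function names are hypothetical, and a plain const void * stands in for the kernel's struct perf_event * so the sketch compiles in user space.

```c
#include <stddef.h>

/*
 * Hypothetical record kept for each registered perf probe: the event
 * handle returned at setup time and the index of the probe it serves.
 */
struct saved_perf_probe {
    const void *event;   /* stands in for struct perf_event * */
    int probe_index;
};

/*
 * Linear scan used by a single shared overflow handler to find which
 * probe an event belongs to. Returns the probe index, or -1 if the
 * handle is not in the list.
 */
static int find_probe_for_event(const struct saved_perf_probe *list,
                                size_t n, const void *event)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (list[i].event == event)
            return list[i].probe_index;
    }
    return -1;
}
```

A linear scan costs time proportional to the number of registered events on every overflow, so whether it is acceptable depends on how many events end up attached (one per cpu per probe in the current scheme).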
*** Bug 5632 has been marked as a duplicate of this bug. ***
Let's consider the basic functionality done; new functions to be tracked separately.