Bug 6960 - script for latencytop-like functionality for older (rhel5) kernels
Status: RESOLVED FIXED
Alias: None
Product: systemtap
Classification: Unclassified
Component: tapsets
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: William Cohen
URL:
Keywords:
Depends on: 10665 12026
Blocks:
Reported: 2008-10-14 11:18 UTC by Frank Ch. Eigler
Modified: 2010-09-24 19:23 UTC (History)
3 users

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
proto latencytap script (724 bytes, application/octet-stream)
2009-09-18 02:24 UTC, William Cohen
Attempt to make latencytap script more portable (780 bytes, application/octet-stream)
2009-09-18 19:24 UTC, William Cohen
Record backtrace in the output. (732 bytes, text/plain)
2010-08-10 19:29 UTC, William Cohen
perl script that runs the latencytap script and translates backtraces (624 bytes, text/plain)
2010-08-13 19:39 UTC, William Cohen
Latencytop translation file (2.01 KB, text/plain)
2010-08-13 19:40 UTC, William Cohen
systemtap script part of instrumentation (675 bytes, patch)
2010-08-13 19:44 UTC, William Cohen
Fold the reason generating logic into the systemtap script (3.35 KB, text/plain)
2010-08-13 21:21 UTC, William Cohen

Description Frank Ch. Eigler 2008-10-14 11:18:13 UTC
The latencytop kernel-side code uses hooks in the scheduler that we may be
able to approximate with kprobes or whatever markers may be available.
Comment 1 Frank Ch. Eigler 2009-03-06 17:24:04 UTC
.../testsuite/*examples/.../latencytap.stp is a sketch
Comment 2 Lans Carstensen 2009-06-06 00:11:25 UTC
The sketch at testsuite/systemtap.examples/profiling/latencytap.stp appears
useful for recent kernels, but isn't functional for RHEL5 kernels as CFS wasn't
yet incorporated.  It appears something like the following is needed:

%( kernel_v <= "2.6.18" %?

probe kernel.function("__activate_task") {

[...]

%)

but I'm not familiar enough with the scheduler structures to understand how to
properly rewrite _get_sleep_time, where the sched_entity structure doesn't exist
on RHEL5.  Could someone help shed some illumination my way?
Comment 3 William Cohen 2009-07-08 18:08:49 UTC
Looked into using scheduler.cpu_on scheduler.cpu_off to probe for this
information. However, the parameters are not available at the tapset probe
points in RHEL5. The RHEL5.4 kernel will have tracepoints that could be used for
this type of thing. Would it be acceptable to use the tracepoints that would be
in RHEL5 and upstream kernels?
Comment 4 Lans Carstensen 2009-07-08 19:03:18 UTC
Thanks for the update.

I'm fine with moving to stock tracepoints/markers in theory, but don't see any
scheduler markers/tracepoints in RHEL 5.4 beta (kernel-2.6.18-155.el5).  Do you
have a test kernel I could try?
FWIW, /usr/src/kernels/2.6.18-155.el5-x86_64/Module.markers only shows markers
for jbd2 and ext4 - I'm assuming that's the place I should be looking, right?
Comment 5 William Cohen 2009-07-08 19:53:19 UTC
Tracepoints don't use markers. If you have a recent vintage of systemtap
(0.9.7), you should be able to run the following to get a list of available
tracepoints:

 stap -L 'kernel.trace("*")' | sort

You should get something like the following as output:


kernel.trace("activate_task") $p:struct task_struct* $rq:struct rq*
kernel.trace("add_to_page_cache") $mapping:struct address_space* $offset:long unsigned int
kernel.trace("block_bio_backmerge") $q:struct request_queue* $bio:struct bio*
kernel.trace("block_bio_bounce") $q:struct request_queue* $bio:struct bio*
kernel.trace("block_bio_complete") $q:struct request_queue* $bio:struct bio*
kernel.trace("block_bio_frontmerge") $q:struct request_queue* $bio:struct bio*
kernel.trace("block_bio_queue") $q:struct request_queue* $bio:struct bio*
kernel.trace("block_getrq") $q:struct request_queue* $bio:struct bio* $rw:int
kernel.trace("block_plug") $q:struct request_queue*
kernel.trace("block_remap") $q:struct request_queue* $bio:struct bio* $dev:dev_t $from:sector_t $to:sector_t
kernel.trace("block_rq_complete") $q:struct request_queue* $rq:struct request*
kernel.trace("block_rq_insert") $q:struct request_queue* $rq:struct request*
kernel.trace("block_rq_issue") $q:struct request_queue* $rq:struct request*
kernel.trace("block_rq_requeue") $q:struct request_queue* $rq:struct request*
kernel.trace("block_sleeprq") $q:struct request_queue* $bio:struct bio* $rw:int
kernel.trace("block_split") $q:struct request_queue* $bio:struct bio* $pdu:unsigned int
kernel.trace("block_unplug_io") $q:struct request_queue*
kernel.trace("block_unplug_timer") $q:struct request_queue*
kernel.trace("deactivate_task") $p:struct task_struct* $rq:struct rq*
kernel.trace("irq_entry") $id:unsigned int $regs:struct pt_regs*
kernel.trace("irq_exit") $id:unsigned int $retval:irqreturn_t
kernel.trace("irq_softirq_entry") $h:struct softirq_action* $softirq_vec:struct softirq_action*
kernel.trace("irq_softirq_exit") $h:struct softirq_action* $softirq_vec:struct softirq_action*
kernel.trace("irq_tasklet_high_entry") $t:struct tasklet_struct*
kernel.trace("irq_tasklet_high_exit") $t:struct tasklet_struct*
kernel.trace("irq_tasklet_low_entry") $t:struct tasklet_struct*
kernel.trace("irq_tasklet_low_exit") $t:struct tasklet_struct*
kernel.trace("kfree_skb") $skb:struct sk_buff* $location:void*
kernel.trace("mm_anon_cow") $mm:struct mm_struct* $address:long unsigned int $page:struct page*
kernel.trace("mm_anon_fault") $mm:struct mm_struct* $address:long unsigned int $page:struct page*
kernel.trace("mm_anon_pgin") $mm:struct mm_struct* $address:long unsigned int $page:struct page*
kernel.trace("mm_anon_unmap") $page:struct page* $success:int
kernel.trace("mm_anon_userfree") $mm:struct mm_struct* $address:long unsigned int $page:struct page*
kernel.trace("mm_directreclaim_reclaimall") $priority:long unsigned int
kernel.trace("mm_directreclaim_reclaimzone") $reclaimed:long unsigned int
kernel.trace("mm_filemap_cow") $mm:struct mm_struct* $address:long unsigned int $page:struct page*
kernel.trace("mm_filemap_fault") $mm:struct mm_struct* $address:long unsigned int $page:struct page*
kernel.trace("mm_filemap_unmap") $page:struct page* $success:int
kernel.trace("mm_filemap_userunmap") $mm:struct mm_struct* $address:long unsigned int $page:struct page*
kernel.trace("mm_kswapd_runs") $reclaimed:long unsigned int
kernel.trace("mm_page_allocation") $page:struct page* $free:long unsigned int
kernel.trace("mm_page_free") $page:struct page*
kernel.trace("mm_pagereclaim_free") $page:struct page* $anon:int
kernel.trace("mm_pagereclaim_pgout") $page:struct page* $anon:int
kernel.trace("mm_pagereclaim_shrinkactive_a2a") $page:struct page*
kernel.trace("mm_pagereclaim_shrinkactive_a2i") $page:struct page*
kernel.trace("mm_pagereclaim_shrinkactive") $scanned:long unsigned int
kernel.trace("mm_pagereclaim_shrinkinactive_i2a") $page:struct page*
kernel.trace("mm_pagereclaim_shrinkinactive_i2i") $page:struct page*
kernel.trace("mm_pagereclaim_shrinkinactive") $reclaimed:long unsigned int
kernel.trace("mm_pagereclaim_shrinkzone") $reclaimed:long unsigned int
kernel.trace("mm_pdflush_bgwriteout") $count:long unsigned int
kernel.trace("mm_pdflush_kupdate") $count:long unsigned int
kernel.trace("napi_poll") $napi:struct net_device*
kernel.trace("remove_from_page_cache") $mapping:struct address_space* $offset:long unsigned int
kernel.trace("rpc_bind_status") $task:struct rpc_task*
kernel.trace("rpc_call_status") $task:struct rpc_task*
kernel.trace("rpc_connect_status") $task:struct rpc_task* $status:int
kernel.trace("sched_process_exit") $p:struct task_struct*
kernel.trace("sched_process_fork") $parent:struct task_struct* $child:struct task_struct*
kernel.trace("sched_process_free") $p:struct task_struct*
kernel.trace("sched_process_wait") $pid:pid_t
kernel.trace("sched_switch") $rq:struct rq* $prev:struct task_struct* $next:struct task_struct*
kernel.trace("sched_wakeup_new") $rq:struct rq* $p:struct task_struct* $success:int
kernel.trace("sched_wakeup") $rq:struct rq* $p:struct task_struct* $success:int
kernel.trace("socket_recvmsg") $sock:struct socket* $msg:struct msghdr* $size:size_t $flags:int $ret:int
kernel.trace("socket_sendmsg") $sock:struct socket* $msg:struct msghdr* $size:size_t $ret:int
kernel.trace("socket_sendpage") $sock:struct socket* $page:struct page* $offset:int $size:size_t $flags:int $ret:int
Comment 6 Lans Carstensen 2009-07-09 16:16:28 UTC
Thank you!

So, the trace on "activate_task" seems to be the perfect magic we'd want to use
to measure kernel task latencies, as it's what you instrumented before.  It
requires working with the task_structs on the run queue, however, which is
where I was before in trying to figure out how to rewrite the _get_sleep_time
function.

Any further hints/help for me on actually collecting the latencies this way, or
should I try and power through it sometime on my own?
Comment 7 William Cohen 2009-07-09 16:33:16 UTC
An additional associative array in the script will need to keep track of when
the task went to sleep, because on these kernels the task_struct doesn't carry
that data.

systemtap.examples/process/sleepingBeauties.stp probes wait_for_completion. That
would be one place to probe. However, that is not sufficient. I need to trace
through the code on http://lxr.linux.no/ to find where the latencytop data is
getting set.

I plan to do some more diving through the kernel code this evening. However,
feel free to explore before then.
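The bookkeeping described above might look roughly like this sketch, using the activate_task/deactivate_task tracepoints from the list in comment 5 (the tracepoint arguments and pre-CFS availability are assumptions; this is an illustration, not the final script):

```systemtap
# Sketch only: track per-task sleep time in associative arrays.
global sleep_start, total_sleep, nsleeps

# Task leaves the run queue: remember when it went to sleep.
probe kernel.trace("deactivate_task") {
  sleep_start[$p->pid] = gettimeofday_us()
}

# Task is put back on the run queue: accumulate how long it slept.
probe kernel.trace("activate_task") {
  t = sleep_start[$p->pid]
  if (t) {
    total_sleep[$p->pid] += gettimeofday_us() - t
    nsleeps[$p->pid]++
    delete sleep_start[$p->pid]
  }
}

# On exit, print the 20 tasks that slept the longest in total.
probe end {
  foreach (pid in total_sleep- limit 20)
    printf("pid %d: %d us total over %d sleeps\n",
           pid, total_sleep[pid], nsleeps[pid])
}
```

Capturing a backtrace at the deactivation point would then let the report group sleeps by reason rather than just by pid.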
Comment 8 William Cohen 2009-09-15 15:03:29 UTC
Looked at the _stp_stack_print_tsk() code to see whether this can be used. There
are two main drawbacks with using _stp_stack_print_tsk() from the runtime lib:

1) Uses save_stack_trace_tsk() in a form that is only available in 2.6.31 kernels
2) Even in the 2.6.31 kernels only x86 and parisc archs have support

RHEL 5 uses an older kernel and supports a number of processors in addition to x86.
Comment 9 William Cohen 2009-09-18 02:24:58 UTC
Created attachment 4203 [details]
proto latencytap script

The attached patch is a step towards providing latencytap functionality with
systemtap on older kernels. However, there are a number of limitations with the
script:

-Only works with 2.6.31 kernels
-Only works on x86 machines
-Uses a questionable runtime function, task_backtrace($p), that does not
protect against invalid pointers
-Uses guru mode code to access parts of the task_struct to compute sleep time
-The kernel needs to be compiled with CONFIG_SCHEDSTATS
Comment 10 William Cohen 2009-09-18 19:24:08 UTC
Created attachment 4208 [details]
Attempt to make latencytap script more portable

This revision of the latencytap script attempts to remove some of the
dependencies of the earlier one on a recent kernel and on the problematic
task_backtrace() function. In theory this should make it portable to older
kernels (e.g. RHEL5).

However, there is a problem with this script. The backtrace() function is too
slow in the probe scheduler.cpu_on and causes probes to be skipped when it is
used.
Comment 11 William Cohen 2010-08-10 19:29:35 UTC
Created attachment 4924 [details]
Record backtrace in the output.

The backtrace() function is much faster in systemtap 1.3. This version of
latencytap enables storing the backtraces. It has been run on RHEL5 and RHEL6
kernels.  You end up with something like the following from a RHEL6 machine:

 ../install/bin/stap -v /tmp/latencytap6.stp 
Pass 1: parsed user script and 80 library script(s) using
93988virt/21792res/2460shr kb, in 110usr/0sys/293real ms.
Pass 2: analyzed script: 5 probe(s), 13 function(s), 4 embed(s), 4 global(s)
using 240800virt/47684res/3676shr kb, in 1900usr/1510sys/10528real ms.
Pass 3: translated to C into
"/tmp/stapcEMM7U/stap_e35e911e617667d844ac44e18079673b_6329.c" using
238284virt/52040res/8172shr kb, in 460usr/10sys/541real ms.
Pass 4: compiled C into "stap_e35e911e617667d844ac44e18079673b_6329.ko" in
9740usr/1240sys/16189real ms.
Pass 5: starting run.
Starting data collection
mode - All Pids

^C
25237 204850 2270 0x0 
8084 23146 1001 finish_task_switch thread_return schedule_hrtimeout_range
poll_schedule_timeout do_select core_sys_select sys_select system_call_fastpath

5730 4558 134 finish_task_switch thread_return 0xffffffffa02d8a25
autoremove_wake_function 0xffffffffa02ea77d 0xffffffffa02d61f2 vfs_ioctl
do_vfs_ioctl sys_ioctl system_call_fastpath 
5666 48707 2144 finish_task_switch thread_return worker_thread kthread
child_rip kthread child_rip 
3441 14601 943 finish_task_switch thread_return ksoftirqd kthread child_rip
kthread child_rip 
864 28019 2270 finish_task_switch thread_return schedule_hrtimeout_range
poll_schedule_timeout do_sys_poll sys_poll system_call_fastpath 
583 16867 2125 finish_task_switch thread_return schedule_hrtimeout_range
poll_schedule_timeout do_sys_poll sys_poll system_call_fastpath 
417 39806 1000 finish_task_switch thread_return futex_wait_queue_me futex_wait
do_futex sys_futex system_call_fastpath 
253 12045 2105 finish_task_switch thread_return migration_thread kthread
child_rip kthread child_rip 
52 2189 134 finish_task_switch thread_return 0xffffffffa02d8a25
autoremove_wake_function 0xffffffffa02ea77d 0xffffffffa02d61f2
call_function_single_interrupt 0xffffffffa02d5cd0 vfs_ioctl do_vfs_ioctl
sys_ioctl system_call_fastpath 
33 8752 1000 finish_task_switch thread_return do_nanosleep hrtimer_nanosleep
sys_nanosleep system_call_fastpath 
17 3403 200 finish_task_switch thread_return schedule_hrtimeout_range
poll_schedule_timeout do_sys_poll sys_ppoll system_call_fastpath 
13 44 12 finish_task_switch thread_return io_schedule sync_page
__wait_on_bit_lock __lock_page handle_pte_fault handle_mm_fault do_page_fault
page_fault 
11 0 0 finish_task_switch thread_return 0xffffffffa00aa35a
autoremove_wake_function 0xffffffffa00aa670 0xffffffffa00ab557
0xffffffffa00bb379 0xffffffffa00ab91b 0xffffffffa004bd7a 0xffffffffa00ab840
vfs_ioctl do_vfs_ioctl sys_ioctl system_call_fastpath 
11 126 27 finish_task_switch thread_return io_schedule sync_page
__wait_on_bit_lock __lock_page find_lock_page filemap_fault __do_fault
handle_pte_fault handle_mm_fault do_page_fault page_fault 
7 128 33 finish_task_switch thread_return io_schedule 0xffffffffa000238b
default_wake_function 0xffffffffa00037a3 0xffffffffa0003804 0xffffffffa00037b0
worker_thread autoremove_wake_function worker_thread kthread child_rip kthread
child_rip 
4 1 1 finish_task_switch thread_return __mutex_lock_slowpath mutex_lock
0xffffffffa00aa6bd 0xffffffffa00aa680 worker_thread autoremove_wake_function
worker_thread kthread child_rip kthread child_rip 
4 0 0 finish_task_switch thread_return schedule_timeout wait_for_common
wait_for_completion synchronize_sched_expedited __synchronize_srcu
synchronize_srcu_expedited 0xffffffffa02e4901 0xffffffffa02d94a7 avc_has_perm
avc_has_perm inode_has_perm ktime_get lapic_next_event tick_dev_program_event
vfs_ioctl do_vfs_ioctl sys_ioctl system_call_fastpath 
4 0 0 finish_task_switch thread_return schedule_timeout wait_for_common
wait_for_completion synchronize_sched_expedited __synchronize_srcu
synchronize_srcu_expedited 0xffffffffa02e4901 0xffffffffa02d94a7 avc_has_perm
avc_has_perm inode_has_perm ktime_get lapic_next_event tick_dev_program_event
vfs_ioctl do_vfs_ioctl sys_ioctl system_call_fastpath 
4 179 107 finish_task_switch thread_return 0xffffffffa02d8a25
autoremove_wake_function 0xffffffffa02ea77d 0xffffffffa02d61f2 do_futex
vfs_ioctl do_vfs_ioctl sys_ioctl system_call_fastpath 
Pass 5: run completed in 10usr/60sys/4047real ms.


Some of the output looks a bit odd. The values that begin with 0xffffffffa02d
and 0xffffffffa00a are not resolved to symbol names.
Comment 12 Mark Wielaard 2010-08-10 20:12:23 UTC
(In reply to comment #11)
> Some of the output looks a bit odd. The values that begin with 0xffffffffa02d
> and 0xffffffffa00a

Turned out those were kernel module addresses. Adding --all-modules to the stap
invocation made them resolve to symbols too.
Comment 13 William Cohen 2010-08-13 19:39:50 UTC
Created attachment 4930 [details]
perl script that runs the latencytap script and translates backtraces

A prototype perl script that makes use of the latencytop.trans file to convert
the backtraces into more human-readable output. The perl script needs a new
version of systemtap that supports the --all-modules option. It also expects
the latencytap9.stp and latencytop.trans files to be in /tmp.
Comment 14 William Cohen 2010-08-13 19:40:31 UTC
Created attachment 4931 [details]
Latencytop translation file
Comment 15 William Cohen 2010-08-13 19:44:27 UTC
Created attachment 4932 [details]
systemtap script part of instrumentation
Comment 16 William Cohen 2010-08-13 21:21:59 UTC
Created attachment 4933 [details]
Fold the reason generating logic into the systemtap script

This is work in progress. The script now builds a hash table of reasons based
on the backtraces. I would like to eliminate the scheduler.cpu_on probe and
just have that bookkeeping done in the activate_task probe, by using the
task_backtrace() function.
Comment 17 William Cohen 2010-09-14 18:27:40 UTC
Git commit dadb3ad90b6f9df1ec1e7ca132a1c682fb93303c checks in the current
latencytap.stp.

TODOs:
-If possible, use task_backtrace() to eliminate the overhead of the
 scheduler.cpu_on probe and the this_sleep global associative array
-Combine identical reasons together in output (different backtraces can have
 the same reason)
-Provide a better mechanism to focus on processes (right now it allows only a
 global summary or tracking a single target pid set when the script starts up)
Comment 18 William Cohen 2010-09-24 19:23:03 UTC
The current version of latencytap.stp is checked in and there is a
latencytap.meta file for it. I will call this done for now; there are a number
of improvements that could still be made to the script:

-Avoid the scheduler.cpu_on probe and its associated overhead with the changes
mentioned in pr12026
-Have alternative views of the data, like:
    -view of latency by pid
    -dividing latencies into interruptible and non-interruptible
-Add mappings from backtrace functions to additional reasons

Some discussion about these enhancements is in the following email:
http://sources.redhat.com/ml/systemtap/2010-q3/msg00455.html