The latencytop kernel-side code uses hooks in the scheduler that we may be able to approximate with kprobes or with whatever markers are available.
.../testsuite/*examples/.../latencytap.stp is a sketch
The sketch at testsuite/systemtap.examples/profiling/latencytap.stp appears useful for recent kernels, but isn't functional on RHEL5 kernels, since CFS hadn't yet been incorporated there. It appears something like the following is needed:

%( kernel_v <= "2.6.18" %?
probe kernel.function("__activate_task") { [...] }
%)

but I'm not familiar enough with the scheduler structures to understand how to properly rewrite _get_sleep_time, given that the sched_entity structure doesn't exist on RHEL5. Could someone help shed some illumination my way?
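For reference, a minimal sketch of what such a version-conditional probe might look like. This is only a starting point, not a working _get_sleep_time replacement: the probe body and the $p argument name are guesses for the 2.6.18-era scheduler and have not been tested.

```systemtap
# Hypothetical sketch for pre-CFS (<= 2.6.18) kernels: record when a
# task is put back on the runqueue.  The $p argument name is an
# assumption, not verified against the RHEL5 sources.
global activated

%( kernel_v <= "2.6.18" %?
probe kernel.function("__activate_task")
{
    # __activate_task() re-adds a task to the runqueue; remember when
    activated[$p->pid] = gettimeofday_us()
}
%)
```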
Looked into using scheduler.cpu_on and scheduler.cpu_off to probe for this information. However, the parameters are not available at those tapset probe points in RHEL5. The RHEL5.4 kernel will have tracepoints that could be used for this type of thing. Would it be acceptable to use the tracepoints available in both RHEL5 and upstream kernels?
Thanks for the update. I'm fine with moving to stock tracepoints/markers in theory, but don't see any scheduler markers/tracepoints in RHEL 5.4 beta (kernel-2.6.18-155.el5). Do you have a test kernel I could try? FWIW, /usr/src/kernels/2.6.18-155.el5-x86_64/Module.markers only shows markers for jbd2 and ext4 - I'm assuming that's the place I should be looking, right?
Tracepoints don't use the markers. If you have a recent vintage of systemtap (0.9.7), you should be able to do the following to get a list of available tracepoints:

stap -L 'kernel.trace("*")' | sort

You should get something like the following as output:

kernel.trace("activate_task") $p:struct task_struct* $rq:struct rq*
kernel.trace("add_to_page_cache") $mapping:struct address_space* $offset:long unsigned int
kernel.trace("block_bio_backmerge") $q:struct request_queue* $bio:struct bio*
kernel.trace("block_bio_bounce") $q:struct request_queue* $bio:struct bio*
kernel.trace("block_bio_complete") $q:struct request_queue* $bio:struct bio*
kernel.trace("block_bio_frontmerge") $q:struct request_queue* $bio:struct bio*
kernel.trace("block_bio_queue") $q:struct request_queue* $bio:struct bio*
kernel.trace("block_getrq") $q:struct request_queue* $bio:struct bio* $rw:int
kernel.trace("block_plug") $q:struct request_queue*
kernel.trace("block_remap") $q:struct request_queue* $bio:struct bio* $dev:dev_t $from:sector_t $to:sector_t
kernel.trace("block_rq_complete") $q:struct request_queue* $rq:struct request*
kernel.trace("block_rq_insert") $q:struct request_queue* $rq:struct request*
kernel.trace("block_rq_issue") $q:struct request_queue* $rq:struct request*
kernel.trace("block_rq_requeue") $q:struct request_queue* $rq:struct request*
kernel.trace("block_sleeprq") $q:struct request_queue* $bio:struct bio* $rw:int
kernel.trace("block_split") $q:struct request_queue* $bio:struct bio* $pdu:unsigned int
kernel.trace("block_unplug_io") $q:struct request_queue*
kernel.trace("block_unplug_timer") $q:struct request_queue*
kernel.trace("deactivate_task") $p:struct task_struct* $rq:struct rq*
kernel.trace("irq_entry") $id:unsigned int $regs:struct pt_regs*
kernel.trace("irq_exit") $id:unsigned int $retval:irqreturn_t
kernel.trace("irq_softirq_entry") $h:struct softirq_action* $softirq_vec:struct softirq_action*
kernel.trace("irq_softirq_exit") $h:struct softirq_action* $softirq_vec:struct softirq_action*
kernel.trace("irq_tasklet_high_entry") $t:struct tasklet_struct*
kernel.trace("irq_tasklet_high_exit") $t:struct tasklet_struct*
kernel.trace("irq_tasklet_low_entry") $t:struct tasklet_struct*
kernel.trace("irq_tasklet_low_exit") $t:struct tasklet_struct*
kernel.trace("kfree_skb") $skb:struct sk_buff* $location:void*
kernel.trace("mm_anon_cow") $mm:struct mm_struct* $address:long unsigned int $page:struct page*
kernel.trace("mm_anon_fault") $mm:struct mm_struct* $address:long unsigned int $page:struct page*
kernel.trace("mm_anon_pgin") $mm:struct mm_struct* $address:long unsigned int $page:struct page*
kernel.trace("mm_anon_unmap") $page:struct page* $success:int
kernel.trace("mm_anon_userfree") $mm:struct mm_struct* $address:long unsigned int $page:struct page*
kernel.trace("mm_directreclaim_reclaimall") $priority:long unsigned int
kernel.trace("mm_directreclaim_reclaimzone") $reclaimed:long unsigned int
kernel.trace("mm_filemap_cow") $mm:struct mm_struct* $address:long unsigned int $page:struct page*
kernel.trace("mm_filemap_fault") $mm:struct mm_struct* $address:long unsigned int $page:struct page*
kernel.trace("mm_filemap_unmap") $page:struct page* $success:int
kernel.trace("mm_filemap_userunmap") $mm:struct mm_struct* $address:long unsigned int $page:struct page*
kernel.trace("mm_kswapd_runs") $reclaimed:long unsigned int
kernel.trace("mm_page_allocation") $page:struct page* $free:long unsigned int
kernel.trace("mm_page_free") $page:struct page*
kernel.trace("mm_pagereclaim_free") $page:struct page* $anon:int
kernel.trace("mm_pagereclaim_pgout") $page:struct page* $anon:int
kernel.trace("mm_pagereclaim_shrinkactive_a2a") $page:struct page*
kernel.trace("mm_pagereclaim_shrinkactive_a2i") $page:struct page*
kernel.trace("mm_pagereclaim_shrinkactive") $scanned:long unsigned int
kernel.trace("mm_pagereclaim_shrinkinactive_i2a") $page:struct page*
kernel.trace("mm_pagereclaim_shrinkinactive_i2i") $page:struct page*
kernel.trace("mm_pagereclaim_shrinkinactive") $reclaimed:long unsigned int
kernel.trace("mm_pagereclaim_shrinkzone") $reclaimed:long unsigned int
kernel.trace("mm_pdflush_bgwriteout") $count:long unsigned int
kernel.trace("mm_pdflush_kupdate") $count:long unsigned int
kernel.trace("napi_poll") $napi:struct net_device*
kernel.trace("remove_from_page_cache") $mapping:struct address_space* $offset:long unsigned int
kernel.trace("rpc_bind_status") $task:struct rpc_task*
kernel.trace("rpc_call_status") $task:struct rpc_task*
kernel.trace("rpc_connect_status") $task:struct rpc_task* $status:int
kernel.trace("sched_process_exit") $p:struct task_struct*
kernel.trace("sched_process_fork") $parent:struct task_struct* $child:struct task_struct*
kernel.trace("sched_process_free") $p:struct task_struct*
kernel.trace("sched_process_wait") $pid:pid_t
kernel.trace("sched_switch") $rq:struct rq* $prev:struct task_struct* $next:struct task_struct*
kernel.trace("sched_wakeup_new") $rq:struct rq* $p:struct task_struct* $success:int
kernel.trace("sched_wakeup") $rq:struct rq* $p:struct task_struct* $success:int
kernel.trace("socket_recvmsg") $sock:struct socket* $msg:struct msghdr* $size:size_t $flags:int $ret:int
kernel.trace("socket_sendmsg") $sock:struct socket* $msg:struct msghdr* $size:size_t $ret:int
kernel.trace("socket_sendpage") $sock:struct socket* $page:struct page* $offset:int $size:size_t $flags:int $ret:int
Thank you! So, the trace on "activate_task" seems to be exactly the magic we'd want to use to measure kernel task latencies, as it's what you instrumented before. It requires working with the task_structs on the run queue, however, which is where I was stuck before when trying to figure out how to rewrite the _get_sleep_time function. Any further hints/help on actually collecting the latencies this way, or should I try to power through it on my own sometime?
An additional associative array in the script will need to keep track of when the task went to sleep, because the RHEL5 task_struct doesn't have that data structure. systemtap.examples/process/sleepingBeauties.stp probes wait_for_completion; that would be one place to probe. However, that alone is not sufficient. I need to trace through the code on http://lxr.linux.no/ to find where the latencytop data is getting set. I plan to do some more diving through the kernel code this evening, but feel free to explore before then.
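To make the bookkeeping idea concrete, here is one possible shape for it: a hedged, untested sketch that pairs the deactivate_task/activate_task tracepoints from the list above, using a global associative array to remember when each task left the runqueue. The argument names follow the stap -L output; everything else is illustrative.

```systemtap
# Sketch only: record when a task leaves the runqueue, then charge the
# elapsed time as sleep latency when the task is activated again.
global sleep_start, latency

probe kernel.trace("deactivate_task")
{
    sleep_start[$p->pid] = gettimeofday_us()
}

probe kernel.trace("activate_task")
{
    t = sleep_start[$p->pid]
    if (t) {
        latency[$p->pid] <<< gettimeofday_us() - t
        delete sleep_start[$p->pid]
    }
}

probe end
{
    foreach (pid in latency)
        printf("pid %d: count %d, avg sleep %d us\n",
               pid, @count(latency[pid]), @avg(latency[pid]))
}
```

This only measures time off the runqueue; attributing a reason to each latency (as latencytop does) would still require capturing a backtrace at wakeup.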
Looked at the _stp_stack_print_tsk() code to see whether it can be used. There are two main drawbacks to using _stp_stack_print_tsk() from the runtime lib:
1) It uses save_stack_trace_tsk() in a form that is only available in 2.6.31 kernels
2) Even in the 2.6.31 kernels, only the x86 and parisc archs have support
RHEL5 uses an older kernel and supports a number of processor architectures in addition to x86.
Created attachment 4203 [details]
proto latencytap script

The attached patch is a step towards providing latencytop functionality with systemtap on older kernels. However, there are a number of limitations with the script:
- Only works with 2.6.31 kernels
- Only works on x86 machines
- Uses a questionable runtime function, task_backtrace($p), that does not protect against invalid pointers
- Uses guru-mode code to access parts of the task_struct to compute sleep time
- The kernel needs to be compiled with CONFIG_SCHEDSTATS
Created attachment 4208 [details]
Attempt to make latencytap script more portable

This revision of the latencytap script attempts to remove some of the earlier one's dependencies on recent kernels and on the problematic task_backtrace() function. In theory this should make it portable to older kernels (e.g. RHEL5). However, there is a problem with this script: the backtrace() function is too slow in the scheduler.cpu_on probe and causes probes to be skipped when it is used.
Created attachment 4924 [details]
Record backtrace in the output.

The backtrace() function is much faster in systemtap 1.3. This version of latencytap enables storing the backtraces. It has been run on RHEL5 and RHEL6 kernels. You end up with something like the following from a RHEL6 machine:

../install/bin/stap -v /tmp/latencytap6.stp
Pass 1: parsed user script and 80 library script(s) using 93988virt/21792res/2460shr kb, in 110usr/0sys/293real ms.
Pass 2: analyzed script: 5 probe(s), 13 function(s), 4 embed(s), 4 global(s) using 240800virt/47684res/3676shr kb, in 1900usr/1510sys/10528real ms.
Pass 3: translated to C into "/tmp/stapcEMM7U/stap_e35e911e617667d844ac44e18079673b_6329.c" using 238284virt/52040res/8172shr kb, in 460usr/10sys/541real ms.
Pass 4: compiled C into "stap_e35e911e617667d844ac44e18079673b_6329.ko" in 9740usr/1240sys/16189real ms.
Pass 5: starting run.
Starting data collection mode - All Pids
^C
25237 204850 2270 0x0
8084 23146 1001 finish_task_switch thread_return schedule_hrtimeout_range poll_schedule_timeout do_select core_sys_select sys_select system_call_fastpath
5730 4558 134 finish_task_switch thread_return 0xffffffffa02d8a25 autoremove_wake_function 0xffffffffa02ea77d 0xffffffffa02d61f2 vfs_ioctl do_vfs_ioctl sys_ioctl system_call_fastpath
5666 48707 2144 finish_task_switch thread_return worker_thread kthread child_rip kthread child_rip
3441 14601 943 finish_task_switch thread_return ksoftirqd kthread child_rip kthread child_rip
864 28019 2270 finish_task_switch thread_return schedule_hrtimeout_range poll_schedule_timeout do_sys_poll sys_poll system_call_fastpath
583 16867 2125 finish_task_switch thread_return schedule_hrtimeout_range poll_schedule_timeout do_sys_poll sys_poll system_call_fastpath
417 39806 1000 finish_task_switch thread_return futex_wait_queue_me futex_wait do_futex sys_futex system_call_fastpath
253 12045 2105 finish_task_switch thread_return migration_thread kthread child_rip kthread child_rip
52 2189 134 finish_task_switch thread_return 0xffffffffa02d8a25 autoremove_wake_function 0xffffffffa02ea77d 0xffffffffa02d61f2 call_function_single_interrupt 0xffffffffa02d5cd0 vfs_ioctl do_vfs_ioctl sys_ioctl system_call_fastpath
33 8752 1000 finish_task_switch thread_return do_nanosleep hrtimer_nanosleep sys_nanosleep system_call_fastpath
17 3403 200 finish_task_switch thread_return schedule_hrtimeout_range poll_schedule_timeout do_sys_poll sys_ppoll system_call_fastpath
13 44 12 finish_task_switch thread_return io_schedule sync_page __wait_on_bit_lock __lock_page handle_pte_fault handle_mm_fault do_page_fault page_fault
11 0 0 finish_task_switch thread_return 0xffffffffa00aa35a autoremove_wake_function 0xffffffffa00aa670 0xffffffffa00ab557 0xffffffffa00bb379 0xffffffffa00ab91b 0xffffffffa004bd7a 0xffffffffa00ab840 vfs_ioctl do_vfs_ioctl sys_ioctl system_call_fastpath
11 126 27 finish_task_switch thread_return io_schedule sync_page __wait_on_bit_lock __lock_page find_lock_page filemap_fault __do_fault handle_pte_fault handle_mm_fault do_page_fault page_fault
7 128 33 finish_task_switch thread_return io_schedule 0xffffffffa000238b default_wake_function 0xffffffffa00037a3 0xffffffffa0003804 0xffffffffa00037b0 worker_thread autoremove_wake_function worker_thread kthread child_rip kthread child_rip
4 1 1 finish_task_switch thread_return __mutex_lock_slowpath mutex_lock 0xffffffffa00aa6bd 0xffffffffa00aa680 worker_thread autoremove_wake_function worker_thread kthread child_rip kthread child_rip
4 0 0 finish_task_switch thread_return schedule_timeout wait_for_common wait_for_completion synchronize_sched_expedited __synchronize_srcu synchronize_srcu_expedited 0xffffffffa02e4901 0xffffffffa02d94a7 avc_has_perm avc_has_perm inode_has_perm ktime_get lapic_next_event tick_dev_program_event vfs_ioctl do_vfs_ioctl sys_ioctl system_call_fastpath
4 0 0 finish_task_switch thread_return schedule_timeout wait_for_common wait_for_completion synchronize_sched_expedited __synchronize_srcu synchronize_srcu_expedited 0xffffffffa02e4901 0xffffffffa02d94a7 avc_has_perm avc_has_perm inode_has_perm ktime_get lapic_next_event tick_dev_program_event vfs_ioctl do_vfs_ioctl sys_ioctl system_call_fastpath
4 179 107 finish_task_switch thread_return 0xffffffffa02d8a25 autoremove_wake_function 0xffffffffa02ea77d 0xffffffffa02d61f2 do_futex vfs_ioctl do_vfs_ioctl sys_ioctl system_call_fastpath
Pass 5: run completed in 10usr/60sys/4047real ms.

Some of the output looks a bit odd. The values that begin with 0xffffffffa02d and 0xffffffffa00a
(In reply to comment #11)
> Some of the output looks a bit odd. The values that begin with 0xffffffffa02d
> and 0xffffffffa00a

Turned out those were kernel module addresses. Adding --all-modules to the stap invocation made them resolve to symbols too.
Created attachment 4930 [details]
perl script that runs the latencytap script and translates backtraces

A prototype perl script that makes use of the latencytop.trans file to convert the backtraces into more human-readable output. The perl script needs a new version of systemtap that supports the --all-modules option. It also expects the latencytap9.stp and latencytop.trans files to be in /tmp.
Created attachment 4931 [details]
Latencytop translation file
Created attachment 4932 [details]
systemtap script part of instrumentation
Created attachment 4933 [details]
Fold the reason-generating logic into the systemtap script

This is work in progress. The script now builds a hash table of reasons based on the backtraces. I would like to eliminate the cpu.on probe and have that bookkeeping done in the activate_task probe instead, by using the task_backtrace() function.
Git commit dadb3ad90b6f9df1ec1e7ca132a1c682fb93303c checks in the current latencytap.stp.

TODOs:
- If possible, use task_backtrace() to eliminate the overhead of the scheduler.cpu_on probe and the this_sleep global associative array
- Combine identical reasons together in the output (different backtraces can map to the same reason)
- Provide a better mechanism to focus on particular processes (right now the script allows either a global summary or tracking a single target pid chosen when the script starts up)
The current version of latencytap.stp is checked in, and there is a latencytap.meta file for it. Will call this done for now; there are a number of improvements that could still be made to the script:
- Avoid the scheduler.cpu_on probe and its associated overhead, via the changes mentioned in pr12026
- Provide alternative views of the data, such as:
  - a view of latency by pid
  - dividing latencies into interruptible and non-interruptible
- Add mappings from backtrace functions to additional reasons
There is some discussion of these enhancements in the following email:
http://sources.redhat.com/ml/systemtap/2010-q3/msg00455.html