From 8bec7eb3444e3a880bab121e300ba10c944d538a Mon Sep 17 00:00:00 2001
From: fche
Date: Thu, 5 Oct 2006 16:41:21 +0000
Subject: [PATCH] * round 'em up

* move 'em out
---
 tapsets/contextinfo/contextinfo.txt    |  72 ----
 tapsets/dynamic_cg/dynamic_cg.txt      |  64 ----
 tapsets/dynamic_cg/tapset.stp          |   7 -
 tapsets/dynamic_cg/usage.stp           |  30 --
 tapsets/profile/profile_tapset.txt     | 503 ------------------------
 tapsets/timestamp/timestamp_tapset.txt | 327 ----------------
 6 files changed, 1003 deletions(-)
 delete mode 100644 tapsets/contextinfo/contextinfo.txt
 delete mode 100644 tapsets/dynamic_cg/dynamic_cg.txt
 delete mode 100644 tapsets/dynamic_cg/tapset.stp
 delete mode 100644 tapsets/dynamic_cg/usage.stp
 delete mode 100644 tapsets/profile/profile_tapset.txt
 delete mode 100644 tapsets/timestamp/timestamp_tapset.txt

diff --git a/tapsets/contextinfo/contextinfo.txt b/tapsets/contextinfo/contextinfo.txt
deleted file mode 100644
index 5f7f725f9..000000000
--- a/tapsets/contextinfo/contextinfo.txt
+++ /dev/null
@@ -1,72 +0,0 @@
-* Application name: probe context information variables
-* Contact: fche
-* Motivation: let probes know where/how they were fired; introspective
-  probe handlers
-* Background: discussions on mailing lists
-* Target software: various
-* Type of description: tapset variables
-* Interesting probe points: n/a
-* Interesting values:
-
-  $pp_alias: string: the string specification of the probe point, as found
-      in the original .stp file, before alias and other expansion
-  $pp: string: representation of this probe point, after alias and wildcard
-      expansion
-  $pp_function: string: source function (if available)
-  $pp_srcfile: string: source file name (if available)
-  $pp_srcline: number: line number in source file (if available)
-
-  $function[pc]: string: function name containing given address
-  $module[pc]: string: kernel module name containing given address
-  $address[sym]: number: base address of given function symbol
-
-  $pc: number: PC snapshot at invocation
-  $stack[depth]: number: PC of caller at given depth, if available
-
-  $pid, $tgid, $uid, $comm: number/string: current-> fields
-
-* Dependencies:
-
-  Debug-info files
-
-* Restrictions:
-
-  The $pp series of variables is computed at translation time, and so
-  applies only to those probes that have related debug-info points.
-
-  $pc should be directly available.
-
-  The $function series of read-only pseudo-arrays is calculated at
-  run time, from symbol table information passed in some way.
-  $stack[0] might take some probing in the registers, or (eek!) on the
-  target stack frame.  Conservatively returning 0 instead may be okay.
-
-  The current-based series of values ($pid etc.), for kernel-targeted
-  probes, needs to check for !in_interrupt() before dereferencing current->.
-
-* Data collection:
-
-  Several of the variables are translation-time constants, so these don't
-  have run-time collection needs.
-
-  For a kernel/module probe, $function[] could be computed from the kallsyms
-  lookup functions.  Alternatively, the translator could emit a copy of the
-  target symbol table into the probe C code, which $function[] could
-  search.  The $stack[] elements would be served by the runtime on a
-  best-effort basis.
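-
-  For illustration, a minimal sketch (not a committed design) of serving
-  $function[pc] from the 2.6 kallsyms interface; the helper name and the
-  output format here are hypothetical:
-
-    #include <linux/kallsyms.h>
-    #include <linux/kernel.h>
-
-    /* resolve a PC to "function [module]" on a best-effort basis */
-    static void stp_resolve_pc(unsigned long pc, char *out, size_t len)
-    {
-            char namebuf[KSYM_NAME_LEN];
-            unsigned long size, offset;
-            char *modname;
-            const char *name;
-
-            name = kallsyms_lookup(pc, &size, &offset, &modname, namebuf);
-            if (!name)
-                    snprintf(out, len, "0x%lx", pc);   /* unknown: raw PC */
-            else if (modname)
-                    snprintf(out, len, "%s [%s]", name, modname);
-            else
-                    snprintf(out, len, "%s", name);
-    }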
-
-* Data presentation:
-
-  n/a: variables are simple
-
-* Competition:
-
-  unknown
-
-* Cross-references:
-
-  http://sources.redhat.com/ml/systemtap/2005-q2/msg00395.html
-  http://sources.redhat.com/ml/systemtap/2005-q2/msg00281.html
-
-* Associated files:
diff --git a/tapsets/dynamic_cg/dynamic_cg.txt b/tapsets/dynamic_cg/dynamic_cg.txt
deleted file mode 100644
index 35de29f10..000000000
--- a/tapsets/dynamic_cg/dynamic_cg.txt
+++ /dev/null
@@ -1,64 +0,0 @@
-* Application name: Dynamic Callgraph
-* Contact: William Cohen, wcohen@redhat.com
-
-* Motivation:
-
-Dynamic Callgraph would provide information to let developers see
-what other functions a function is calling.  This could show that some
-unexpected functions are getting called.  DTrace has an instrumentation
-provider that generates a trace of the functions called and returned.
-
-* Background:
-
-There have been times when people in Red Hat support have narrowed a
-problem to a specific function and the functions it calls.  Rather
-than instrumenting the function's children by hand, a tapset that
-provides a dynamic callgraph would allow quicker determination of the
-functions called.  There are cases in the kernel code where the
-function being called cannot be determined statically, e.g. the
-function to call is stored in a data structure.
-
-* Target software:
-
-Ideally both kernel and user space, but kernel space only would
-be sufficient for many cases.
-
-* Type of description: tapset and scripting command
-  tapset to provide support for capturing call/return information
-  scripting commands to turn the capture on and off
-
-* Interesting probe points:
-
-* Interesting values:
-
-* Dependencies:
-- P6/x86-64 processors have the debug hardware to trap control flow changes.
-- Need to have the kernel maintain the debug hardware on a per-process basis.
-  The DebugCtlMSR is not currently stored in the context
-  (only debug registers 0, 1, 2, 3, 6, and 7 are virtualized).
-
-* Restrictions:
-  May be difficult to implement on ppc: returns may look like regular jumps,
-  and trapping on all branches could cause problems with atomic operations.
-  Won't work on pre-P6 x86 processors.
-  Won't provide data for inlined functions.
-
-* Data collection:
-  Track whether the instruction was a call or a return, and the target
-  address.
-
-* Data presentation:
-- process addresses in user space to convert them into function names
-- trace showing calls and returns
-- maybe post-process further to build a dynamic callgraph and
-  determine that a function is being called far too often
-
-* Competition:
-  DTrace already implements tracing of function calls and returns.
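-
-  To make the hardware dependency above concrete, a hypothetical sketch of
-  arming the P6 branch-trap facility (the BTF bit in the DebugCtl MSR, per
-  the IA-32 manuals); the per-process context-switch support described
-  under Dependencies would still be needed around this:
-
-    #define MSR_DEBUGCTL  0x1d9     /* IA-32 DebugCtl MSR (P6 family) */
-    #define DEBUGCTL_BTF  (1 << 1)  /* trap on branches, not every insn */
-
-    static void arm_branch_trap(void)
-    {
-            unsigned int lo, hi;
-
-            asm volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(MSR_DEBUGCTL));
-            lo |= DEBUGCTL_BTF;
-            asm volatile("wrmsr" : : "c"(MSR_DEBUGCTL), "a"(lo), "d"(hi));
-            /* with BTF set, setting EFLAGS.TF in the traced thread raises
-               a #DB trap at each taken branch (calls and returns included),
-               where a handler can record source and target addresses */
-    }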
-
-* Cross-references:
-
-* Associated files:
-
-  $dynamic_call_graph = 1; // turn on tracing of calls for thread
-  $dynamic_call_graph = 0; // turn off tracing of calls for thread
-
diff --git a/tapsets/dynamic_cg/tapset.stp b/tapsets/dynamic_cg/tapset.stp
deleted file mode 100644
index c731fac90..000000000
--- a/tapsets/dynamic_cg/tapset.stp
+++ /dev/null
@@ -1,7 +0,0 @@
-global $dynamic_call_graph
-probe kernel.perfctr.call(1) {
-  if ($dynamic_call_graph) trace_sym ($pc);
-}
-probe kernel.perfctr.return(1) {
-  if ($dynamic_call_graph) trace_sym ($pc);
-}
diff --git a/tapsets/dynamic_cg/usage.stp b/tapsets/dynamic_cg/usage.stp
deleted file mode 100644
index 1625768b5..000000000
--- a/tapsets/dynamic_cg/usage.stp
+++ /dev/null
@@ -1,30 +0,0 @@
-
-probe.kernel.sys_open.entry()
-{
-  $dynamic_call_graph = 1;
-}
-
-# What you would see in the output would be something of this kind:
-#   call sys_open
-#   call getname
-#   call do_getname
-#   return do_getname
-#   return getname
-#   call get_unused_fd
-#   call find_next_zero_bit
-#   return find_next_zero_bit
-#   return get_unused_fd
-#   call filp_open
-#   .....
-#   return sys_open
-
-# The above probe could be customized to a particular process as well,
-# like in the following:
-
-probe.kernel.sys_open.entry()
-{
-  if ($PID == 1234)
-    $dynamic_call_graph = 1;
-}
-
diff --git a/tapsets/profile/profile_tapset.txt b/tapsets/profile/profile_tapset.txt
deleted file mode 100644
index e8899dc73..000000000
--- a/tapsets/profile/profile_tapset.txt
+++ /dev/null
@@ -1,503 +0,0 @@
-* Application name: Stopwatch and Profiling for systemtap
-
-* Contact:
-  Will Cohen wcohen@redhat.com
-  Charles Spirakis charles.spirakis@intel.com
-
-* Motivation:
-  Allow software developers to improve the performance of their
-  code.  The methodologies used are stopwatch (sometimes known
-  as event counting) and profiling.
-
-* Background:
-  Will has experience with oprofile
-  Charles has experience with vtune
-
-* Target software:
-  Initially the kernel, but longer term, both kernel and user.
-
-* Type of description:
-  General information regarding requirements and usage models.
-
-* Interesting probe points:
-  When doing profiling you have "asynchronous-event" probe points
-  (i.e. you get an interrupt and you'll want to capture information
-  about where that interrupt happened).
-
-  When doing stopwatch, interesting probe points will be
-  function entry/exits, queue add/remove, queue entity lifecycle,
-  and any other code where you want to measure time
-  or events (cpu resource utilization) associated with a path of code
-  (frame buffer drawing measurements, graphics T&L pipeline
-  measurements, etc).
-
-* Interesting values:
-  For profiling, the pt_regs structure from the interrupt handler.  The
-  most commonly used items would be the instruction pointer and the
-  call stack pointer.
-
-  For stopwatch, most of the accesses are likely to be pmu read
-  operations.
-
-  In addition, given the large variety of pmu capabilities, access
-  to the pmu registers themselves (read and write) would be very
-  important.
-
-  Different pmus have different events, but for script portability,
-  we may want to have a subset of predefined events and have something
-  map that into a pmu's particular event (similar to what papi does).
-
-  Given the variety of performance events and pmu architectures, we
-  may want to try to have a standardized library/api as part of the
-  translator to map events (or specialized event information) into
-  register/value pairs used during the actual systemtap run.
-
-  ??? Classify values as consumed from lower level vs. provided to higher
-  level ???
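-
-  As a sketch of the papi-style mapping layer suggested above (the event
-  names and the table layout are illustrative, not a proposed interface):
-
-    /* map a portable event name to a cpu-specific event code */
-    struct event_map {
-            const char *generic_name;  /* what the script asks for */
-            unsigned int event_code;   /* what the pmu is programmed with */
-    };
-
-    /* example table for a Pentium M; codes from the IA-32 manuals */
-    static const struct event_map pentium_m_events[] = {
-            { "cpu_cycles",           0x79 },  /* CPU_CLK_UNHALTED */
-            { "instructions_retired", 0xc0 },  /* INST_RETIRED */
-    };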
-
-* Dependencies:
-  Need some form of arbitration of the pmu to make sure the data provided
-  is valid (see perfmon below).
-
-  Two common usage models are aggregated data (oprofile) and
-  trace history (papi/vtune).  Currently these tools all do the
-  aggregation in user mode, and we may want to look at what
-  they do and why.
-
-  The unofficial rule of thumb is that profiling should be
-  as unobtrusive as possible, and definitely < 1% overhead.
-
-  When doing stopwatch or profiling, there is a need to be able to
-  sequence the data.  For timing, this is important for accurately
-  computing start/stop deltas and watching control/data flow.
-  For profiling, it is needed to support trace history.
-
-  There needs to be a time source that has reasonable granularity
-  and is reasonably precise.
-
-  Per-thread virtualization (of time and events)
-
-  System-wide mode for pmu events
-
-* Restrictions:
-  Currently access to the pmu is a bit of a free-for-all, with no
-  single entity providing arbitration.  The perfmon2 patch for 2.6
-  (see the cross-reference section below) is attempting to
-  provide much of the infrastructure needed by profiling tools
-  (like oprofile and papi) across architectures (Pentium M, ia64
-  and x86_64 initially, though I understand Stephane has contacted
-  someone at IBM for a powerpc version as well).
-
-  Andrew Morton wants perfmon and perfctr to be merged.  Regardless
-  of what happens, both pmu libraries are geared more toward
-  user->kernel access than kernel->kernel access, and we
-  will need to see what can be EXPORT_SYMBOL()'ed to make them more
-  kernel-module friendly.
-
-* Data collection:
-  Pmu counters tend to be different widths on different
-  architectures.  It would be useful to standardize the
-  width (in software) to 64 bits to make math operations
-  (such as comparisons, deltas, etc.) easier; see the sketch below.
-
-  The goal of profiling is to go from:
-    pid/ip -> path/image -> source file/line number
-
-  This implies the need for a (reasonably quick) mechanism to
-  translate pid/ip to path/image.  Potentially reuse the dcookie
-  methodology from oprofile, but that model may need extending if there
-  is a goal to support anonymous maps (dynamically generated code).
-
-  Need the ability to map the current pid to a process name.
-
-  Need to decide how much will be handled via associative
-  arrays in the kernel and how much will be handled in user space
-  (potentially as part of post processing).  Given the volume of data
-  that can be generated during profiling, it may make sense to follow
-  the trend of current performance tools and put merging and
-  aggregation in user space instead of kernel space.
-
-  To keep the overhead of collection low, it may be useful to look
-  into having some of the information collected at interrupt
-  time and other pieces collected after the
-  interrupt (top/bottom style).  For example, although it may be
-  convenient to have a syntax like:
-
-    val = associate_image($pt_regs->eip)
-
-  it may be preferable to use a marker in the output stream instead
-  (oprofile used a dcookie) and then do a lookup later (either in the
-  kernel, adding a marker->name entry to the output stream, or in user
-  space, similar to what oprofile did).  This concept could be extended
-  to cover the lookup of the pid name as well.
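-
-  A common trick for the 64-bit widening mentioned above, assuming the
-  hardware counter is read often enough that it cannot wrap twice
-  between reads:
-
-    /* widen a narrow, wrapping hardware counter to a monotonic 64 bits */
-    struct soft_counter {
-            unsigned long long total;  /* accumulated 64-bit count */
-            unsigned long long last;   /* last raw hardware reading */
-            unsigned long long mask;   /* (1ULL << hw_width) - 1 */
-    };
-
-    static unsigned long long counter_update(struct soft_counter *c,
-                                             unsigned long long raw)
-    {
-            /* modular subtraction absorbs at most one wrap since 'last' */
-            c->total += (raw - c->last) & c->mask;
-            c->last = raw;
-            return c->total;
-    }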
-
-  Stack information will need to be collected at interrupt time
-  (based on the interrupted pt_regs->esp), so the routine to obtain
-  the stack trace should be reasonably fast.  Because probes are
-  asynchronous, the stack may be in user space.
-
-  Depending on whether support for anonymous maps is important, it may
-  be useful to have a more generic method of mapping ip->path/module,
-  which would allow dynamic code generators (usually found in user
-  space) to provide ip->image map information as part of
-  the regular systemtap data stream.  If we allow a user-mode api
-  to add data to a systemtap stream, we could have a very general
-  purpose merge/aggregation tool for profiling from a variety of
-  sources.
-
-* Data presentation:
-  Generally data will be presented to the user either as an in-order
-  stream (trace history) or aggregated in some form to produce a
-  histogram or min/max/average/std.
-
-  When aggregated, the data may be grouped by pid (each run of
-  the app provides unique data), by process name (the data for an app
-  is merged across all runs), or by the loaded image
-  (to get information about shared libraries regardless of the app
-  that loaded them).  Assuming an increase in multi-processor and
-  multi-threaded applications, grouping the data by thread group
-  id is likely to be useful as well.  Ideally, if symbols/debug
-  information is available, additional aggregation could be done
-  at the function, basic block or source line.
-
-* Competition:
-  See the cross-reference list below
-
-* Cross-references:
-  Oprofile
-
-  Oprofile is a profiling tool that provides time- and event-based
-  sampling.  Its collection methodology has a "file view" of the
-  world and captures only the minimum information needed to get
-  the image that corresponds to the interrupted instruction
-  address.  It aggregates the data (no time information) to keep
-  the total data size to a minimum even on long runs.  Oprofile
-  allows for optional "escape" sequences in a data stream to add
-  information.  It can handle non-maskable interrupts (NMI) as well
-  as maskable interrupts to obtain samples in areas where
-  maskable interrupts are normally disabled.  Work is being done
-  to allow oprofile to handle anonymous maps (i.e. dynamically
-  generated code from jvms).
-
-  http://oprofile.sourceforge.net/news/
-
-  Papi
-
-  Papi is a profiling tool that can aggregate data or keep a trace
-  history.  It uses tables to map generic event concepts (for example,
-  PAPI_TOT_CYC) into architecture-specific events (for example,
-  CPU_CLK_UNHALTED, value 0x79 on the Pentium M).  Interrupts can be
-  time based, and it can capture event counts (i.e. every 5ms,
-  capture cpu cycles and instructions retired) in addition to
-  the instruction pointer.  Papi is built on top of other performance
-  monitoring support such as ia64 perfmon and i386 perfctr in the Linux
-  kernel.
-
-  http://icl.cs.utk.edu/papi/
-
-  Perfmon2 infrastructure
-
-  Perfmon2 is a profiling infrastructure currently in the Linux 2.6
-  kernel for ia64.  It handles arbitration and virtualization
-  of the pmu resources, extends
-  the pmus to a logical 64 bits regardless of the underlying hardware
-  size, context-switches the counters when needed to allow for
-  per-process or system-wide use, and has the ability to choose a subset
-  of the cpus on a system when doing system-wide profiling.  Oprofile on
-  Linux 2.6 for ia64 has been ported to use the perfmon2 interface.
-  Currently, there are patches submitted to the Linux Kernel Mailing
-  List to port the perfmon2 infrastructure to the Pentium M and x86_64
-  on the 2.6 kernel.
-
-  http://www.hpl.hp.com/techreports/2004/HPL-2004-200R1.html
-
-  Shark
-
-  Shark is a profiling tool from Apple that focuses on time- and
-  event-based statistical stack sampling.  On each profile interrupt, in
-  addition to capturing the instruction pointer, it also captures
-  a stack trace, so you know both where you were and how you got there.
-
-  http://developer.apple.com/tools/sharkoptimize.html
-
-  Vtune
-
-  Vtune is a profiling tool that provides time- and event-based
-  sampling.  It does collection based on a "process view" of the
-  world.  It keeps a trace history so that you can aggregate the
-  data during post processing in various ways, it can capture
-  architecture-specific data in addition to ip (such as branch
-  history buffers), and it can use architecture-specific abilities
-  to get exact ip addresses for certain events.  It currently handles
-  anonymous mappings (dynamically generated code from jvms).
-
-  http://www.intel.com/software/products/vtune/vlin/index.htm
-
-* Associated files:
-  Should the usage models be split into a separate file?
-
-Usage Models:
-  Below are some typical usage models.  This isn't an attempt
-  to propose syntax; it's an attempt to create something
-  concrete enough to help people understand the goals:
-  (description, pseudo code, desired output).
-
-Description: Statistical stack sampling (a la Shark)
-
-    probe kernel.time_ms(10)
-    {
-        i = associate_image($pt_regs->eip);
-        s = stack($pt_regs->esp);
-        stp($current->pid, $pid_name, $pt_regs->eip, i, s)
-    }
-
-  Output desired:
-    For each process/process name, aggregate (histogram) based
-    on eip (regardless of how I got there), stack (what was the
-    most common calling path), or both (what was the most common
-    path to the most common eip).
-    Could be implemented by generating a trace history and letting the
-    user post-process (eats disk space, but one run can be viewed
-    multiple ways), or by having the user define what is wanted
-    in the script and doing the post processing ourselves (saves disk
-    space, but more work for us).
-
-Description: Time-based aggregation (a la oprofile)
-
-    probe kernel.time_ms(10)
-    {
-        i = associate_image($pt_regs->eip);
-        stp($current->pid, $pid_name, $pt_regs->eip, i);
-    }
-
-  Output desired:
-    Histogram separated by process name, pid/eip, pid/image
-
-Description: Time a routine, part 1 - time between the function call
-  and return:
-
-    probe kernel.function("sys_execve")
-    {
-        $thread->mystart = $timestamp
-    }
-    probe kernel.function("sys_execve").return
-    {
-        delta = $timestamp - $thread->mystart
-
-        // do statistical operations...
-    }
-
-  Output desired:
-    Be able to do statistics on the time it takes an execve
-    to execute.  The time needs to have a fine enough granularity
-    to have meaning (i.e. using jiffies probably wouldn't work),
-    and the time needs to be smp correct even if the probe entry
-    and the return execute on different processors.
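-
-  For comparison, the same measurement written directly against the 2.6
-  kprobes api (a sketch only: field names as in recent 2.6 kernels, the
-  single global timestamp ignores concurrent calls, and sched_clock() is
-  per-cpu, so the smp caveat above still applies):
-
-    #include <linux/module.h>
-    #include <linux/kprobes.h>
-    #include <linux/sched.h>
-
-    static unsigned long long t_entry;   /* one outstanding call only */
-
-    static int on_entry(struct kprobe *p, struct pt_regs *regs)
-    {
-            t_entry = sched_clock();     /* ns scale, but per-cpu */
-            return 0;
-    }
-
-    static int on_return(struct kretprobe_instance *ri,
-                         struct pt_regs *regs)
-    {
-            printk(KERN_INFO "sys_execve took %llu ns\n",
-                   sched_clock() - t_entry);
-            return 0;
-    }
-
-    static struct kprobe kp = {
-            .symbol_name = "sys_execve",
-            .pre_handler = on_entry,
-    };
-    static struct kretprobe krp = {
-            .kp.symbol_name = "sys_execve",
-            .handler        = on_return,
-            .maxactive      = 1,
-    };
-
-    static int __init timing_init(void)
-    {
-            int ret = register_kprobe(&kp);
-            return ret ? ret : register_kretprobe(&krp);
-    }
-    static void __exit timing_exit(void)
-    {
-            unregister_kretprobe(&krp);
-            unregister_kprobe(&kp);
-    }
-    module_init(timing_init);
-    module_exit(timing_exit);
-    MODULE_LICENSE("GPL");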
-
-Description: Time a routine, part 2 - count the number of events between
-  the function call and return:
-
-    probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_execve")
-    {
-        $thread->myclocks = $pmu[0];
-        $thread->myinstr_ret = $pmu[1];
-    }
-    probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_execve").return
-    {
-        $thread->myclocks = $pmu[0] - $thread->myclocks;
-        $thread->myinstr_ret = $pmu[1] - $thread->myinstr_ret;
-
-        cycles_per_instruction = $thread->myclocks / $thread->myinstr_ret
-
-        // do statistical operations...
-    }
-
-  Desired output:
-    Produce min/max/average for cycles, instructions retired,
-    and cycles_per_instruction.  The pmu must be virtualized if the
-    probe entry and probe exit can happen on different processors.  The
-    pmu should be virtualized if there can be pre-emption (or waits) in
-    the function itself, to get more useful information (the actual count
-    of events in the function vs. a count of events in the whole system
-    between when the function started and when it ended).
-
-Description: Time a routine, part 3 - reminder of threading issues
-
-    probe kernel.function("sys_fork")
-    {
-        $thread->mystart = $timestamp
-    }
-    probe kernel.function("sys_fork").return
-    {
-        delta = $timestamp - $thread->mystart
-
-        if (parent) {
-            // do statistical operations for the time it takes the parent
-        } else {
-            // do statistical operations for the time it takes the child
-        }
-    }
-
-  Desired output:
-    Produce min/max/average for the parent and the child.  The
-    time needs to have a fine enough granularity to have
-    meaning (i.e. using jiffies probably wouldn't work),
-    and the time needs to be smp correct even if the probe entry
-    and the probe return execute on different processors.
-
-Description: Time a routine, part 4 - reminder of threading issues
-
-    probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_fork")
-    {
-        $thread->myclocks = $pmu[0];
-        $thread->myinstr = $pmu[1];
-    }
-    probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_fork").return
-    {
-        $thread->myclocks = $pmu[0] - $thread->myclocks;
-        $thread->myinstr = $pmu[1] - $thread->myinstr;
-
-        cycles_per_instruction = $thread->myclocks / $thread->myinstr
-
-        if (parent) {
-            // do statistical operations...
-        } else {
-            // do statistical operations...
-        }
-    }
-
-  Desired output:
-    Produce min/max/average for cycles, instructions retired,
-    and cycles_per_instruction.  The pmu must be virtualized if the
-    probe entry and probe exit can happen on different processors.  The
-    pmu should be virtualized if there can be pre-emption (or waits) in
-    the function itself, to get more useful information (the actual count
-    of events in the function vs. a count of events in the whole system
-    between when the function started and when it ended).
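-
-  For illustration, one concrete way $pmu[n] could be read on IA-32
-  (this assumes the counter has already been programmed and enabled by
-  whatever arbiter owns the pmu, e.g. a perfmon-style layer):
-
-    /* read performance-monitoring counter 'idx'; P6-family counters
-       are only 40 bits wide, so widen as sketched earlier */
-    static inline unsigned long long read_pmc(unsigned int idx)
-    {
-            unsigned int lo, hi;
-
-            asm volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(idx));
-            return ((unsigned long long)hi << 32) | lo;
-    }
-
-  cycles_per_instruction is then just the ratio of the two deltas taken
-  at entry and return.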
-
-Description: Beginnings of "papi" style collection
-
-    probe kernel.startwatch("cpu_cycles").startwatch("instructions_retired").time_ms(10)
-    {
-        i = associate_image($pt_regs->eip);
-        stp($current->pid, $pid_name, $pt_regs->eip, i, $pmu[0], $pmu[1]);
-    }
-
-  Desired output:
-    Trace history or aggregation based on process name, image
-
-Description: Find the path leading to a high-latency cache miss
-  that stalled for more than 128 cycles (ia64 only)
-
-    probe kernel.startwatch("branch_event,pmc[12]=0x3e0f").pmu_profile("data_ear_event:1000,pmc[11]=0x5000f")
-    {
-        //
-        // on ia64, when using the data ear event, the precise eip is
-        // saved in pmd[17], so no need for pt_regs->eip (and the
-        // associated skid)...
-        //
-        i = associate_image($pmu->pmd[17]);
-        stp($current->pid, $pid_name, $pmu->pmd[17], i, // the basics
-            $pmu->pmd[2],   // precise data address
-            $pmu->pmd[3],   // latency information
-            $pmu->pmd[8],   // branch history buffer
-            $pmu->pmd[9],   //   "
-            $pmu->pmd[10],  //   "
-            $pmu->pmd[11],  //   "
-            $pmu->pmd[12],  //   "
-            $pmu->pmd[13],  //   "
-            $pmu->pmd[14],  //   "
-            $pmu->pmd[15],  //   "
-            $pmu->pmd[16]); // indication of the most recent branch
-    }
-
-  Desired output:
-    Aggregate data based on pid, process name, eip, latency, and
-    data address.  Each pmd on ia64 is 64 bits long, so capturing
-    just the 12 pmds listed here is 96 bytes of information per
-    interrupt for each cpu.  Profiling can collect a very large
-    amount of data...
-
-Description: Pmu event collection of data, but using NMI
-  instead of the regular interrupt.
-
-NMI is useful for getting visibility into locks and other code which is
-normally hidden behind interrupt-disable code.  However, handling an
-NMI is more difficult to do properly.  Potentially the compiler can be
-more restrictive about what's allowed in the handler when NMIs are
-selected as the interrupt method.
-
-    probe kernel.nmi.pmu_profile("instructions_retired:1000000")
-    {
-        i = associate_image($pt_regs->eip);
-        stp($pid_name, $pt_regs->eip, i);
-    }
-
-  Desired output:
-    Same as the earlier oprofile-style example
-
-Description: Timing items in a queue
-
-  Two possibilities - use associative arrays or post-process
-
-Associative arrays:
-
-    probe kernel.function("add queue function")
-    {
-        start[$arg->queue_entry] = $timestamp;
-    }
-    probe kernel.function("remove queue function")
-    {
-        delta = $timestamp - start[$arg->queue_entry];
-
-        // do statistics on the delta value and the queue entry
-    }
-
-Post-process:
-
-    probe kernel.function("add queue function")
-    {
-        stp("add", $timestamp, $arg->queue_entry)
-    }
-    probe kernel.function("remove queue function")
-    {
-        stp("remove", $timestamp, $arg->queue_entry)
-    }
-
-  Desired output:
-    For each queue_entry, calculate the delta and do appropriate
-    statistics.
-
-Description: Following an item as it moves to different queues/lists
-
-  Two possibilities - use associative arrays or post-process
-
-Associative arrays:
-
-    probe kernel.function("list_add")
-    {
-        delta = $timestamp - start[$arg->head, $arg->new];
-        start[$arg->head, $arg->new] = $timestamp;
-        // do statistics on the delta value and queue
-    }
-
-Post-process:
-
-    probe kernel.function("list_add")
-    {
-        stp("add", $timestamp, $arg->head, $arg->new)
-    }
-
-  Desired output:
-    For each (queue, queue_entry) pair, calculate the delta and do
-    appropriate statistics.
-
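-  A sketch of the associative-array bookkeeping above in kernel C
-  (fixed-size, collision-ignoring hash, purely to show the shape of the
-  mechanism a tapset would hide):
-
-    #include <linux/hash.h>
-
-    #define QT_BITS 10
-    static unsigned long long qt_start[1 << QT_BITS];  /* entry -> time */
-
-    static void queue_add_hook(void *entry, unsigned long long now)
-    {
-            qt_start[hash_ptr(entry, QT_BITS)] = now;
-    }
-
-    static unsigned long long queue_remove_hook(void *entry,
-                                                unsigned long long now)
-    {
-            /* collisions silently overwrite; a real tapset would chain */
-            return now - qt_start[hash_ptr(entry, QT_BITS)];
-    }
-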
diff --git a/tapsets/timestamp/timestamp_tapset.txt b/tapsets/timestamp/timestamp_tapset.txt
deleted file mode 100644
index dcbd58133..000000000
--- a/tapsets/timestamp/timestamp_tapset.txt
+++ /dev/null
@@ -1,327 +0,0 @@
-* Application name: sequence numbers and timestamps
-
-* Contact:
-  Martin Hunt hunt@redhat.com
-  Will Cohen wcohen@redhat.com
-  Charles Spirakis charles.spirakis@intel.com
-
-* Motivation:
-  On multi-processor systems, it is important to have a way
-  to correlate information gathered between cpus.  There are two
-  forms of correlation:
-
-  a) putting information into the correct sequence order
-  b) providing accurate time deltas between information
-
-  If the resolution of the time deltas is high enough, it can
-  also be used to order information.
-
-* Background:
-  Discussion started due to relayfs and per-cpu buffers, but this
-  is needed by many people.
-
-* Target software:
-  Any software which wants to correlate data that was gathered
-  on a multi-processor system, but the scope will be defined
-  specifically for systemtap's needs.
-
-* Type of description:
-  General information and discussion regarding sequencing and timing.
-
-* Interesting probe points:
-  Any probe points where you are trying to get the time between two
-  probe points.  For example, timing how long a function takes by
-  putting probe points at the function entry and function exit.
-
-* Interesting values:
-  Possible ways to order data from multiple sources include:
-
-Retrieve the sequence/time from a global area
-
-  High Precision Event Timer (HPET)
-    Possible implementation:
-      multimedia/HPET timer
-      arch/i386/kernel/timer_hpet.c
-    Advantages:
-      granularity can vary (the HPET spec says the minimum frequency of
-      the HPET timer is 10MHz, i.e. ~100ns resolution), can be treated as
-      read-only, can bypass cache update and avoid being cached at all if
-      desired, designed to be used as an smp timestamp (see specification)
-    Disadvantages:
-      may not be available on all platforms, may not be synchronized on
-      NUMA systems (i.e. counts for all processors within a numa node are
-      comparable, but counts for processors between nodes may not be
-      comparable), potential resource conflict if the timers are used by
-      other software
-
-  Real Time Clock (RTC)
-    Possible implementation:
-      "external" chip (clock chip) which has time information, accessed via
-      ioport or memory-mapped io
-    Advantages:
-      can be treated as read-only, can bypass cache update and avoid being
-      cached at all if desired
-    Disadvantages:
-      may not be available on all platforms, low granularity (for rtc,
-      ~1ms), usually slow access
-
-  ACPI Power Management Timer (pm timer)
-    Possible implementation:
-      implemented as part of the ACPI specification at 3.579545MHz
-      arch/i386/kernel/timers/timer_pm.c
-    Advantages:
-      not affected by throttling, halting or power-saving states, moderate
-      granularity (3.5MHz, ~300ns resolution), designed for use by an OS
-      to keep track of time during sleep/power states
-    Disadvantages:
-      may not be available on all platforms, slower access than the hpet
-      timer (but still much faster than RTC)
-
-  Chipset counter
-    Possible implementation:
-      timer on a processor chipset, ??SGI implementation??, do we know of
-      any other implementations?
-    Advantages:
-      likely to be based on the pci bus clock (33MHz, ~30ns) or
-      front-side-bus clock (200MHz, ~5ns)
-    Disadvantages:
-      may not be available on all platforms
-
-  Sequence Number
-    Possible implementation:
-      atomic_t global variable, cache aligned, placed in a struct to keep
-      the variable on a cache line by itself
-    Advantages:
-      guaranteed correct ordering (even on NUMA systems), architecture
-      independent, platform independent
-    Disadvantages:
-      potential for cache-line ping-pong, doesn't scale, no time
-      information (ordering data only), access can be slower on NUMA systems
-
-  Jiffies
-    Possible implementation:
-      OS counts the number of "clock interrupts" since power on.
-    Advantages:
-      platform independent, architecture independent, one writer, many
-      readers (less cache ping-pong)
-    Disadvantages:
-      low resolution (usually 10ms, sometimes 1ms).
-
-  Do_gettimeofday
-    Possible implementation:
-      arch/i386/kernel/time.c
-    Advantages:
-      platform independent, architecture independent, one writer, many
-      readers (less cache ping-pong), microsecond accuracy
-    Disadvantages:
-      the time unit increment value used by this routine changes
-      based on information from ntp (i.e. if ntp needs to speed up / slow
-      down the clock, then callers of this routine will be affected).  This
-      is a disadvantage for timing short intervals, but an advantage
-      for timing long intervals.
-
-Retrieve the sequence/time from a cpu-unique area
-
-  Timestamp counter (TSC)
-    Possible implementation:
-      count of the number of core cycles the processor has executed since
-      power on; due to the lack of synchronization between cpus, would also
-      need to keep track of which cpu the tsc came from
-    Advantages:
-      no external bus access, high granularity (cpu core cycles),
-      available on most (not all) architectures, platform independent
-    Disadvantages:
-      not synchronized between cpus; since it is a count of cpu cycles, the
-      count can be affected by throttling, halting and power-saving states,
-      and may not correlate to "actual" time (i.e. just because a 1GHz
-      processor showed a delta of 1G cycles doesn't mean 1 second has
-      passed)
-
-  APIC timer
-    Possible implementation:
-      timer implemented within the processor
-    Advantages:
-      no external bus access, moderate to high granularity (usually
-      counting based on the front-side bus clock or core clock)
-    Disadvantages:
-      not synchronized between cpus, may be affected by throttling,
-      halting/power-saving states, may not correlate to "actual" time.
-
-  PMU event
-    Possible implementation:
-      program a performance counter with a specific event related to time
-    Advantages:
-      no external bus access, moderate to high granularity (usually
-      counting based on the front-side bus clock or core clock), can be
-      virtualized to give moderate to high granularity for individual
-      thread paths
-    Disadvantages:
-      not synchronized between cpus, may be affected by throttling,
-      halting/power-saving states, may not correlate to "actual" time,
-      processor dependent
-
-  For reference, as a quick baseline, on Martin's dual-processor system,
-  he gets the following performance measurements:
-
-    kprobe overhead:            1200-1500ns (depending on OS and processor)
-    atomic read plus increment: 40ns (single processor access, no conflict)
-    monotonic_clock():          550ns
-    do_gettimeofday():          590ns
-
-* Dependencies:
-  Not Applicable
-
-* Restrictions:
-  Certain timers may already be in use by other parts of the kernel,
-  depending on how it is configured (for example, the RTC is used by the
-  watchdog code).  Some kernels may not compile in the necessary code
-  (for example, using the pm timer requires ACPI).  Some platforms
-  or architectures may not have the timer requested (for example,
-  there is no HPET timer on older systems).
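-
-  To make the sequence-number option above concrete, a minimal sketch
-  (the cache-line isolation matters more than the exact macro used):
-
-    #include <asm/atomic.h>
-    #include <linux/cache.h>
-
-    /* global sequence counter, kept on its own cache line to limit
-       ping-pong between cpus */
-    static struct {
-            atomic_t seq;
-    } stp_seq ____cacheline_aligned_in_smp;
-
-    static inline int next_seq(void)
-    {
-            /* atomic_inc_return yields a globally ordered value, even on
-               NUMA, at the cost of bouncing this line between writers */
-            return atomic_inc_return(&stp_seq.seq);
-    }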
-
-* Data collection:
-  For data collection, it is probably best to keep the concepts of
-  sequence ordering and timestamping separate within
-  systemtap (for both the user and the implementation).
-
-  For sequence ordering, the initial implementation should use ?? the
-  atomic_t form (since it is guaranteed
-  to be platform and architecture neutral) ?? and modify/change the
-  implementation later if there is a problem.
-
-  For the timestamp, the initial implementation should use
-  ?? hpet timer ?? pm timer ?? do_gettimeofday ?? cpu # + tsc ??
-  some combination (do_gettimeofday + cpu # & low bits of tsc)?
-
-  We could do something like what LTT does (see below) to
-  generate 64-bit timestamps containing the nanoseconds
-  since Jan 1, 1970.
-
-  Assuming the implementation keeps these concepts separate now
-  (ordering data vs. timing deltas), it is always possible to
-  merge them in the future if a high-granularity, numa/smp-
-  synchronized time source becomes available for a large number
-  of platforms and/or processors.
-
-* Data presentation:
-  In general, users prefer output which is based on "actual" time (i.e.
-  they prefer an output that says the delta is XXX nanoseconds instead
-  of YYY cpu cycles).  Most of the time users want deltas (how long did
-  this take), but occasionally they want absolute times (when / at what
-  time was this information collected).
-
-* Competition:
-  DTrace has output in nanoseconds (and it is comparable between
-  processors on an mp system), but it is unclear what the actual
-  resolution is.  Even if the Sparc machines do have hardware
-  that provides nanosecond resolution, on x86-64 they are likely
-  to have the same problems as discussed here, since the Solaris
-  Opteron box tends to be a pretty vanilla box.
-
-  From Joshua Stone (joshua.i.stone at intel.com):
-
-  == BEGIN ==
-  DTrace gives you three built-in variables:
-
-  uint64_t timestamp: The current value of a nanosecond timestamp
-  counter.  This counter increments from an arbitrary point in the
-  past and should only be used for relative computations.
-
-  uint64_t vtimestamp: The current value of a nanosecond timestamp
-  counter that is virtualized to the amount of time that the current
-  thread has been running on a CPU, minus the time spent in DTrace
-  predicates and actions.  This counter increments from an arbitrary
-  point in the past and should only be used for relative time
-  computations.
-
-  uint64_t walltimestamp: The current number of nanoseconds since
-  00:00 Universal Coordinated Time, January 1, 1970.
-
-  As for how they are implemented, the only detail I found is that
-  timestamp is "similar to the Solaris library routine gethrtime".
-  The manpage for gethrtime is here:
-  http://docs.sun.com/app/docs/doc/816-5168/6mbb3hr8u?a=view
-  == END ==
-
-  What LTT does:
-
-  "Cycle counters are fast to read but may reflect time
-  inaccurately.  Indeed, the exact clock frequency varies
-  with time as the processor temperature changes, influenced
-  by the external temperature and its workload.  Moreover, in
-  SMP systems, the clock of individual processors may vary
-  independently.
-
-  LTT corrects the clock inaccuracy by reading the real time
-  clock value and the 64-bit cycle counter periodically, at
-  the beginning of each block, and at each 10ms.  This way, it
-  is sufficient to read only the lower 32 bits of the cycle
-  counter at each event.  The associated real time value may
-  then be obtained by linear interpolation between the nearest
-  full cycle counter and real time values.  Therefore, for the
-  average cost of reading and storing the lower 32 bits of the
-  cycle counter at each event, the real time with full resolution
-  is obtained at analysis time."
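-
-  The interpolation LTT describes could be sketched as follows (names
-  are hypothetical; assumes the anchor is refreshed often enough that
-  every event falls within one 32-bit wrap of it):
-
-    /* periodic anchor: full cycle count paired with a real-time reading */
-    struct time_anchor {
-            unsigned long long tsc0;  /* full 64-bit cycle count at anchor */
-            unsigned long long ns0;   /* real time (ns) at the same instant */
-            unsigned long ns_per_1k;  /* measured ns per 1024 cycles */
-    };
-
-    /* reconstruct real time from the 32-bit cycle stamp stored per event */
-    static unsigned long long event_time_ns(const struct time_anchor *a,
-                                            unsigned int tsc_lo32)
-    {
-            unsigned long long cycles =
-                    (a->tsc0 & ~0xffffffffULL) | tsc_lo32;
-            if (cycles < a->tsc0)         /* low word wrapped past anchor */
-                    cycles += 1ULL << 32;
-            return a->ns0 + ((cycles - a->tsc0) * a->ns_per_1k >> 10);
-    }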
-
-* Cross-references:
-  The profile tapset depends heavily on sequencing and time, both for
-  ordering data (i.e. taking a trace history) and for high
-  granularity (when calculating time deltas).
-
-* Associated files:
-
-  Profile tapset requirements
-  .../src/tapsets/profile/profile_tapset.txt
-
-  Intel high precision event timers specification:
-  http://www.intel.com/hardwaredesign/hpetspec.htm
-
-  ACPI specification:
-  http://www.acpi.info/DOWNLOADS/ACPIspec-2-0b.pdf
-
-  From an internal email sent by Tony Luck (tony.luck at intel.com)
-  regarding a clustered environment.  For the summary below, hpet and
-  pm timer were not an option.  For systemtap, they should be considered,
-  especially since the pm timer and hpet were designed to be timestamps.
-
-  == BEGIN ==
-  For extremely short intervals (<100ns), get some h/w help (oscilloscope
-  or logic analyser).  Delays reading the TSC and pipeline effects could
-  skew your results horribly.  Having a 2GHz clock doesn't mean that you
-  can measure 500ps intervals.
-
-  For short intervals (100ns to 10ms), the TSC is your best choice ... but
-  you need to sample it on the same cpu, and converting the difference
-  between two TSC values to real time will require some system-dependent
-  math to find the nominal frequency of the system (you may be able to
-  ignore temperature effects, unless your system is in an extremely
-  hostile environment).  But beware of systems that change the TSC rate
-  when making frequency adjustments for power saving.  It shouldn't be
-  hard to measure the system clock frequency to about five significant
-  digits of accuracy; /proc/cpuinfo is probably good enough.
-
-  For medium intervals (10ms to a minute), "gettimeofday()" or
-  "clock_gettime()" on a system *NOT* running NTP may be best, but you
-  will need to adjust for systematic error to account for the system
-  clock running fast/slow.  Many Linux systems ship with a utility named
-  "clockdiff" that you can use to measure the system drift against a
-  reference system (a system that is nearby on the network, running NTP,
-  preferably a low-"stratum" one).
-
-  Just run clockdiff every five minutes for an hour or two, and plot the
-  results to see what systematic drift your system has without NTP.  N.B.
-  if you find the drift is > 10 seconds per day, then NTP may have
-  trouble keeping this system synced using only drift corrections;
-  you might see "steps" when running NTP.  Check /var/log/messages for
-  complaints from NTP.
-
-  For long intervals (above a minute), you need "gettimeofday()" on a
-  system that uses NTP to keep it in touch with reality.  Assuming
-  reasonable network connectivity, NTP will maintain the time within a
-  small number of milliseconds of reality ... so your results should be
-  good to 4-5 significant figures for 1-minute intervals, and better for
-  longer intervals.
-  == END ==
-
-- 
2.43.5