Sleeping Beauties
Problem
Some threads, including kernel threads, sometimes go to sleep (in "D" state) for too long. Wouldn't it be nice to find out what they are doing?
Scripts
This script intercepts the entry and exit to the wait_for_completion scheduler function, recording a timestamp. If the thread sleeps for "too long" (as defined by a configurable number of time units, like 10 milliseconds), then a separate timer-based probe will describe the thread as "impatient", and print a stack traceback at the start of wait. If this happens, the return probe will also print a line to report the complete elapsed "bored" (blocked) time.
function time () { return gettimeofday_ms() } global time_name = "ms" global boredom = 10 # in time units probe kernel.function("wait_for_completion").call { t=tid() back[t]=backtrace() name[t]=execname() backtime[t]=time() delete bored[t] } probe kernel.function("wait_for_completion").return { t=tid() if ([t] in bored) { patience = time() - backtime[t] printf ("thread %d (%s) bored for %d %s\n", t, name[t], patience, time_name) } delete bored[t] delete back[t] delete name[t] delete backtime[t] } probe timer.profile { foreach (tid+ in back) { if ([tid] in bored) continue patience = time() - backtime[tid] if (patience >= boredom) { printf ("thread %d (%s) impatient after %d %s\n", tid, name[tid], patience, time_name) print_stack (back[tid]) printf ("\n") bored[tid] = 1 # defer further reports to wakeup } } } global name, back, backtime, bored
Output
The 10 millisecond default boredom interval will trigger relatively frequently on a machine with ordinary disk drives. If this number is raised to hundreds of milliseconds, one will see reports of only the particularly long-duration blockages.
# stap sleepingBeauties.stp thread 2086 (make) impatient after 19 ms 0xc05edec6 : wait_for_completion+0x1/0x98 [] 0x00000000 0x00000bac 0xf0aace87 : probe_1524+0x1e8/0x56c [stap_e37f8d9fbd67c1df1215efde25cd0604_2362 ] 0xf0aac634 : enter_kprobe_probe+0xfc/0x217 [stap_e37f8d9fbd67c1df1215efde25cd06 04_2362] 0xc05f2963 : aggr_pre_handler+0x35/0x75 [] 0xc05f1193 : kprobe_exceptions_notify+0x187/0x3db [] 0xc05f1d70 : notifier_call_chain+0x2b/0x4a [] 0xc05f1dab : __atomic_notifier_call_chain+0x1c/0x22 [] 0xc05f1dcb : atomic_notifier_call_chain+0x1a/0x1c [] 0x0c04380f thread 2086 (make) bored for 19 ms thread 2106 (gcc) impatient after 20 ms 0xc05edec6 : wait_for_completion+0x1/0x98 [] 0x00000000 0x00000bac 0xf0aace87 : probe_1524+0x1e8/0x56c [stap_e37f8d9fbd67c1df1215efde25cd0604_2362 ] 0xf0aac634 : enter_kprobe_probe+0xfc/0x217 [stap_e37f8d9fbd67c1df1215efde25cd06 04_2362] 0xc05f2963 : aggr_pre_handler+0x35/0x75 [] 0xc05f1193 : kprobe_exceptions_notify+0x187/0x3db [] 0xc05f1d70 : notifier_call_chain+0x2b/0x4a [] 0xc05f1dab : __atomic_notifier_call_chain+0x1c/0x22 [] 0xc05f1dcb : atomic_notifier_call_chain+0x1a/0x1c [] 0x0c04380f thread 2106 (gcc) bored for 43 ms thread 2238 (collect2) impatient after 12 ms 0xc05edec6 : wait_for_completion+0x1/0x98 [] 0x00000000 0x00000bac 0xf0aace87 : probe_1524+0x1e8/0x56c [stap_e37f8d9fbd67c1df1215efde25cd0604_2362 ] 0xf0aac634 : enter_kprobe_probe+0xfc/0x217 [stap_e37f8d9fbd67c1df1215efde25cd06 04_2362] 0xc05f2963 : aggr_pre_handler+0x35/0x75 [] 0xc05f1193 : kprobe_exceptions_notify+0x187/0x3db [] 0xc05f1d70 : notifier_call_chain+0x2b/0x4a [] 0xc05f1dab : __atomic_notifier_call_chain+0x1c/0x22 [] 0xc05f1dcb : atomic_notifier_call_chain+0x1a/0x1c [] 0x0c04380f thread 2249 (collect2) impatient after 13 ms 0xc05edec6 : wait_for_completion+0x1/0x98 [] 0x00000000 0x00000bac 0xf0aace87 : probe_1524+0x1e8/0x56c [stap_e37f8d9fbd67c1df1215efde25cd0604_2362 ] 0xf0aac634 : enter_kprobe_probe+0xfc/0x217 [stap_e37f8d9fbd67c1df1215efde25cd06 04_2362] 0xc05f2963 : aggr_pre_handler+0x35/0x75 [] 0xc05f1193 : kprobe_exceptions_notify+0x187/0x3db [] 0xc05f1d70 : notifier_call_chain+0x2b/0x4a [] 0xc05f1dab : __atomic_notifier_call_chain+0x1c/0x22 [] 0xc05f1dcb : atomic_notifier_call_chain+0x1a/0x1c [] 0x0c04380f thread 2238 (collect2) bored for 121 ms thread 2249 (collect2) bored for 28 ms
Lessons
Systemtap's stack backtrace function is constantly improving ... and exproving too. So the sample output above should have had a far more informative message than here. http://sources.redhat.com/PR4633
The script could omit the timer.profile probe and just print overlong wait reports within the .return probe itself. A problem with this is that if an indefinitely long blockage occurs, one might never see the report.
The script could intercept thread block events in other places - or hook the other wait_for_completion_... variants. This could be useful for NFS-related problems.
It should be possible to have the system automatically identify excessive delays in terms of standard deviations or such other computed measures. We'll leave that to another "war story".
A begin probe really should report the boredom and timekeeping parameters.