Bug 17461 - probing process.end crashes on busy systems
Summary: probing process.end crashes on busy systems
Status: RESOLVED WORKSFORME
Alias: None
Product: systemtap
Classification: Unclassified
Component: runtime
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Unassigned
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-10-06 20:49 UTC by Jonathan Lebon
Modified: 2021-05-04 01:06 UTC
CC List: 2 users

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
crash_testcase.exp (432 bytes, text/plain)
2014-10-06 20:49 UTC, Jonathan Lebon
dmesg.log (2.41 KB, text/plain)
2014-10-06 20:50 UTC, Jonathan Lebon
dmesg.log (666 bytes, text/plain)
2015-05-13 20:31 UTC, Jonathan Lebon

Description Jonathan Lebon 2014-10-06 20:49:36 UTC
Created attachment 7816
crash_testcase.exp

Running the attached simple script on a busy system (one where many processes are created and destroyed quickly) eventually causes the system to lock up. It can take a while to occur (e.g. 1-2 hours), but it always does. So far I haven't been able to determine the cause of the issue, although the backtraces might implicate utrace.
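
For anyone trying to reproduce this without a naturally busy machine, the process churn can be approximated with a small load generator, run alongside a process.end probe. This is only a rough sketch and is not the attached crash_testcase.exp; the function name churn_processes, the batch_size parameter, and the choice of child command are all hypothetical.

```python
import subprocess
import sys
import time


def churn_processes(duration_s: float, batch_size: int = 16) -> int:
    """Spawn and reap short-lived children for roughly duration_s seconds.

    Returns the number of children that were started and reaped.
    """
    # Assumed child workload: an interpreter that exits immediately,
    # so each child produces one process exit shortly after it starts.
    child_cmd = [sys.executable, "-c", "pass"]
    deadline = time.monotonic() + duration_s
    reaped = 0
    while time.monotonic() < deadline:
        # Start a whole batch first so several exits land close together.
        children = [subprocess.Popen(child_cmd) for _ in range(batch_size)]
        for child in children:
            child.wait()
            reaped += 1
    return reaped
```

Calling churn_processes(7200) would keep this going for about two hours, which matches the upper end of the time it took for the lockup to appear here. Whether this particular load pattern is enough to trigger the problem is untested.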
Comment 1 Jonathan Lebon 2014-10-06 20:50:07 UTC
Created attachment 7817
dmesg.log
Comment 2 Jonathan Lebon 2014-10-06 20:52:47 UTC
Forgot to add: this happened on f20 (kernel 3.16.2-200) with git stap at least as of commit 3525152, but also earlier (including prior to the rt patches). Will try to do a bisect.
Comment 3 David Smith 2014-10-06 21:11:58 UTC
I'd certainly suspect utrace, especially since I see utrace_free() in your dmesg output. However, I also see _raw_spin_lock, and that's got me confused. We added some patches recently to add support for realtime kernels, but we shouldn't be using raw spinlocks anywhere but realtime kernels.

The only real utrace change lately was the following:

====
commit d9d07e99777c6e7aaaa8db0049c5fd5e5a2f01b0
Author: David Smith <dsmith@redhat.com>
Date:   Fri Jul 18 15:49:39 2014 -0500

    Fixed PR17181 by making utrace handle interrupting processes better.
====
Comment 4 Jonathan Lebon 2015-05-13 20:31:53 UTC
Created attachment 8312
dmesg.log

This is still an issue on the latest f20 3.19.5 with the latest git stap. Interestingly, adding debug statements in utrace_free() confirms that the crash does not happen there, but the rest of the stack is still very similar (showing a backtrace coming from exit() related calls).
Comment 5 Frank Ch. Eigler 2021-05-04 01:06:24 UTC
Running this test on a rawhide system (5.13-rc0 kernel, 4.5-rc stap), it's solid.