This work is already in progress; this bug simply tracks the completion of a crash-avoiding alternative to array/buffer allocations.
*** Bug 3592 has been marked as a duplicate of this bug. ***
*** Bug 3593 has been marked as a duplicate of this bug. ***
If the intent is that the 2006-11-15 STP_ALLOC_FLAGS-related changes are sufficient, it would be good to prove this with some tests in the suite.
Changes to the translator and runtime over the past two weeks have, AFAICT, fixed the crashing issue. There is no test case because we have not yet addressed the more general problem of poor interaction with Linux's overcommitting allocator and oom-killer. On one of my test systems I sometimes see oom-killer being invoked and killing staprun. This leaves an orphaned systemtap module still in memory. This is unacceptable. Do you want to leave this BZ open for the test case, or just change the summary?
(In reply to comment #4)
> Changes to the translator and runtime over the past two weeks have, AFAICT,
> fixed the crashing issue.

OK, a test case for even mild scenarios would be good.

> There is no test case because we have not yet addressed the more general problem
> of poor interaction with Linux's overcommitting allocator and oom-killer. On
> one of my test systems I sometimes see oom-killer being invoked and killing
> staprun.

When? During probe startup? After? Well after?

> This leaves an orphaned systemtap module still in memory. This is
> unacceptable.

Really? Nothing much must break if staprun happens to be killed by an erroneous kill -9. The module should be removable cleanly with rmmod at any time. The module could self-terminate if it detects staprun going away suddenly (though I thought it already did that at one point).

> Do you want to leave this BZ open for the test case, or just change the summary?

Both. :-)
> > There is no test case because we have not yet addressed the more general problem
> > of poor interaction with Linux's overcommitting allocator and oom-killer. On
> > one of my test systems I sometimes see oom-killer being invoked and killing
> > staprun.
>
> When? During probe startup? After? Well after?

If you set MAXMAPENTRIES too large, it will happen before probe startup, but only rarely and only on VMware. However, you can imagine that if MAXMAPENTRIES is set just right, all of systemtap's memory could be allocated successfully, and then some other app touches memory it thinks it has allocated, that memory isn't really available, and oom-killer gets invoked.

> > This leaves an orphaned systemtap module still in memory. This is
> > unacceptable.
>
> Really? Nothing much must break if staprun happens to be killed by an
> erroneous kill -9.

Nothing breaks, except that with the caching, we cannot rerun the script.

> The module should be removable cleanly with rmmod
> at any time.

It can be.

> The module could self-terminate if it detects staprun
> going away suddenly (though I thought it already did that at one point).

Yeah, that's the problem. If there is a way for a module to unload itself, I don't know about it. That's why my preferred approach is to force oom-killer to kill stap and not staprun. Killing stap would be detected by staprun, which would unload the module and then exit itself.
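To illustrate the "make oom-killer prefer one process over another" idea above, here is a minimal sketch in C. It assumes the /proc/<pid>/oom_score_adj interface found on current Linux kernels (range -1000 to 1000; higher values make a process a more likely victim), which postdates this discussion and is not something systemtap is confirmed to use. The helper name set_oom_score_adj is hypothetical. Raising a process's own score needs no privilege; lowering it generally requires CAP_SYS_RESOURCE.

```c
#include <stdio.h>

/*
 * Hypothetical helper: write an OOM score adjustment to `path`
 * (for example "/proc/self/oom_score_adj").
 * Returns 0 on success, -1 on a bad value or an I/O failure.
 */
int set_oom_score_adj(const char *path, int adj)
{
    FILE *f;
    int ok;

    /* The kernel accepts values from -1000 to 1000 inclusive. */
    if (adj < -1000 || adj > 1000)
        return -1;

    f = fopen(path, "w");
    if (!f)
        return -1;

    ok = fprintf(f, "%d\n", adj) > 0;
    /* The write may only be flushed (and rejected) at close time. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Under these assumptions, a process that wants to be the preferred victim could call set_oom_score_adj("/proc/self/oom_score_adj", 1000) early in its startup. As the next comment points out, this only biases the choice; it does not guarantee which process is killed.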
> > > This leaves an orphaned systemtap module still in memory. This is
> > > unacceptable.
> >
> > Really? Nothing much must break if staprun happens to be killed by an
> > erroneous kill -9.
>
> Nothing breaks, except with the caching, we cannot rerun the script.

OK, that's not that bad. staprun could print a better error message, and suggest rmmod'ing the duplicate.

> > The module could self-terminate if it detects staprun
> > going away suddenly (though I thought it already did that at one point).
>
> Yeah, that's the problem. If there is a way for a module to unload itself, I
> don't know about it.

Right. At least, we could run the shutdown code to release memory and unregister the probes, and spit out a printk as an explanation.

> That's why my preferred approach is to force oom-killer to
> kill stap and not staprun.

Thing is, both stap and staprun will ideally use up rather little core during actual execution. If the OOM guy is hungry, both may well get the knife.
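The suggestion above that staprun could detect a leftover module and recommend rmmod can be sketched roughly as follows. This is only an illustration, not staprun's actual code: the function name module_is_loaded is hypothetical, and it assumes the usual /proc/modules layout in which each line begins with the module name followed by a space.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan a module list in /proc/modules format.
 * Returns 1 if a line starts with `name` followed by a space,
 * 0 if no such line exists, and -1 if `path` cannot be opened.
 */
int module_is_loaded(const char *path, const char *name)
{
    char buf[512];
    size_t n = strlen(name);
    int at_line_start = 1;  /* false while inside an over-long line */
    int found = 0;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;

    while (fgets(buf, sizeof buf, f)) {
        size_t len = strlen(buf);

        /* Only compare at a true line start, never mid-line. */
        if (at_line_start && strncmp(buf, name, n) == 0 && buf[n] == ' ') {
            found = 1;
            break;
        }
        at_line_start = (len > 0 && buf[len - 1] == '\n');
    }

    fclose(f);
    return found;
}
```

A caller could run module_is_loaded("/proc/modules", modname) before inserting a cached module and, if it returns 1, print a message suggesting that the user rmmod the stale copy first.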
This is fixed except for the discussed oom-killer interaction. I opened 4815 as a new PR for just that issue.