Created attachment 11688 [details] Sample stap script to reproduce the p2 slowness Even though fche introduced a tagged_p() and find_string_memoized map to optimize this last month, it is still slow when a lot of embeddedcode nodes are present in the parse tree (or AST). A lot of time is spent on the rbtree as well as the string comparisons. My first attempt of addressing this with boolean flags gave very good performance result but fche still has his concerns. This patch was submitted to the mailing list: https://sourceware.org/ml/systemtap/2019-q1/msg00071.html Using fche's suggested embedded* object level memo cache based on std::unordered_map can yield 55% p2 time reduction, but still far from the improvement of my original patch. The patch is here: https://pastebin.com/vCbxcwdf There's still quite some overhead involved with the hash table itself and std::string objects' construction and destruction, according to the corresponding CPU flame graphs.. Not even on the same order of magnitude. And a global std::unordered_map would make the performance 2x *worse*. The patch is here: https://pastebin.com/HLPcmyLW According to the flame graph, the majority of the CPU time now spends on the hash function on string pairs. I've prepared an artificial .stp script which resembles my real .stp script. I've attached it to this PR. Running it using the unpatched stap gives Pass 2: analyzed script: 1 probe, 1015 functions, 1 embed, 999 globals using 373460virt/88080res/7724shr/80368data kb, in 7740usr/0sys/7742real ms. while using my original patch doing boolean flags, it only takes a tiny fraction of time on P2: Pass 2: analyzed script: 1 probe, 1015 functions, 1 embed, 999 globals using 373476virt/88248res/7888shr/80392data kb, in 860usr/0sys/858real ms. Not on the same order of magnitude at all.
Got some flame graphs for the generated .stp sample script attached to my previous comment: Classic CPU flame graph: http://openresty.org/misc/flamegraph/stap-unordered-map-generated-script-2019-03-19.svg Reversed CPU flame graph: http://openresty.org/misc/flamegraph/inv-stap-unordered-map-generated-script-2019-03-19.svg The stap has got the patch here: https://pastebin.com/vCbxcwdf And you can see all those hash table overhead introduced by `std::unordered_map::find()`.
commit 7ab4d70f5 commit b9392d1e3 constitute an algorithmic fix that takes pass-2 runtime from 22s (stap 4.0 release) through 19s (git HEAD^^) down to 0.3s (git HEAD)