Bug 24363 - Slow pass 2 due to varuse_collecting_visitor::visit_embeddedcode()
Status: RESOLVED FIXED
Alias: None
Product: systemtap
Classification: Unclassified
Component: translator
Version: unspecified
Importance: P2 normal
Target Milestone: ---
Assignee: Unassigned
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-03-19 22:18 UTC by agentzh
Modified: 2019-03-22 15:23 UTC
CC: 1 user

See Also:
Host:
Target:
Build:
Last reconfirmed:


Attachments
Sample stap script to reproduce the p2 slowness (42.96 KB, text/plain)
2019-03-19 22:18 UTC, agentzh

Description agentzh 2019-03-19 22:18:08 UTC
Created attachment 11688 [details]
Sample stap script to reproduce the p2 slowness

Even though fche introduced tagged_p() and a find_string_memoized map to optimize this last month, it is still slow when many embeddedcode nodes are present in the parse tree (or AST). Much of the time is spent in the rbtree lookups and in string comparisons.

My first attempt at addressing this, using boolean flags, gave very good performance results, but fche still has concerns about it. That patch was submitted to the mailing list: https://sourceware.org/ml/systemtap/2019-q1/msg00071.html

Using fche's suggested embedded*-object-level memo cache based on std::unordered_map yields a 55% pass-2 time reduction, but that is still far from the improvement of my original patch. The patch is here: https://pastebin.com/vCbxcwdf According to the corresponding CPU flame graphs, there is still considerable overhead from the hash table itself and from the construction and destruction of std::string objects.

The two are not even on the same order of magnitude. A global std::unordered_map makes performance 2x *worse*. That patch is here: https://pastebin.com/HLPcmyLW According to the flame graph, the majority of the CPU time is now spent in the hash function on string pairs.

I've prepared an artificial .stp script that resembles my real one; it is attached to this PR. Running it with unpatched stap gives

Pass 2: analyzed script: 1 probe, 1015 functions, 1 embed, 999 globals using 373460virt/88080res/7724shr/80368data kb, in 7740usr/0sys/7742real ms.

while with my original boolean-flags patch, pass 2 takes only a tiny fraction of that time:

Pass 2: analyzed script: 1 probe, 1015 functions, 1 embed, 999 globals using 373476virt/88248res/7888shr/80392data kb, in 860usr/0sys/858real ms.

Not on the same order of magnitude at all.
Comment 1 agentzh 2019-03-19 22:27:19 UTC
Here are some CPU flame graphs for the generated .stp sample script attached in my previous comment:

Classic CPU flame graph:

http://openresty.org/misc/flamegraph/stap-unordered-map-generated-script-2019-03-19.svg

Reversed CPU flame graph:

http://openresty.org/misc/flamegraph/inv-stap-unordered-map-generated-script-2019-03-19.svg

This stap build has the patch from https://pastebin.com/vCbxcwdf applied. You can see all the hash-table overhead introduced by `std::unordered_map::find()`.
Comment 2 Frank Ch. Eigler 2019-03-22 15:23:50 UTC
commit 7ab4d70f5
commit b9392d1e3

constitute an algorithmic fix that takes pass-2 runtime
from 22s (stap 4.0 release)
through 19s (git HEAD^^)
down to 0.3s (git HEAD)