[PATCH 0/3] Performance tweaks for libdw
- From: Josh Stone <jistone at redhat dot com>
- To: elfutils-devel at lists dot fedorahosted dot org
- Date: Tue, 10 Dec 2013 17:35:39 -0800
I did some investigation into libdw performance hotspots, and came up
with a few tweaks that in total trim nearly 1/3 of the running time. I'm
running on Mark's recent empty-location patch, and I primarily used
"tests/varlocs -k >/dev/null" as a moderately long-running benchmark.
I'm using gcc-4.8.2-1.fc20.x86_64 and kernel-3.11.9-300.fc20.x86_64,
running on an i7-2600.
The perf profile initially looked like this:
Samples: 343K of event 'cycles', Event count (approx.): 318932712301
33.69% varlocs libdw.so [.] __libdw_find_attr
16.47% varlocs libdw.so [.] lookup.isra.0
15.63% varlocs libdw.so [.] __libdw_form_val_len
13.11% varlocs libdw.so [.] dwarf_siblingof
4.84% varlocs libdw.so [.] dwarf_tag
4.62% varlocs libdw.so [.] walk_children.4364
2.35% varlocs libdw.so [.] __libdw_findabbrev
2.32% varlocs libdw.so [.] Dwarf_Abbrev_Hash_find
1.26% varlocs libdw.so [.] dwarf_child
Patch 1 addresses form_val_len with an inlined fast path for forms with
constant length. Patch 2 is a rework of get_uleb128 and get_sleb128,
which are significant in find_attr and elsewhere. Patch 3 addresses the
hash lookup which is called often to find DIE abbreviations.
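For readers unfamiliar with the encoding behind get_uleb128 and get_sleb128, here is a minimal sketch of LEB128 decoding. It is not the code from patch 2: the names read_uleb128 and read_sleb128 and the explicit end pointer are assumptions made for illustration. It only shows why a one-byte fast path pays off, since most values in DWARF data fit in a single byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not the elfutils API): decode one unsigned LEB128
   value starting at *addrp, never reading at or beyond END, and advance
   *addrp past the bytes consumed.  */
static uint64_t
read_uleb128 (const unsigned char **addrp, const unsigned char *end)
{
  const unsigned char *p = *addrp;
  assert (p != NULL && p < end);

  /* Fast path: a value below 128 is a single byte with the high bit clear,
     so the loop below is skipped entirely.  */
  unsigned char byte = *p++;
  uint64_t value = byte & 0x7f;
  unsigned int shift = 7;

  while ((byte & 0x80) != 0 && p < end)
    {
      byte = *p++;
      if (shift < 64)		/* Ignore bits that cannot fit in 64 bits.  */
	value |= (uint64_t) (byte & 0x7f) << shift;
      shift += 7;
    }

  *addrp = p;
  return value;
}

/* Hypothetical helper: the signed variant.  The payload bits are gathered
   the same way, then sign-extended from bit 6 of the last byte read.  */
static int64_t
read_sleb128 (const unsigned char **addrp, const unsigned char *end)
{
  const unsigned char *p = *addrp;
  assert (p != NULL && p < end);

  uint64_t value = 0;
  unsigned int shift = 0;
  unsigned char byte;

  do
    {
      byte = *p++;
      if (shift < 64)
	value |= (uint64_t) (byte & 0x7f) << shift;
      shift += 7;
    }
  while ((byte & 0x80) != 0 && p < end);

  if (shift < 64 && (byte & 0x40) != 0)
    value |= ~(uint64_t) 0 << shift;

  /* Conversion of an out-of-range value is implementation-defined in C;
     gcc wraps modulo 2^64, which is what is wanted here.  */
  return *addrp = p, (int64_t) value;
}
```

With this sketch, the three bytes 0xe5 0x8e 0x26 decode to 624485 as unsigned, and 0xc0 0xbb 0x78 decode to -123456 as signed.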
The perf profile now looks like this:
Samples: 229K of event 'cycles', Event count (approx.): 213925592727
44.63% varlocs libdw.so [.] __libdw_find_attr
22.28% varlocs libdw.so [.] dwarf_siblingof
7.64% varlocs libdw.so [.] walk_children.4388
7.07% varlocs libdw.so [.] dwarf_tag
5.18% varlocs libdw.so [.] __libdw_findabbrev
2.88% varlocs libdw.so [.] __libdw_form_val_compute_len
2.11% varlocs libdw.so [.] dwarf_child
1.44% varlocs libdw.so [.] __libdw_formref
1.12% varlocs libdw.so [.] scope_visitor
The remaining busy work is simply walking through attributes, from DIE
to DIE. I believe optimizing this further will be hard without keeping
track of DIE lengths somewhere, which is a lot to cache. Putting the
length in Dwarf_Die itself is not feasible, because those are short-
lived and frequently recreated.
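To make the cost concrete, here is a rough sketch of what skipping over one DIE's attribute values involves. This is not libdw code: form_len and skip_attrs are hypothetical names, the list of forms is passed in directly rather than read from an abbreviation, and only a handful of forms are handled (forms whose size depends on the CU, such as DW_FORM_addr or DW_FORM_strp, are left out). The form codes are the values from the DWARF specification. Fixed-size forms are a plain addition, but strings and LEB128 values have to be scanned byte by byte, which is why reaching the next DIE is not free.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A subset of DWARF form codes, with values as given in the DWARF spec.  */
enum
{
  DW_FORM_data2 = 0x05, DW_FORM_data4 = 0x06, DW_FORM_data8 = 0x07,
  DW_FORM_string = 0x08, DW_FORM_data1 = 0x0b, DW_FORM_flag = 0x0c,
  DW_FORM_sdata = 0x0d, DW_FORM_udata = 0x0f, DW_FORM_ref1 = 0x11,
  DW_FORM_ref2 = 0x12, DW_FORM_ref4 = 0x13, DW_FORM_ref8 = 0x14,
  DW_FORM_flag_present = 0x19
};

#define BAD_LEN ((size_t) -1)

/* Hypothetical helper: number of bytes occupied by a value of FORM that
   starts at P, or BAD_LEN if the form is unsupported here or the data is
   truncated before END.  */
static size_t
form_len (unsigned int form, const unsigned char *p, const unsigned char *end)
{
  assert (p != NULL && p <= end);
  size_t avail = (size_t) (end - p);
  size_t len;

  switch (form)
    {
    /* Constant-length forms: no need to look at the data at all.  */
    case DW_FORM_flag_present:
      len = 0;
      break;
    case DW_FORM_data1: case DW_FORM_ref1: case DW_FORM_flag:
      len = 1;
      break;
    case DW_FORM_data2: case DW_FORM_ref2:
      len = 2;
      break;
    case DW_FORM_data4: case DW_FORM_ref4:
      len = 4;
      break;
    case DW_FORM_data8: case DW_FORM_ref8:
      len = 8;
      break;

    /* Variable-length forms: the data itself must be scanned.  */
    case DW_FORM_string:
      {
	const unsigned char *nul = memchr (p, '\0', avail);
	if (nul == NULL)
	  return BAD_LEN;
	len = (size_t) (nul - p) + 1;
      }
      break;
    case DW_FORM_udata: case DW_FORM_sdata:
      len = 0;
      while (len < avail && (p[len] & 0x80) != 0)
	++len;
      if (len == avail)
	return BAD_LEN;		/* Ran out of data mid-value.  */
      ++len;			/* Count the terminating byte.  */
      break;

    default:
      return BAD_LEN;
    }

  return len <= avail ? len : BAD_LEN;
}

/* Hypothetical helper: step over NFORMS attribute values and return the
   address just past them, or NULL on unsupported or truncated input.  */
static const unsigned char *
skip_attrs (const unsigned int *forms, size_t nforms,
	    const unsigned char *p, const unsigned char *end)
{
  for (size_t i = 0; i < nforms; ++i)
    {
      size_t len = form_len (forms[i], p, end);
      if (len == BAD_LEN)
	return NULL;
      p += len;
    }
  return p;
}
```

Every sibling step has to repeat this walk for each intervening DIE unless the resulting length is remembered somewhere.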
Here's some summary information on how these patches change varlocs -k:
         libdw    varlocs   varlocs
         text     time      maxres
Base:    243072   84.42s    242360k
P1:      243296   74.91s    242356k
P2:      243184   70.61s    242360k
P3:      243600   56.75s    243588k
My timings are not statistically rigorous measurements, but it still
seems a clear win across the board. Other benchmarks I've tried, like
tests/allfcts and stap -l syscall.*, show similar improvement.
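As a quick check of the "nearly 1/3" figure against the varlocs times reported above, the relative saving can be computed as below. percent_saved is a hypothetical helper written only for this arithmetic.

```c
#include <assert.h>

/* Hypothetical helper: percentage of running time saved when going from
   BASE seconds to NOW seconds.  */
static double
percent_saved (double base, double now)
{
  assert (base > 0.0);
  return 100.0 * (base - now) / base;
}
```

Applied cumulatively to the reported times, percent_saved (84.42, 56.75) is about 32.8, which matches the "nearly 1/3" claim. Step by step, each row relative to the previous one, the savings are about 11.3% for P1, 5.7% for P2, and 19.6% for P3.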
Feedback is always appreciated.