Created attachment 14046 [details]
This directory contains everything needed to reproduce the problem.
The call tree output is incorrect for this example, which I ran on an Intel Skylake-based system. The code is parallelized with Pthreads, so the function "start_thread" should appear in the call tree. It does not, which looks like a stack unwinding issue.
This is the output I get:
Functions Call Tree. Metric: Attributed Total CPU Time
4.712 | +-driver_mxv
4.712 | +-mxv_core
0.050 | +-drand48
0.039 | +-erand48_r
0.014 | +-__drand48_iterate
I used gcc 10 without any optimization flags, but I also see the problem with optimization enabled, for example with -O.
On an older Intel Haswell-based system, I do see start_thread in the call tree.
The attachment has everything needed to reproduce the problem. The code is in the directory "src" and can be built with "make". I deliberately left my object files and the binary in place, along with the experiment directory. The run.sh script is what I used to show the problem, and sample output from it is in run.res.
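For readers without access to the attachment, the kind of program described here can be sketched roughly as below. This is not the attached source. Only the names driver_mxv and mxv_core are taken from the call tree above; work_t, run_mxv, the row partitioning, the self-check, and the use of driver_mxv as the thread entry point are assumptions made for illustration. On glibc, the function passed to pthread_create is invoked from start_thread, which is why start_thread would be expected as a parent of the thread's work in the call tree.

```c
/* Hypothetical sketch, NOT the code from the attachment. */
#define _XOPEN_SOURCE 700   /* for drand48 under strict -std= modes */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Work assigned to one thread (layout is an assumption). */
typedef struct {
    int row_start, row_end;   /* half-open row range [row_start, row_end) */
    int n;                    /* number of columns */
    const double *A;          /* m x n matrix, row-major */
    const double *b;          /* input vector, length n */
    double *c;                /* output vector, length m */
} work_t;

/* c[i] = sum_j A[i][j] * b[j] for the rows in the given range. */
static void mxv_core(const work_t *w)
{
    for (int i = w->row_start; i < w->row_end; i++) {
        double sum = 0.0;
        for (int j = 0; j < w->n; j++)
            sum += w->A[(size_t)i * w->n + j] * w->b[j];
        w->c[i] = sum;
    }
}

/* Thread entry point handed to pthread_create. */
static void *driver_mxv(void *arg)
{
    mxv_core((const work_t *)arg);
    return NULL;
}

/* Run an m x n matrix-vector product on nthreads threads and compare it
   with a single-threaded run of the same kernel. Returns the number of
   mismatching rows (0 if all match), or -1 on allocation/thread failure. */
int run_mxv(int nthreads, int m, int n)
{
    assert(nthreads > 0 && m > 0 && n > 0);

    int bad = -1, started = 0;
    int chunk = (m + nthreads - 1) / nthreads;   /* rows per thread, rounded up */
    double *A = malloc((size_t)m * n * sizeof *A);
    double *b = malloc((size_t)n * sizeof *b);
    double *c = malloc((size_t)m * sizeof *c);
    double *ref = malloc((size_t)m * sizeof *ref);
    pthread_t *tid = malloc((size_t)nthreads * sizeof *tid);
    work_t *work = malloc((size_t)nthreads * sizeof *work);
    if (!A || !b || !c || !ref || !tid || !work)
        goto done;

    /* Fill the inputs with pseudo-random values in the main thread. */
    for (size_t k = 0; k < (size_t)m * n; k++)
        A[k] = drand48();
    for (int j = 0; j < n; j++)
        b[j] = drand48();

    /* Give each thread a contiguous block of rows (possibly empty). */
    for (int t = 0; t < nthreads; t++) {
        int start = t * chunk < m ? t * chunk : m;
        int end = start + chunk < m ? start + chunk : m;
        work[t] = (work_t){ start, end, n, A, b, c };
        if (pthread_create(&tid[t], NULL, driver_mxv, &work[t]) != 0)
            break;
        started++;
    }
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);
    if (started < nthreads)
        goto done;

    /* Single-threaded reference using the same kernel. */
    work_t all = { 0, m, n, A, b, ref };
    mxv_core(&all);
    bad = 0;
    for (int i = 0; i < m; i++)
        if (c[i] != ref[i])
            bad++;

done:
    free(A); free(b); free(c); free(ref); free(tid); free(work);
    return bad;
}
```

Calling run_mxv(4, 2000, 2000) from main and building with "gcc -pthread" gives a small multithreaded workload of this shape. The matrix sizes and thread count are arbitrary and may need to be increased for the profile to collect enough samples.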