The gprofng Application Profiling Tool
Work in progress
1. What is gprofng?
Gprofng is a next-generation application profiling tool. It supports profiling programs written in C, C++, Java, or Scala, running on systems that use processors from Intel, AMD, or Arm. The extent of the support is processor dependent, but the basic views are always available.
Two distinct steps are needed to produce a profile. In the first step, the performance data is collected. This information is stored in a directory called the experiment directory. In the second step, one of several tools is used to display and analyze the information stored in this directory.
2. The main features of gprofng
These are some of the features:
- Existing executables can be profiled, even if source code is not available.
- Show where the time is spent at the function, source code and instruction level.
- The controls that define the display(s) can be specified on the command line, and/or in a file.
- The user can select the performance metrics to be displayed.
- Transparent support for multithreaded applications parallelized using Pthreads, OpenMP, or Java Threads.
- The results can be viewed while the profile job is still running.
- Multiple experiments may be loaded simultaneously. By default, the data is aggregated.
- When two or more experiments are loaded, the data may also be compared instead of aggregated.
- On selected processors, hardware event counters (e.g. cache misses) can be profiled.
- Filters allow the user to select one or more threads, a window in time, and/or call stacks.
3. The gprofng tools
Several tools, or commands, are part of the gprofng suite. All gprofng tools start with the gprofng keyword, followed by an action, optional additional keywords and the executable command to be profiled.
For example, this is how to collect performance data on an executable called a.out that takes the -p demo option:
$ gprofng collect app ./a.out -p demo
This is a one-line summary of the functionality supported:
- gprofng collect app: collect the performance data
- gprofng display text: display the performance results in ASCII format
- gprofng display html: generate an html structure to be navigated in a browser
- gprofng display src: display the source code interleaved with the instructions
- gprofng archive: archive an experiment directory
4. A first set of examples
4.1. About the example program
This program is written in C and uses Pthreads to parallelize a matrix-vector algorithm, multiplying an m x n matrix with a vector of length n. The program supports the -m and -n options to set the matrix sizes. The -t option sets the number of threads and -h prints a usage overview. The algorithm is executed repeatedly in order to increase the timings. This repeat count can be adjusted using the -r option. The -v option enables verbose mode. All options have defaults, so the program can be executed without any options.
This is an example of how to run the program. A 5000 x 4000 matrix is used and the thread count is set to 2. The program prints a single line with the execution status, the matrix sizes, and the number of threads used:
$ ./mxv-pthreads -m 5000 -n 4000 -t 2
mxv: error check passed - rows = 5000 columns = 4000 threads = 2
$
Clearly, everything went well and the next step is to make a profile of this job.
4.2. How to get a basic profile
We now want to make a profile in order to see where the time was spent.
The only thing that needs to be done in order to collect the performance information is to run the same job, but now under control of gprofng. To this end, we use the gprofng collect app command:
$ gprofng collect app ./mxv-pthreads -m 5000 -n 4000 -t 2
Creating experiment directory test.1.er (Process ID: 2607338) ...
mxv: error check passed - rows = 5000 columns = 4000 threads = 2
The second line shown above is printed by gprofng. It tells us that an experiment directory with the name test.1.er has been created. It also shows the process id. This line is followed by the program output.
The experiment directory is a regular Linux directory and contains the information that was generated while the program was running. The name test.1.er is a default name. On a subsequent profiling run, a directory called test.2.er will be created, and so on.
In the second step, we can extract the information stored in the experiment directory using the gprofng display text command. If invoked without additional commands, the interpreter is launched and the user can issue commands within this interactive environment:
$ gprofng display text test.1.er
Warning: History and command editing is not supported on this system.
(gp-display-text) quit
$
Although it is perfectly valid to use this feature, in practice it is probably easier to add one or more additional commands, or as we shall see later, put the commands in a script file.
This is an example of how to extract information from the experiment directory. The -functions command displays a table with the functions executed and the values for the metrics. By default, the total CPU time is shown:
$ gprofng display text -functions test.1.er
Functions sorted by metric: Exclusive Total CPU Time

Excl. Total   Incl. Total    Name
CPU           CPU
 sec.      %   sec.      %
5.554 100.00  5.554 100.00   <Total>
5.274  94.95  5.274  94.95   mxv_core
0.140   2.52  0.270   4.86   init_data
0.090   1.62  0.110   1.98   erand48_r
0.020   0.36  0.020   0.36   __drand48_iterate
0.020   0.36  0.130   2.34   drand48
0.010   0.18  0.010   0.18   _int_malloc
0.        0.  0.280   5.05   __libc_start_main
0.        0.  0.010   0.18   allocate_data
0.        0.  5.274  94.95   collector_root
0.        0.  5.274  94.95   driver_mxv
0.        0.  0.280   5.05   main
0.        0.  0.010   0.18   malloc
0.        0.  5.274  94.95   start_thread
We see 5 columns. In the first and second column, the exclusive total CPU time is shown, both as a number and a percentage. The next two columns contain the values for the inclusive total CPU time. The function names are in the last column.
The first line in the table shows the values for <Total>. This is a pseudo function, generated by gprofng. The total value(s) for the metric(s) are shown here. The percentages are relative to this number.
Note that the very first line above shows the metric used to sort the data. By default, the metric displayed in the first column is used in the sort, but this is one of the things that can be modified.
4.3. A first example of customization
While the example shown above works perfectly fine, some customization is desired. For example, the default name for the experiment directory is not very descriptive.
There are two options to do this. Regardless of the choice, the name of an experiment directory has to end with .er. With the -o option, the name of the experiment directory can be defined, but an existing directory with the same name will not be overwritten. This is a safety net to prevent the accidental removal of experiment data. The -O option behaves differently: an existing experiment directory with the same name is silently overwritten, and a new directory with the specified name is created before the data collection starts.
In our case, we prefer to use the -O option to give the experiment a more meaningful name.
Before we show the results of this run, there is another thing we want to change. This is a short running job. Each call to the matrix-vector function is very short and the default sampling rate may be too coarse grained. This is why we use the -p option to change the sampling rate and we set it to high.
This is the command that uses -O mxv.2.thr.er to specify the experiment directory name and increases the sampling rate through the -p high option:
$ gprofng collect app -O mxv.2.thr.er -p high ./mxv-pthreads -m 5000 -n 4000 -t 2
Creating experiment directory mxv.2.thr.er (Process ID: 2607361) ...
mxv: error check passed - rows = 5000 columns = 4000 threads = 2
The output confirms that the experiment directory is indeed set to mxv.2.thr.er.
The next step is to use the gprofng display text tool to get the profile on the screen. Also here we would like to customize the settings and use the -limit 10 command to restrict the output of the function overview to 10 lines:
$ gprofng display text -limit 10 -functions mxv.2.thr.er
Print limit set to 10
Functions sorted by metric: Exclusive Total CPU Time

Excl. Total   Incl. Total    Name
CPU           CPU
 sec.      %   sec.      %
5.775 100.00  5.775 100.00   <Total>
5.494  95.15  5.494  95.15   mxv_core
0.126   2.18  0.267   4.63   init_data
0.068   1.17  0.104   1.80   erand48_r
0.038   0.66  0.142   2.45   drand48
0.036   0.62  0.036   0.62   __drand48_iterate
0.008   0.14  0.013   0.22   _int_malloc
0.003   0.05  0.005   0.09   sysmalloc
0.002   0.03  0.002   0.03   brk
0.        0.  0.002   0.03   __default_morecore
The output confirms that the limit has been set to 10. Note that the pseudo function <Total> counts as a function too and therefore the statistics for 9 of the functions that were executed during this run are shown.
We would like to conclude this section with some additional customization. By default, the exclusive times are shown first and the name of the function comes last. The next example shows how to display the function name first, followed by the inclusive and exclusive metrics, sorted by the exclusive total CPU time.
First, we need to know the names of the metrics that are available. Not all of these are necessarily shown by default.
The -metric_list command shows the metrics that have been recorded and are available to be displayed. The current selection and the metric used to sort the data is shown as well:
$ gprofng display text -metric_list mxv.2.thr.er
Current metrics: e.%totalcpu:i.%totalcpu:name
Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
Available metrics:
   Exclusive Total CPU Time: e.%totalcpu
   Inclusive Total CPU Time: i.%totalcpu
                       Size: size
                 PC Address: address
                       Name: name
To change the default metrics, the -metrics command needs to be used. For our purpose we set it to the following:
-metrics name:i.%totalcpu:e.%totalcpu
With this definition, for each function executed, the name, plus the inclusive and exclusive total CPU times, as well as the percentages, are shown. Putting this all together, we get the following command to display a table with the metrics and order desired:
$ gprofng display text -metrics name:i.%totalcpu:e.%totalcpu -limit 10 -functions mxv.2.thr.er
Current metrics: name:i.%totalcpu:e.%totalcpu
Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
Print limit set to 10
Functions sorted by metric: Exclusive Total CPU Time

Name                Incl. Total   Excl. Total
                    CPU           CPU
                     sec.      %   sec.      %
<Total>             5.775 100.00  5.775 100.00
mxv_core            5.494  95.15  5.494  95.15
init_data           0.267   4.63  0.126   2.18
erand48_r           0.104   1.80  0.068   1.17
drand48             0.142   2.45  0.038   0.66
__drand48_iterate   0.036   0.62  0.036   0.62
_int_malloc         0.013   0.22  0.008   0.14
sysmalloc           0.005   0.09  0.003   0.05
brk                 0.002   0.03  0.002   0.03
__default_morecore  0.002   0.03  0.        0.
When using the commands that control the information displayed, be aware that they are order dependent. For example, the -limit command needs to be used before the -functions command. Otherwise, the limit will not have any effect.
The -sort command overrides the default metric used to sort the data. Any of the metrics recorded in the experiment can be used for this. For example, the following command sorts the function overview by name:
$ gprofng display text -metrics name:i.%totalcpu:e.%totalcpu -sort name -limit 10 -functions mxv.2.thr.er
Current metrics: name:i.%totalcpu:e.%totalcpu
Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
Current Sort Metric: Name ( name )
Print limit set to 10
Functions sorted by metric: Name

Name                Incl. Total   Excl. Total
                    CPU           CPU
                     sec.      %   sec.      %
<Total>             5.775 100.00  5.775 100.00
__default_morecore  0.002   0.03  0.        0.
__drand48_iterate   0.036   0.62  0.036   0.62
__libc_start_main   0.280   4.85  0.        0.
_int_malloc         0.013   0.22  0.008   0.14
allocate_data       0.013   0.22  0.        0.
brk                 0.002   0.03  0.002   0.03
collector_root      5.494  95.15  0.        0.
drand48             0.142   2.45  0.038   0.66
driver_mxv          5.494  95.15  0.        0.
5. Display source code and assembly listings
So far we have focused on the function level overview, but gprofng can go deeper than that and show the performance information at the source and disassembly level.
Source code listings are shown using the -source command. The listing is annotated with the performance metrics on a line by line basis.
Below is an example that shows how to get the source level information for function mxv_core. Since the source code of this function is part of a larger source file, we only show the relevant part. The full output contains the information for all source lines executed.
We define the metric to be the exclusive total CPU time (more on this choice in section Known limitations).
$ gprofng display text -metrics e.totalcpu -source mxv_core mxv.2.thr.er
Current metrics: e.totalcpu:name
Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu )
Source file: /home/demo/mxv-pthreads/src/mxv.c
Object file: mxv-pthreads (found as mxv.2.thr.er/archives/mxv-pthreads_OttVjvQz3cb)
Load Object: mxv-pthreads (found as mxv.2.thr.er/archives/mxv-pthreads_OttVjvQz3cb)

   Excl. Total
   CPU sec.
                 <lines omitted>
                     <Function: mxv_core>
   0.            35. void __attribute__ ((noinline)) mxv_core (int64_t row_index_start,
                 36.                                           int64_t row_index_end,
                 37.                                           int64_t m, int64_t n,
                 38.                                           double **restrict A,
                 39.                                           double *restrict b,
                 40.                                           double *restrict c)
   0.            41. {
   0.            42.   for (int64_t i=row_index_start; i<=row_index_end; i++)
                 43.     {
   0.            44.       double row_sum = 0.0;
## 3.018         45.       for (int64_t j=0; j<n; j++)
## 2.477         46.         row_sum += A[i][j] * b[j];
   0.            47.       c[i] = row_sum;
                 48.     }
   0.            49. }
The output echoes the settings for the metric(s) selected and the metric used to sort the data. There is also information about where the object code comes from, followed by the annotated source lines.
From this, it is clear that the time is spent in the for-loop spanning lines 45-46.
Similar information can be obtained at the instruction/disassembly level by using the -disasm command. This displays the instructions with the source code lines interleaved. Each instruction shown here has a [<n>] marker with the source line number the instruction originates from. If a line number is not known, a question mark (?) is shown instead. This may happen in the case of (external) library functions, or if the source has not been compiled with the -g option.
$ gprofng display text -metrics e.totalcpu -disasm mxv_core mxv.2.thr.er
Current metrics: e.totalcpu:name
Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu )
Source file: /home/demo/mxv-pthreads/src/mxv.c
Object file: mxv-pthreads (found as mxv.2.thr.er/archives/mxv-pthreads_OttVjvQz3cb)
Load Object: mxv-pthreads (found as mxv.2.thr.er/archives/mxv-pthreads_OttVjvQz3cb)

   Excl. Total
   CPU sec.
                 <lines omitted>
                 35. void __attribute__ ((noinline)) mxv_core (int64_t row_index_start,
                 36.                                           int64_t row_index_end,
                 37.                                           int64_t m, int64_t n,
                 38.                                           double **restrict A,
                 39.                                           double *restrict b,
                 40.                                           double *restrict c)
                 41. {
                     <Function: mxv_core>
   0.       [41] 40223b:  mov    0x8(%rsp),%r10
                 42.   for (int64_t i=row_index_start; i<=row_index_end; i++)
   0.       [42] 402240:  cmp    %rsi,%rdi
   0.       [42] 402243:  jg     0x47
   0.       [42] 402245:  add    $0x1,%rsi
   0.       [42] 402249:  jmp    0x36
                 43.     {
                 44.       double row_sum = 0.0;
                 45.       for (int64_t j=0; j<n; j++)
                 46.         row_sum += A[i][j] * b[j];
   0.       [46] 40224b:  mov    (%r8,%rdi,8),%rdx
   0.       [45] 40224f:  mov    $0x0,%eax
   0.       [44] 402254:  pxor   %xmm1,%xmm1
   0.012    [46] 402258:  movsd  (%rdx,%rax,8),%xmm0
   0.890    [46] 40225d:  mulsd  (%r9,%rax,8),%xmm0
   1.574    [46] 402263:  addsd  %xmm0,%xmm1
## 2.997    [45] 402267:  add    $0x1,%rax
   0.021    [45] 40226b:  cmp    %rax,%rcx
   0.       [45] 40226e:  jne    0xffffffffffffffea
                 47.       c[i] = row_sum;
   0.       [47] 402270:  movsd  %xmm1,(%r10,%rdi,8)
   0.       [42] 402276:  add    $0x1,%rdi
   0.       [42] 40227a:  cmp    %rsi,%rdi
   0.       [42] 40227d:  je     0xd
   0.       [44] 40227f:  pxor   %xmm1,%xmm1
   0.       [45] 402283:  test   %rcx,%rcx
   0.       [45] 402286:  jg     0xffffffffffffffc5
   0.       [45] 402288:  jmp    0xffffffffffffffe8
                 48.     }
                 49. }
   0.       [49] 40228a:  ret
$
This listing shows which instructions contribute most to the total CPU time.
6. Scripting
By now, the command to get a customized profile on the screen has become quite lengthy. This is where scripting comes in to make life easier. In the context of gprofng, scripting is the feature to include commands in a file, the script file, and have these executed through the -script <name-of-scriptfile> command when executing gprofng display text.
While the word scripting is somewhat of a misnomer since there are no control flow structures, the feature is very useful nonetheless.
The script file contains the commands to be executed. The names of the commands are the same, but the hyphen (-) is omitted. Comment lines are those lines starting with the # symbol.
For example, to get a table with the inclusive and exclusive total CPU times plus percentages, limited to the first 10 lines, the following script file may be used:
# Command to define the metrics
$ cat myscript
# Command to define the metrics
metrics name:i.%totalcpu:e.%totalcpu
# Limit the views to 10 lines
limit 10
# Display the function overview
functions
$
This file is then used in the following way:
$ gprofng display text -script myscript mxv.2.thr.er
# Command to define the metrics
Current metrics: name:i.%totalcpu:e.%totalcpu
Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
# Limit the views to 10 lines
Print limit set to 10
# Display the function overview
Functions sorted by metric: Exclusive Total CPU Time

Name                Incl. Total   Excl. Total
                    CPU           CPU
                     sec.      %   sec.      %
<Total>             5.775 100.00  5.775 100.00
mxv_core            5.494  95.15  5.494  95.15
init_data           0.267   4.63  0.126   2.18
erand48_r           0.104   1.80  0.068   1.17
drand48             0.142   2.45  0.038   0.66
__drand48_iterate   0.036   0.62  0.036   0.62
_int_malloc         0.013   0.22  0.008   0.14
sysmalloc           0.005   0.09  0.003   0.05
brk                 0.002   0.03  0.002   0.03
__default_morecore  0.002   0.03  0.        0.
The comment lines from the script file are printed and interleaved with the output from gprofng.
Note that in addition to the script file, commands may still be used when executing gprofng display text. They are processed in the order of appearance.
The scripting feature makes it easier to generate multiple tables and to further customize the output.
7. Support for multithreading
Nothing special needs to be done to generate a profile for a multithreaded application parallelized with POSIX Threads (or "Pthreads" for short), OpenMP, or Java Threads. As we did with the example, simply execute the application as usual, but under control of gprofng collect app. Through filters, customized tables may be created from the experiment directory.
Using our experiment directory, we can find out how many threads have been used, and get a distribution of the metrics across those threads.
The -thread_list command displays the number of threads used for each experiment specified. In our case, there is only one experiment. The -threads command shows the metrics for each of the threads that have been executed. The command below shows how to obtain this information:
$ gprofng display text -thread_list -threads mxv.2.thr.er
Exp Sel Total
=== === =====
  1 all     3
Objects sorted by metric: Exclusive Total CPU Time

Excl. Total    Name
CPU
 sec.      %
5.775 100.00   <Total>
2.773  48.01   Process 1, Thread 3
2.722  47.13   Process 1, Thread 2
0.280   4.85   Process 1, Thread 1
The results confirm that only one experiment was loaded and show that 3, not 2, threads have actually been executed. This is by design of the code. The driver thread handles the logistics and initiates two computational threads. Clearly, this driver thread takes less than 5% of the time. Around 47-48% of the time is taken by the other two threads.
Now that we know how many threads there are, we can zoom in on the behavior of each individual thread. This is done through the -thread_select command. It takes a list of thread IDs and aggregates the results across those threads. In this example, we want to break down the metrics for the individual threads. For layout purposes, the views are limited to 5 lines.
This is the profile for the driver thread with id 1:
$ gprofng display text -limit 5 -thread_select 1 -functions mxv.2.thr.er
Print limit set to 5
Exp Sel Total
=== === =====
  1   1     3
Functions sorted by metric: Exclusive Total CPU Time

Excl. Total   Incl. Total    Name
CPU           CPU
 sec.      %   sec.      %
0.280 100.00  0.280 100.00   <Total>
0.126  44.84  0.267  95.37   init_data
0.068  24.20  0.104  37.01   erand48_r
0.038  13.52  0.142  50.53   drand48
0.036  12.81  0.036  12.81   __drand48_iterate
The first table confirms that thread 1 has been selected and that there are three threads in total.
What follows next is the function view for this particular thread. With over 44%, function init_data dominates the performance of the driver. The other 3 functions are related to the random number generator and collectively take over 50% of the total time.
Note that the inclusive time for init_data is over twice the exclusive time and with 95% this part of the call tree is largely responsible for the time spent in the driver.
The per-thread performance for the other 2 threads is shown below:
$ gprofng display text -limit 5 -thread_select 2 -functions mxv.2.thr.er
Print limit set to 5
Exp Sel Total
=== === =====
  1   2     3
Functions sorted by metric: Exclusive Total CPU Time

Excl. Total   Incl. Total    Name
CPU           CPU
 sec.      %   sec.      %
2.722 100.00  2.722 100.00   <Total>
2.722 100.00  2.722 100.00   mxv_core
0.        0.  2.722 100.00   collector_root
0.        0.  2.722 100.00   driver_mxv
0.        0.  2.722 100.00   start_thread

$ gprofng display text -limit 5 -thread_select 3 -functions mxv.2.thr.er
Print limit set to 5
Exp Sel Total
=== === =====
  1   3     3
Functions sorted by metric: Exclusive Total CPU Time

Excl. Total   Incl. Total    Name
CPU           CPU
 sec.      %   sec.      %
2.773 100.00  2.773 100.00   <Total>
2.773 100.00  2.773 100.00   mxv_core
0.        0.  2.773 100.00   collector_root
0.        0.  2.773 100.00   driver_mxv
0.        0.  2.773 100.00   start_thread
We already knew that both of these threads spend a nearly equal amount of CPU time. The two tables show that both threads spend all of this time in function mxv_core.
8. Hardware event counters
8.1. What are hardware event counters?
Many modern processors, or cores, provide support for hardware event counters. These counters measure certain activities, or "events", within the core. Examples of such events are the number of instructions executed, or the number of level 1 data cache misses.
The events to be monitored are mapped onto the physical counters. The measured value that is returned is either a dimensionless count of how often the particular event occurred, or a number of CPU cycles. If the clock frequency is known, a cycle count can be converted to a time. (1)
8.2. How to select the events to be monitored
The number of counters, possible restrictions on the mapping, and the type of events that can be monitored all depend on the core. A description of the events can typically be found in the architecture reference manual for the core. If gprofng supports the events for a core, the -h option on the gprofng collect app command shows, among other information, a list of the supported events. If the events are not supported on the target core, a message to this effect is printed.
This is an example of the first part of the output shown with the -h option:
$ gprofng collect app -h
Run "gprofng collect app --help" for a usage message.
Specifying HW counters on `Intel Arch PerfMon v2 on Family 6 Model 63' (cpuver=2499):
    -h {auto|lo|on|hi}    turn on default set of HW counters at the specified rate
    -h <ctr_def> [-h <ctr_def>]...
    -h <ctr_def>[,<ctr_def>]...
        specify HW counter profiling for up to 4 HW counters
    <ctr_def> == <ctr>[[~<attr>=<val>]...],[<rate>]
... <additional text not displayed here> ...
The first line informs the user how to get the general usage information, followed by the details of the processor/core this command has been executed on.
The next set of lines describes how to specify the events to be monitored.
If the -h option is used with a keyword from the set {auto,on,lo,hi} instead of a list of events, gprofng selects a default set of events. This default depends on the specifics of the core, and the actual settings are shown as part of the output obtained with the -h option.
The keywords in this set define the sampling granularity to be used:
- auto - This is the default and matches the rate used by clock profiling. If clock profiling is disabled, on is used.
- lo - The per-thread maximum sampling rate is ~10 samples/second.
- on - The per-thread maximum sampling rate is ~100 samples/second.
- hi - The per-thread maximum sampling rate is ~1000 samples/second.
With this information alone, a hardware event counter experiment can be conducted by relying on the defaults. First, we generate the data by using the -h on option on the data collection command:
$ gprofng collect app -h on -O mxv.hwc.2.thr.er ./mxv-pthreads -m 5000 -n 4000 -t 2
Creating experiment directory mxv.hwc.2.thr.er (Process ID: 2483756) ...
mxv: error check passed - rows = 5000 columns = 4000 threads = 2
$
As before, the information is displayed using the gprofng display text tool. Note that we do not have to change anything in order to see the counter values:
$ gprofng display text -limit 5 -functions mxv.hwc.2.thr.er
Print limit set to 5
Functions sorted by metric: Exclusive Total CPU Time

Excl. Total   Incl. Total   Excl. CPU     Excl. Instructions   Excl. Last-Level   Name
CPU           CPU           Cycles        Executed             Cache Misses
 sec.      %   sec.      %   sec.      %                    %                 %
5.934 100.00  5.934 100.00  6.768 100.00  25736055837 100.00   56166696 100.00   <Total>
5.594  94.27  5.594  94.27  6.530  96.49  24199574850  94.03   55966494  99.64   mxv_core
0.180   3.04  0.330   5.56  0.042   0.62    192060028   0.75     200202   0.36   init_data
0.060   1.01  0.100   1.69  0.084   1.24    416130412   1.62          0   0.     erand48_r
0.050   0.84  0.150   2.53  0.042   0.62    288090158   1.12          0   0.     drand48
$
We see that on this core, three events (CPU cycles, instructions executed and last level cache misses) are monitored by default. The values and percentages are shown in columns 5-10.
The first four columns contain the information from clock profiling. This is enabled by default, but can be disabled with the -p off option.
For reference, this is the part of the -h output that lists the default set of counters and the aliases for the most useful counters on this processor:
Default set of HW counters:
    -h cycles,,insts,,llm
Aliases for most useful HW counters:
    alias    raw name                    type        units  description
    cycles   unhalted-core-cycles        CPU-cycles         CPU Cycles
    insts    instruction-retired         events             Instructions Executed
    llm      llc-misses                  events             Last-Level Cache Misses
    br_msp   branch-misses-retired       events             Branch Mispredict
    br_ins   branch-instruction-retired  events             Branch Instructions
Raw HW counters:
    name                        type        units  description
    unhalted-core-cycles        CPU-cycles
    unhalted-reference-cycles   events
    instruction-retired         events
    llc-reference               events
    llc-misses                  events
... <additional events removed> ...
9. How does the data collection work?
In the data collection phase, statistical call stack sampling is used.
The target application is executed under control of gprofng, and at regular intervals (10 ms by default) the program is stopped. At that point, gprofng gathers information and stores it in the experiment directory. For example, the call stack is recorded using a stack unwind algorithm. The application then resumes execution until it is stopped again. Thanks to this approach, gprofng can profile essentially any executable and does not need access to the source code to collect the information.
Due to the statistical nature of this process, it is natural to see small differences across seemingly identical profiling jobs. In practice, this is not something to worry about: the variations are typically below 5%. There is something else to pay attention to, though. A relatively short-running function may not be represented well because the sampling granularity is too coarse. In this case, the -p option can be used to increase the sampling rate. Through the same option, the sampling rate can also be decreased.
10. Tips and tricks
Work in progress, very crude material still
10.1. Build gprofng for 32 bit profiling
TBD
11. Frequently Asked Questions (FAQ)
Can I profile my application using gprofng?
- Applications written in C or C++, Java, or Scala are supported. Any executable in the ELF (Executable and Linkable Format) object format can be used for profiling with gprofng. If debug information is available, gprofng can provide more details, but this is not a requirement. Thanks to the preload mechanism used by gprofng and the sampling strategy, the application does not need to be recompiled, nor linked against special libraries. All the views are available, but if you would like to get source line level information, the compiler needs to include that. For example with the gcc compiler, the -g option does this.
How about profiling Fortran programs?
- Profiling Fortran 77 applications is supported, as long as they are in ELF format. For more recent versions of the Fortran standard, some views may not be available, but the function and disassembly views should work. We are currently investigating the support in more detail and will update this page with information.
How does gprofng handle the stack unwind?
- The basic issue with x86 stack unwind is that gprofng doesn't know if it can trust the frame-pointer register. So it walks instructions to find a return instruction, at which point it knows the return address is on the top of the stack, etc.
This is a challenge in the case of the jmp *(reg) instruction, where it is expected to jump to the (unknown to gprofng) contents of a register.
The "jmp_reg" code attempts to keep track of the context for such a jump, deferring any handling of such a difficult case. Gprofng continues with other contexts, hoping that some other walk will take it to a return instruction. If no other walk helps, it returns to "jmp_reg" contexts.
While gprofng doesn't know the jump target, it is possible that the bytes immediately following the jmp_reg instruction represent one possible target, as might be the case when a switch statement is compiled.
Unfortunately, the bytes following a jmp_reg instruction might instead be a jump target from somewhere else, because execution might never fall through from the preceding jmp_reg. Those bytes might not even be instructions at all. There are many uses of jmp_reg instructions beyond just compiling switch statements. If the above approach fails and a return instruction is not found, gprofng tries to use the frame-pointer register.
On Arm/aarch64 dladdr() from libdl.so is used. We're looking at Paul Brook's implementation (https://github.com/gcc-mirror/gcc/blob/master/libgcc/config/arm/unwind-arm.c).
Note that all the code details can be found on the https://sourceware.org/binutils/ web site. The source code for the stack unwind is in file binutils-gdb.git/gprofng/libcollector/unwind.c.
Does gprofng work with the MUSL libc library?
- Gprofng depends on common libraries like libc, libpthread, and libdl.
Since the MUSL libc implementation does not provide the getcontext() function, gprofng does not work with this library. In particular, the data collection tool gp-collect-app will not work due to this missing function.
I get error messages from the semaphoring system. Should I be concerned?
- It may happen that during the data collection process, gprofng is not able to interrupt the program's execution. This may happen because the program is in a (system) call that does not allow interrupts. This is an example of the error returned:
DDI Process 1: semop return an error performing 1 operation(s) on semid 196610. semop errno=EINTR.
DDI Process 0: semop return an error performing 1 operation(s) on semid 196610. semop errno=EINT
This happens because the application executes a call to the semop() function while gprofng tries to send a SIGPROF signal. The EINTR error code is returned when a process receives a signal while it is blocked in a blocking system call, and that signal has a handler configured to interrupt rather than restart system calls. The interrupted call means no sample is taken at that point, but other than the dropped sample, the profiling results should be fine.
12. Known Limitations
- The source and disassembly listings do not show the percentages correctly. They are all zero. This is a known problem that will be addressed in the future.
- The function collector_root may show up in the function view, but also in other places. This is a function internal to gprofng and it should not consume any resources. In a future update we plan to hide this function from the views, since it could be confusing to see it.
(1) Be aware that this conversion may not be reliable if the processor frequency varies during the profile run.