Tracing the Malloc API
The malloc enhancements on the dj/malloc branch include tracing of the malloc API, built on a new whole-system benchmarking trace infrastructure created specifically for the glibc malloc subsystem. Similar instructions are available on the dj/malloc branch under malloc/README.copr.html.
Introduction
One key new feature in this malloc is a high-speed trace buffer that records every call to malloc, free, and related functions with a minimum of added latency. For performance-critical applications, this is an improvement over the existing trace feature.
Capturing to the trace buffer
The code for malloc tracing is on the dj/malloc branch and is available when you build this glibc with USE_MTRACE defined to 1 in malloc/malloc.c, e.g. #define USE_MTRACE 1.
A build with tracing enabled has all the trace code compiled in, but tracing itself starts out disabled. To turn tracing on, preload libmtracectl.so, which starts tracing in its constructor and stops tracing in its destructor.
$ LD_PRELOAD=/lib64/libmtracectl.so ls
As a preloaded library, it is constructed first and destructed last. You will miss some malloc calls made early in the dynamic loader, but otherwise you get coverage of all the application's malloc use.
Tracing writes a single trace file. Its name can be set with the environment variable MTRACE_CTL_FILE, and defaults to /tmp/mtrace.out.$PID.
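The two settings above can be combined in one small wrapper. This is only a sketch: trace_cmd and MTRACE_LIB are hypothetical names introduced here for illustration, and the library path is the one used in the example above, which may differ on your install.

```shell
# Hypothetical helper (not part of glibc): run a command with malloc tracing on.
# MTRACE_LIB is an assumed variable name; the default is the path shown above.
MTRACE_LIB="${MTRACE_LIB:-/lib64/libmtracectl.so}"

trace_cmd() {
    # $1 = trace output file, remaining arguments = command to run
    out="$1"
    shift
    if [ ! -r "$MTRACE_LIB" ]; then
        echo "trace_cmd: $MTRACE_LIB not found or not readable" >&2
        return 1
    fi
    # Both variables are set only in the environment of the traced command.
    MTRACE_CTL_FILE="$out" LD_PRELOAD="$MTRACE_LIB" "$@"
}
```

For example, trace_cmd /tmp/ls.trace ls -l would run ls -l with tracing enabled and request /tmp/ls.trace as the trace file.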
Workload Simulator
This build also includes a set of tools to "play back" a recorded trace, which can be helpful in diagnosing memory-related performance issues.
$ cd src/glibc/malloc
$ gcc -o trace2wl trace2wl.c
Such workloads might be locally generated as part of a benchmark suite, for example.
$ trace2wl <outfile> [<infile>]
If an infile is not provided, input is read from stdin.
$ trace2wl /tmp/ls.wl /tmp/mtrace-22172.out
The resulting file is a "workload": a data file that tells the simulator how to play back all the malloc, free, and related calls. This file is not human-readable; it is a compact binary data file intended to be used only by the simulator.
$ trace_run workload.wl
Note: trace_run only works on Intel processors with the RDTSCP opcode, which is only available on reasonably modern processors. To see if your processor supports this opcode, look for the rdtscp CPU flag:
$ grep rdtscp /proc/cpuinfo
If the grep prints lines beginning with "flags : ", your processor supports the opcode and trace_run will work. If the grep prints nothing, it does not.
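The same check can be scripted so that it gives a yes/no answer instead of a page of flags. A minimal sketch, where has_rdtscp is a hypothetical helper name and the optional file argument exists only to make the function easy to test:

```shell
# Hypothetical helper: succeed if a cpuinfo-style file lists the rdtscp flag.
# -q suppresses output, -w matches rdtscp as a whole word only.
has_rdtscp() {
    grep -qw rdtscp "${1:-/proc/cpuinfo}" 2>/dev/null
}

if has_rdtscp; then
    echo "rdtscp found: trace_run should work"
else
    echo "rdtscp not found: trace_run will not work on this CPU"
fi
```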
Run the trace in the simulator:
$ trace_run /tmp/ls.wl
488,004 cycles
106 usec wall time
0 usec across 1 thread
0 Kb Max RSS (1,228 -> 1,228)
Avg malloc time: 385 in 154 calls
Avg calloc time: 0 in 1 calls
Avg realloc time: 0 in 1 calls
Avg free time: 194 in 14 calls
Total call time: 62,033 cycles
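As a quick sanity check on these numbers, the per-function averages can be multiplied by their call counts and summed. The sketch below assumes trace_run prints each statistic on its own line in the "Avg <function> time: <cycles> in <count> calls" form shown above; sum_call_cycles is a hypothetical helper name, not a tool shipped with this build.

```shell
# Hypothetical helper: read trace_run output on stdin and print the sum of
# (average cycles * number of calls) over all "Avg ... time:" lines.
sum_call_cycles() {
    awk '$1 == "Avg" && $3 == "time:" {
             gsub(/,/, "", $4)      # strip thousands separators from the average
             gsub(/,/, "", $6)      # and from the call count
             total += $4 * $6
         }
         END { print total + 0 }'
}
```

Used as trace_run /tmp/ls.wl | sum_call_cycles, the sample output above gives 385 * 154 + 194 * 14 = 62006, slightly below the reported total of 62,033 cycles. The small gap is presumably because the printed averages are whole numbers, though that is an assumption about how trace_run computes them.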