How gdb loads symbol files

The purpose of this page is to look at what happens when using the file command, and more precisely at the algorithms and data structures gdb uses to parse and store information about symbols. gdb supports a vast combination of platforms, architectures and debug info formats. This page focuses on what is probably the most common case: Linux on x86 with DWARF debug info.

The file command is essentially the exec-file command followed by the symbol-file command. exec-file does what is necessary to open the executable and sets it as the current executable for the selected inferior. symbol-file tells gdb to use this file as the reference for symbols and debug information. This includes translating a symbol (function or variable name) into an address, a line number into a code address or vice versa, etc. The symbol-file command makes gdb forget about any previously loaded symbols (FIXME: only for that inferior? for all of them?) and replaces them with those coming from the specified file.

objfile instantiation

The first thing gdb needs to do is to create an objfile object that represents the executable given to the file command (called mybinary in the rest of this section). For gdb, an objfile consists of one file from which it reads symbols or debug info. Therefore, mybinary itself will be represented by one objfile instance. If mybinary loads some shared libraries, gdb will read symbols from them as well, and will instantiate one objfile for each .so file. When the debug info lives in a separate file from the binary (common with the package managers of various distributions), one objfile will be instantiated for the executable file (gdb will still be able to read some symbols from it) and another one for the debug info file.

gdb has the concept of a main objfile. The main objfile is the one the user has specified with the symbol-file (or, transitively, file) command. There is only one main objfile per inferior at any time. Other objfiles, such as the ones associated with shared libraries, are not main objfiles.

When allocating the objfile, gdb starts by building a section table representing the sections of the executable. Then, the program space of the inferior is associated with the objfile and a flag is set in that objfile to indicate that new sections related to that program space have been loaded.

The created objfile is added to the linked list of all objfiles.
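The bookkeeping described in this section can be pictured with a small Python model. This is only a sketch of the idea: gdb itself is not written in Python, and every name below (ProgramSpace, Objfile, allocate_objfile, main_objfile, the libc.so.6 file name) is hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProgramSpace:
    """Hypothetical stand-in for the program space of an inferior."""
    name: str


@dataclass
class Objfile:
    """Hypothetical stand-in for one file that symbols are read from."""
    filename: str
    sections: List[str]
    pspace: ProgramSpace
    is_main: bool = False
    new_sections_loaded: bool = False


# Every objfile ends up in one global list (a linked list in the text above).
all_objfiles: List[Objfile] = []


def allocate_objfile(filename, section_names, pspace, is_main=False):
    # 1. Build the section table and associate the program space.
    objfile = Objfile(filename, list(section_names), pspace, is_main)
    # 2. Flag that new sections related to this program space were loaded.
    objfile.new_sections_loaded = True
    # 3. Keep a single main objfile for a given program space.
    if is_main:
        for other in all_objfiles:
            if other.pspace is pspace:
                other.is_main = False
    # 4. Register the objfile in the global list.
    all_objfiles.append(objfile)
    return objfile


def main_objfile(pspace) -> Optional[Objfile]:
    for objfile in all_objfiles:
        if objfile.pspace is pspace and objfile.is_main:
            return objfile
    return None
```

In this model, loading mybinary as the main objfile and then a shared library leaves two entries in all_objfiles, of which only the first is the main one.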

symbol reading

The next step is to start reading symbols from the objfile. By symbols, we mean the mapping between a textual label (such as a function or variable name) and the address where the object of interest is found. It corresponds to the content of the ELF symbol table, which you can see by running “readelf --syms” on your executable. Even if the executable does not contain debug info, it may include a symbol table if it has not been stripped. gdb calls these minimal symbols. As the name suggests, they are the very least metadata you can have about your program: "better than nothing". If you don’t have full debug info, minimal symbols will at least allow you to locate the start of a function in order to place a breakpoint there, or to guess which function a certain code address belongs to.

After minimal symbols are read, gdb will try to read more advanced information. In particular, it will probe for DWARF debug information and load it if present. This allows for a more “complete” debugging experience, providing an extensive mapping between source code and instructions. Since this info represents a lot of data, it is loaded in two steps. A first quick pass over the debug info builds partial symbol tables. These contain just enough information to know what can be found in the debug info. When some specific information is required, the corresponding part of the debug info is read again in more detail to build the full symbol table. Note that the user can force gdb to skip the partial symbol table step and immediately read the full debug symbols by using --readnow.
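The two-step scheme is a form of lazy loading. A minimal Python sketch of the idea follows. The PartialSymtab class here is a hypothetical model, not gdb's structure of the same name, and the read_full callback stands in for the expensive second pass over the debug info.

```python
class PartialSymtab:
    """Hypothetical model: knows which names exist, not their details."""

    def __init__(self, filename, names, read_full):
        self.filename = filename
        self.names = frozenset(names)
        self._read_full = read_full  # callback doing the expensive second pass
        self._full = None

    def expand(self):
        # Read the full symbol table only once, the first time it is needed.
        if self._full is None:
            self._full = self._read_full(self.filename)
        return self._full


def lookup_symbol(partial_symtabs, name):
    # The cheap partial tables tell us where to look. Only the matching
    # one is expanded into a full table.
    for pst in partial_symtabs:
        if name in pst.names:
            return pst.expand().get(name)
    return None
```

Looking up a name expands only the partial table that mentions it, and a second lookup reuses the already expanded table.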

The way to read symbols from an objfile depends very much on the format of the executable. Symbols are not read the same way from an ELF binary as they are from a COFF binary. To keep the main algorithm independent of the binary format, gdb loads a format-specific set of functions to perform the different symbol-related operations. One such set of functions is implemented for each executable format that has support for symbols. In our case, gdb finds that the executable is an ELF and saves in the objfile object a pointer to the ELF-specific symbol functions (elf_sym_fns).

First, it calls the sym_new_init (elf_new_init) callback, which reinitializes the data structures associated with the machinery related to symbol table building. This callback resets all the state of the symbol reader to make it ready to read a completely new set of objfiles. This is only done if the objfile we are reading is loaded as the main objfile.

Then, the sym_init callback is invoked. This one is also responsible for initializing some state of the symbol reader. The difference between this one and sym_new_init is that it is called for each and every new objfile loaded. In the case of the ELF version, it does not do any significant work.

The next callback invoked is sym_offsets. It is responsible for (TODO...).

Finally, the actual symbol reading can take place: the sym_read callback is used. The first step is to read minimal symbols.
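The callback sequence above amounts to dispatching through a table of function pointers. The Python sketch below models it under stated assumptions: SymFns only mirrors the four callback names mentioned in the text, while make_logging_fns and read_symbols are hypothetical helpers whose callbacks do nothing but record that they were called.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SymFns:
    """One set of callbacks per executable format (ELF, COFF, ...)."""
    sym_new_init: Callable[[str], None]
    sym_init: Callable[[str], None]
    sym_offsets: Callable[[str], None]
    sym_read: Callable[[str], None]


def make_logging_fns(prefix, log):
    # Hypothetical helper: build callbacks that only record their own name.
    def make(step):
        return lambda objfile: log.append(f"{prefix}_{step}({objfile})")
    return SymFns(make("new_init"), make("init"), make("offsets"), make("read"))


def read_symbols(objfile, fns, is_main):
    # The generic code never looks at the file format, only at the table.
    if is_main:
        fns.sym_new_init(objfile)  # reset reader state for a new main objfile
    fns.sym_init(objfile)          # per-objfile initialization
    fns.sym_offsets(objfile)
    fns.sym_read(objfile)          # minimal symbols first, then debug info
```

Supporting another executable format in this model only means providing another SymFns instance. read_symbols itself does not change.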

minimal symbols reading

An executable contains two different symbol tables:

.symtab, which contains all the symbols defined in the executable;
.dynsym, which contains only the symbols needed for dynamic loading. In the case of an executable, this means only references to symbols it needs to find in the shared objects it depends on in order to run. For shared objects, it’s the same, plus the symbols they expose to other objects.

In fact, .dynsym is generally a subset of .symtab. The purpose of keeping some duplicate information is that when a binary is stripped, the .symtab section is removed from the executable. The .dynsym section is kept so that dynamic loading still works. This means that when both sections are present, it is pointless to read the .dynsym section, since the same information is found in .symtab, which we need to read anyway. Therefore, gdb will adapt if both sections are present to avoid reading some symbols twice.
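The reasoning above can be reduced to a small decision function. This is a deliberately simplified sketch, assuming that .dynsym can be ignored entirely whenever .symtab is present. gdb's actual handling is more nuanced, and the function name is hypothetical.

```python
def symbol_tables_to_read(section_names):
    """Return the symbol table sections worth reading."""
    present = set(section_names)
    if ".symtab" in present:
        # .dynsym is generally a subset of .symtab, so reading it as well
        # would only produce symbols we already have.
        return [".symtab"]
    if ".dynsym" in present:
        # Stripped binary: only the dynamic symbols are left.
        return [".dynsym"]
    return []
```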

So, the ELF symbol reading code iterates over the symbols provided by the BFD library and, for each of the relevant ones, asks the minsyms machinery to create an instance of struct minimal_symbol. The structure is put in a staging area (called minsym bunches), presumably to avoid memory fragmentation and allocation overhead.

(TODO: a word about special handling of got/plt sections? synthetic symbols?)

Once all the symbols are created, the ELF code asks the minsyms code to install the minimal symbols in the objfile, which consists of copying the structures from the staging area allocated by the minsyms module to some space belonging to the objfile. Symbols are stored sorted by order of increasing address. The table is then compacted by removing any possible duplicate entries.

Finally, a hash table is built to map symbol names to addresses, which makes it quick to find the address matching a given symbol. Finding the symbol matching a given address is also possible, by doing a binary search on the sorted table.

At this point, the objfile has a table of minimal symbols loaded and ready to be used. The work done so far can be done on any ELF executable, even those without debug info. The ELF symbol parser now moves on to the task of parsing the DWARF debug information.
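The installation step (sort, compact, hash) and the two kinds of lookup can be sketched in Python as follows. MinimalSymbol and MinimalSymbolTable are hypothetical models, not gdb's structures. In particular, symbol sizes are ignored, so symbol_at simply returns the closest symbol starting at or before the given address.

```python
import bisect
from typing import NamedTuple


class MinimalSymbol(NamedTuple):
    address: int
    name: str


class MinimalSymbolTable:
    def __init__(self, staged):
        # Sort by increasing address, then compact away duplicate entries.
        self.symbols = []
        for sym in sorted(staged):
            if not self.symbols or self.symbols[-1] != sym:
                self.symbols.append(sym)
        self._addresses = [sym.address for sym in self.symbols]
        # Name -> address hash table for fast lookups by name.
        self._by_name = {}
        for sym in self.symbols:
            self._by_name.setdefault(sym.name, sym.address)

    def address_of(self, name):
        return self._by_name.get(name)

    def symbol_at(self, pc):
        # Binary search for the last symbol starting at or before pc.
        i = bisect.bisect_right(self._addresses, pc) - 1
        return self.symbols[i] if i >= 0 else None
```

The name lookup is a constant-time dictionary access, while the address lookup costs a logarithmic number of comparisons thanks to the sorted table.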

debug info loading - partial symbol tables

The first thing gdb needs to do in order to load the DWARF info is to detect whether any is actually present in the executable. To do so, it iterates over the sections of the executable to see if any of them is related to DWARF. gdb saves in the objfile structure references to any relevant section it finds (such as the .debug_info section).
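As an illustration, the detection can be modelled as a scan over section names. The sketch below rests on two assumptions that the text does not state: DWARF sections are recognized by the ".debug_" name prefix, and the presence of .debug_info is what decides whether debug info is usable. Both function names are hypothetical.

```python
def locate_dwarf_sections(section_names):
    """Map each DWARF-related section name to its index in the section table."""
    found = {}
    for index, name in enumerate(section_names):
        if name.startswith(".debug_"):
            found[name] = index
    return found


def has_dwarf_info(section_names):
    # Assumption for this sketch: .debug_info is what makes DWARF usable.
    return ".debug_info" in locate_dwarf_sections(section_names)
```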

As mentioned earlier, the loading of debug info first generates partial symbol tables. gdb does a quick pass over the .debug_info section, which describes all types, functions, declared variables, blocks of code, etc.

If an executable comes from multiple compilation units (e.g. main.o and other.o), the .debug_info section of the final executable is the concatenation of the .debug_info sections of all compilation units. gdb starts by creating an index of all compilation units' debug info by recording their offset in the .debug_info section. To iterate over all compilation units' debug info, it uses the length found in the header of each one of them to jump to the next.
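The header-hopping loop can be sketched in Python. The sketch assumes little-endian, 32-bit DWARF with the version 2 to 4 compilation unit header layout (a 4-byte unit_length that does not count itself, a 2-byte version, a 4-byte abbreviation offset and a 1-byte address size). The 64-bit DWARF format is rejected rather than handled. build_cu_index and make_fake_cu are hypothetical names, and make_fake_cu only exists to produce test data.

```python
import struct


def build_cu_index(debug_info: bytes):
    """Return the offset of every compilation unit in a .debug_info blob."""
    offsets = []
    offset = 0
    while offset < len(debug_info):
        (unit_length,) = struct.unpack_from("<I", debug_info, offset)
        if unit_length == 0xFFFFFFFF:
            raise ValueError("64-bit DWARF is not handled by this sketch")
        offsets.append(offset)
        # unit_length does not include the 4 bytes of the length field itself.
        offset += 4 + unit_length
    return offsets


def make_fake_cu(payload_size: int) -> bytes:
    # Hypothetical helper: a header (version 4, abbrev offset 0, 8-byte
    # addresses) followed by payload_size zero bytes standing in for DIEs.
    body = struct.pack("<HIB", 4, 0, 8) + bytes(payload_size)
    return struct.pack("<I", len(body)) + body
```

Concatenating three fake units of 21, 14 and 11 bytes, as a linker would concatenate the contributions of three object files, yields an index of three offsets without parsing any DIE.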

Once the index is built, the debug info of each compilation unit is processed individually.

The header of the compilation unit’s debug info is digested first, followed by the abbreviations table. Then, the debug info entry (DIE) representing the compilation unit is read. At this point, a struct partial_symtab is created for this compilation unit. An interesting thing to note is that the partial_symtab structure contains a pointer to the function to be used in order to read the corresponding full symbol table.

An address map is maintained for the whole objfile. It maps ranges of code addresses to compilation units, so that gdb can find very quickly which compilation unit a given pc comes from. The pc range for the compilation unit is easily obtained from the compilation unit DIE.
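One simple way to model such an address map is a list of ranges kept sorted by start address and searched with bisect. This is a sketch under assumptions: ranges are half-open ([start, end)) and do not overlap, and the AddressMap class below is a hypothetical model that does not reflect how gdb implements its own address map.

```python
import bisect


class AddressMap:
    """Maps non-overlapping [start, end) pc ranges to compilation units."""

    def __init__(self):
        self._starts = []
        self._entries = []  # (start, end, cu), kept sorted by start

    def add_range(self, start, end, cu):
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._entries.insert(i, (start, end, cu))

    def find(self, pc):
        # Last range starting at or before pc, if pc falls inside it.
        i = bisect.bisect_right(self._starts, pc) - 1
        if i >= 0:
            start, end, cu = self._entries[i]
            if start <= pc < end:
                return cu
        return None
```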

Then, the children of the compilation unit DIE are read, although gdb skips a lot of DIEs that are not useful for a partial symbol table. Even for the DIEs that it does read, it just sets a flag for some attributes to indicate that the attribute is present. An example of an interesting DIE to explore at this point is one representing an enumeration type. Since, in C at least, enumeration labels are in the global scope, we need to read them now, so that we know what the user refers to when one of them is used in an expression. Subprogram children, however, are irrelevant at this point. We only need to know the function names, but we don’t need to know about their arguments right away.
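The selective walk can be illustrated with a toy DIE tree. The tag strings are genuine DWARF tag names, but the Die class, the collect_partial_names function and the exact set of tags considered interesting are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Die:
    tag: str
    name: Optional[str] = None
    children: List["Die"] = field(default_factory=list)


def collect_partial_names(cu_die):
    """Collect the names a partial symbol table needs from a CU DIE."""
    names = []
    for die in cu_die.children:
        if die.tag == "DW_TAG_subprogram":
            # Only the function name matters now: parameters and locals
            # (the children) are not visited.
            if die.name:
                names.append(die.name)
        elif die.tag == "DW_TAG_enumeration_type":
            if die.name:
                names.append(die.name)
            # In C, enumerators are visible in the enclosing scope, so
            # they must be read right away.
            names.extend(child.name for child in die.children
                         if child.tag == "DW_TAG_enumerator")
        elif die.tag in ("DW_TAG_variable", "DW_TAG_typedef",
                         "DW_TAG_structure_type"):
            if die.name:
                names.append(die.name)
        # Any other kind of DIE is skipped in this sketch.
    return names
```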

For each interesting DIE, a struct partial_die_info is created, containing this basic information, such as the name. Depending on the kind of entry, the struct might be added to a hash table mapping the offset in the .debug_info section to the partial DIE structure. This way, if this DIE is referenced by offset at some point by another DIE, it will be possible to directly retrieve the partial_die_info structure we created for it.
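The offset-keyed hash table can be pictured with a Python dictionary. PartialDie below is a hypothetical, much reduced model of struct partial_die_info, and for simplicity every DIE is indexed, whereas the text says this depends on the kind of entry.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class PartialDie:
    offset: int                        # offset of the DIE in .debug_info
    tag: str
    name: Optional[str] = None
    type_offset: Optional[int] = None  # reference to another DIE, by offset


def index_partial_dies(dies: Iterable[PartialDie]) -> Dict[int, PartialDie]:
    # Hash table keyed by the section offset of each DIE.
    return {die.offset: die for die in dies}


def referenced_type_name(die, by_offset):
    # Follow a by-offset reference directly, without re-reading the section.
    if die.type_offset is None:
        return None
    target = by_offset.get(die.type_offset)
    return target.name if target else None
```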

Once all relevant partial_die_info structures have been created, they are all visited in order to extract the valuable information and finally fill the partial symbol table. Essentially, function names, variable names, enumeration labels, etc., are added to one of two lists in the objfile. The objfile contains one list for global/external symbols (those visible across compilation units) and another for static symbols, which are only visible from within the compilation unit. These lists in the objfile are actually temporary, as they are later sorted (only the global one, not sure why) and copied to the partial_symtab structure.
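A rough Python model of this last step: names are first split into two temporary lists, then the global list is sorted and both are copied into the partial symtab. The class and function names are hypothetical, and each entry is reduced to a name plus an "external" flag.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class PartialSymtab:
    filename: str
    global_syms: List[str] = field(default_factory=list)
    static_syms: List[str] = field(default_factory=list)


def fill_partial_symtab(pst: PartialSymtab,
                        entries: Iterable[Tuple[str, bool]]) -> PartialSymtab:
    """entries: (name, is_external) pairs taken from the partial DIEs."""
    objfile_globals, objfile_statics = [], []  # temporary lists in the objfile
    for name, is_external in entries:
        if is_external:
            objfile_globals.append(name)
        else:
            objfile_statics.append(name)
    # Only the global list is sorted, as described above.
    pst.global_syms = sorted(objfile_globals)
    pst.static_syms = list(objfile_statics)
    return pst
```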

The next step is to create another kind of symbol table, the file symbol tables. These are generated from the line number information contained in the DWARF information, and allow mapping code addresses to a particular line of source code, and vice versa. Again, for optimization purposes, gdb does not read the whole line number tables right away. During the initial stage, it goes through the line number information of all compilation units only to gather the names of all source and included files. The purpose of the file partial symbol table is essentially to say that the file exists. If the user mentions the file name in an expression, gdb will know what it refers to and will be able to go back to the DWARF to read the full information about the file. So, for each source and include file that was used to generate a compilation unit, gdb creates one partial_symtab structure. If a particular header file is included in multiple compilation units, as is often the case, as many file partial symbol tables will be created for it.
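The duplication described above is easy to see in a small model. In this sketch, the input dictionary stands for the file names gathered from the line number information of each compilation unit. The function name and the dictionary-based representation of a file partial symtab are assumptions.

```python
def create_file_partial_symtabs(line_table_files):
    """line_table_files maps each compilation unit to the file names
    listed in its line number information."""
    file_psymtabs = []
    for cu_name, filenames in line_table_files.items():
        for filename in filenames:
            # One entry per (compilation unit, file) pair: a header shared
            # by several compilation units is deliberately not de-duplicated.
            file_psymtabs.append({"filename": filename, "cu": cu_name})
    return file_psymtabs
```

With two compilation units that both include the same header, the header ends up with two entries, one per compilation unit.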

Conclusion

This is pretty much all that happens when the user issues the file command on a common ELF executable containing DWARF debug information. At this point, the initial stage of symbol parsing has been completed: partial symbols have been generated to model the executable. To sum up, gdb has done the following work:

created an objfile to represent the executable;
read minimal symbols from the ELF’s symbol table;
created a compilation unit partial symbol table for each compilation unit in the executable;
stored two lists of partial symbols in the objfile, for external and static names;
created a file partial symbol table for each involved source/header file.

This is pretty much all the information gdb needs in order to understand the expressions given by the user: file names, function names, type names... This is just enough to understand any query from the user and guide gdb to the place where it can find the remaining information, which will be fetched on an as-needed basis.

Actually, the very last step executed when using the file command is to initialize the information about the entry point of the program (i.e. the main function). For various reasons, gdb wants to know where it is located, so it looks up the symbol. This causes the compilation unit and file symtabs containing the entry point to be expanded and read into full symtabs.

None: How gdb loads symbol files (last edited 2015-02-15 01:11:15 by SimonMarchi)