Prelink

Jakub Jelínek
Red Hat, Inc.
jakub@redhat.com

[ This version extracted from PDF with pdftotext and edited for clarity ]

November 19, 2020

Abstract

Prelink is a tool designed to speed up dynamic linking of ELF programs on various Linux architectures. It speeds up start up of

============================================================================
1 Preface

In 1995, Linux changed its binary format from a.out to ELF. The a.out binary format was very inflexible and shared libraries were pretty hard to build. Linux’s shared libraries in a.out were position dependent and each had to be given a unique virtual address space slot at link time. Maintaining these assignments was pretty hard even when there were just a few shared libraries; there used to be a central address registry maintained by humans in the form of a text file. It is certainly impossible to do these days, when there are thousands of different shared libraries and their size, version and exported symbols are constantly changing. On the other hand, there was just a minimum amount of work the dynamic linker had to do in order to load these shared libraries, as relocation handling and symbol lookup were only done at link time. The dynamic linker used the uselib system call, which just mapped the named library into the address space (with no segment or section protection differences; the whole mapping was writable and executable).

The ELF binary format is one of the most flexible binary formats. [ As described in generic ABI document [1] and various processor specific ABI supplements [2], [3], [4], [5], [6], [7], [8]. ] Its shared libraries are easy to build and there is no need for a central assignment of virtual address space slots. Shared libraries are position independent and relocation handling and symbol lookup are done partly at the time the executable is created and partly at runtime.
Symbols in shared libraries can be overridden at runtime by preloading a new shared library defining those symbols, or, without relinking an executable, by adding symbols to a shared library which is searched earlier during symbol lookup or by adding new dependent shared libraries to a library used by the program. All these improvements have their price:

- slower program startup
- more non-shareable memory per process
- runtime cost associated with position independent code in shared libraries

Program startup of ELF programs is slower than startup of a.out programs with shared libraries, because the dynamic linker has much more work to do before calling the program’s entry point. The cost of loading libraries is just slightly bigger, as ELF shared libraries typically have separate read-only and writable segments, so the dynamic linker has to use different memory protection for each segment. The main difference is in relocation handling and associated symbol lookup. In the a.out format there was no relocation handling or symbol lookup at runtime. In ELF, this cost is much more important today than it used to be during the a.out to ELF transition in Linux, as especially GUI programs keep constantly growing and start to use more and more shared libraries. 5 years ago programs using more than 10 shared libraries were very rare; these days most of the GUI programs link against around 40 or more shared libraries, and in extreme cases programs use even more than 90 shared libraries. Every shared library adds its set of dynamic relocations to the cost and enlarges the symbol search scope, so in addition to doing more symbol lookups, each symbol lookup the application has to perform is on average more expensive.
Another factor increasing the cost is the length of symbol names which have to be compared when finding symbol in the symbol hash table of a shared library: C++ libraries tend to have extremely long symbol names and unfortunately the new C++ ABI puts namespaces and class names first and method names last in the mangled names, so often symbol names differ only in last few bytes of very long names. Every time a relocation is applied the entire memory page containing the address which is written to must be loaded into memory. The operating system does a copy-on-write operation which also has the consequence that the physical memory of the memory page cannot anymore be shared with other processes. With ELF, typically all of program’s Global Offset Table, constants and variables containing pointers to objects in shared libraries, etc. are written into before the dynamic linker passes control over to the program. On most architectures (with some exceptions like AMD64 architecture) position independent code requires that one register needs to be dedicated as PIC register and thus cannot be used in the functions for other purposes. This especially degrades performance on register-starved architectures like IA-32. Also, there needs to be some code to set up the PIC register, either invoked as part of function prologues, or when using function descriptors in the calling sequence. Prelink is a tool which (together with corresponding dynamic linker and linker changes) attempts to bring back some of the a.out advantages (such as the speed and less COW’d pages) to the ELF binary format while retaining all of its flexibility. In a limited way it also attempts to decrease number of nonshareable pages created by relocations. Prelink works closely with the dynamic linker in the GNU C library, but probably it wouldn’t be too hard to port it to some other ELF using platforms where the dynamic linker can be modified in similar ways. 
============================================================================
2 Caching of symbol lookup results

Program startup can be sped up by caching symbol lookup results. Many shared libraries need more than one lookup of a particular symbol. This is especially true for C++ shared libraries, where e.g. the same method is present in multiple virtual tables or RTTI data structures. Traditionally, each ELF section which needs dynamic relocations has an associated .rela* or .rel* section (depending on whether the architecture is defined to use RELA or REL relocations). The relocations in those sections are typically sorted by ascending r_offset values. Symbol lookups are usually the most expensive operation during program startup, so caching the symbol lookups has the potential to decrease the time spent in the dynamic linker. One way to decrease the cost of symbol lookups is to create, in the dynamic linker, a table with a size equal to the number of entries in the dynamic symbol table (.dynsym) when resolving a particular shared library, but that would in some cases need a lot of memory and some time spent in initializing the table. Another option would be to use a hash table with chained lists, but that needs extra memory and would also take extra time for computing the hash value and walking the chains when doing new lookups. Fortunately, neither of these is really necessary if we modify the linker to sort relocations so that relocations against the same symbol are adjacent. This was first done in the Sun linker and dynamic linker, so the GNU linker and dynamic linker use the same ELF extensions and linker flags. In particular, the following new ELF dynamic tags have been introduced:

    #define DT_RELACOUNT 0x6ffffff9
    #define DT_RELCOUNT  0x6ffffffa

New options -z combreloc and -z nocombreloc have been added to the linker. [ -z combreloc is the default in GNU linker versions 2.13 and later ] The latter causes the previous linker behavior, i.e.
each section requiring relocations has a corresponding relocation section, which is sorted by ascending r_offset. -z combreloc instructs the linker to create just one relocation section for dynamic relocations other than symbol jump table (PLT) relocations. This single relocation section (either .rela.dyn or .rel.dyn) is sorted, so that relative relocations come first (sorted by ascending r_offset), followed by other relocations, sorted again by ascending r_offset. [ In fact sorting needs to include the type of lookup. Most relocations resolve to a PLT slot in the executable if there is one for the lookup symbol, because the executable might have a pointer against that symbol without any dynamic relocations. But e.g. relocations used for the PLT slots must avoid these. ] If more relocations are against the same symbol, they immediately follow the first relocation against that symbol with lowest r_offset. The number of relative relocations at the beginning of the section is stored in the DT_RELACOUNT or DT_RELCOUNT dynamic tag, respectively. The dynamic linker can use the new dynamic tag for two purposes. If the shared library is successfully mapped at the same address as the first PT_LOAD segment’s virtual address, the load offset is zero and the dynamic linker can avoid all the relative relocations, which would just add zero to various memory locations. Normally shared libraries are linked with the first PT_LOAD segment’s virtual address set to zero, so the load offset is non-zero. This can be changed through a linker script or by using a special prelink option --reloc-only to change the base address of a shared library. All prelinked shared libraries have a non-zero base address as well.
If the load offset is non-zero, the dynamic linker can still make use of this dynamic tag, as relative relocation handling is typically way simpler than handling other relocations (since symbol lookup is not necessary) and thus it can handle all relative relocations in a tight loop in one place and then handle the remaining relocations with the fully featured relocation handling routine.

----------------------------------------------------------------------------
The second and more important point is that if relocations against the same symbol are adjacent, the dynamic linker can use a cache with single entry.
----------------------------------------------------------------------------

The dynamic linker in glibc, if it sees statistics as part of the LD_DEBUG environment variable, displays statistics which can show how useful this optimization is. Let’s look at some big C++ application, e.g. konqueror. If not using the cache, the statistics look like this:

    runtime linker statistics:
      total startup time in dynamic loader: 270886059 clock cycles
                time needed for relocation: 266364927 clock cycles (98.3%)
                     number of relocations: 79067
          number of relocations from cache: 0
            number of relative relocations: 31169
               time needed to load objects: 4203631 clock cycles (1.5%)

This program run is with hot caches, on a non-prelinked system, with lazy binding. The numbers show that the dynamic linker spent most of its time in relocation handling and especially symbol lookups. If using the symbol lookup cache, the numbers look different:

      total startup time in dynamic loader: 132922001 clock cycles
                time needed for relocation: 128399659 clock cycles (96.5%)
                     number of relocations: 25473
          number of relocations from cache: 53594
            number of relative relocations: 31169
               time needed to load objects: 4202394 clock cycles (3.1%)

On average, for one real symbol lookup there were two cache hits, and the total time spent in the dynamic linker decreased by about 50%.
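The two uses of the relative relocation count described above can be illustrated with a small model. This is only a sketch in Python, not glibc code: the function name, the tuple layout of a relocation and the example addresses are all hypothetical, and a real dynamic linker works on ELF structures in memory.

```python
# Toy model of processing a -z combreloc style relocation section.
# All names and values are illustrative assumptions, not glibc identifiers.

def apply_relocations(relocs, relcount, load_offset, lookup):
    """Return (writes, real_lookups, cache_hits).

    relocs      -- list of (r_offset, symbol); the first `relcount` entries
                   are relative relocations and carry symbol None
    load_offset -- actual load address minus the link-time base address
    lookup      -- function mapping a symbol name to its address
    """
    writes = {}

    # Relative relocations need no symbol lookup, only the load offset.
    # With a zero load offset the whole block can be skipped.
    if load_offset != 0:
        for r_offset, _ in relocs[:relcount]:
            writes[r_offset] = ("relative", load_offset)

    # Relocations against the same symbol are adjacent, so remembering
    # only the most recent lookup result is enough.
    cached_symbol, cached_value = None, None
    real_lookups = cache_hits = 0
    for r_offset, symbol in relocs[relcount:]:
        if symbol == cached_symbol:
            cache_hits += 1
        else:
            cached_symbol, cached_value = symbol, lookup(symbol)
            real_lookups += 1
        writes[r_offset] = ("symbol", cached_value)
    return writes, real_lookups, cache_hits


symbols = {"foo": 0x1000, "bar": 0x2000}
relocs = [
    (0x10, None), (0x18, None),                   # relative relocations
    (0x20, "foo"), (0x28, "foo"), (0x90, "foo"),  # grouped by symbol
    (0x40, "bar"), (0x80, "bar"),
]
writes, real, hits = apply_relocations(relocs, 2, 0x54321000, symbols.__getitem__)
print(real, hits)  # prints: 2 3
```

With five symbol relocations against two symbols, only two real lookups are needed; the other three are served from the single-entry cache.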
============================================================================
3 Prelink design

Prelink was designed so as to require as few ELF extensions as possible. It should not be tied to a particular architecture, but should work on all ELF architectures. During program startup it should avoid all symbol lookups which, as has been shown above, are very expensive. It needs to work in an environment where shared libraries and executables are changing from time to time, whether it is because of security updates or feature enhancements. It should avoid big code duplication between the dynamic linker and the tool. And prelinked shared libraries need to be usable even in non-prelinked executables, or when one of the shared libraries is upgraded and the prelinking of the executable has not been updated.

To minimize the number of relocations performed during startup, the shared libraries (and executables) need to be relocated already as much as possible. For relative relocations this means the library always needs to be loaded at the same base address; for other relocations this means all shared libraries with definitions those relocations resolve to (often this includes all shared libraries the library or executable depends on) must always be loaded at the same addresses. ELF executables (with the exception of Position Independent Executables) have their load address fixed already during linking. For shared libraries, prelink needs something similar to the a.out registry of virtual address space slots. Maintaining such a registry across all installations wouldn’t scale well, so prelink instead assigns these virtual address space slots on the fly after looking at all executables it is supposed to speed up and all their dependent shared libraries. The next step is to actually relocate shared libraries to the assigned base address. When this is done, the actual prelinking of shared libraries can be done.
First, all dependent shared libraries need to be prelinked (prelink doesn’t support circular dependencies between shared libraries, will just warn about them instead of prelinking the libraries in the cycle). Then for each relocation in the shared library prelink needs to look up the symbol in natural symbol search scope of the shared library (the shared library itself first, then breadth first search of all dependent shared libraries) and apply the relocation to the symbol’s target section. The symbol lookup code in the dynamic linker is quite complex and big, so to avoid duplicating all this, prelink has chosen to use dynamic linker to do the symbol lookups. Dynamic linker is told via a special environment variable it should print all performed symbol lookups and their type and prelink reads this output through a pipe. As one of the requirements was that prelinked shared libraries must be usable even for non-prelinked executables (duplicating all shared libraries so that there are pristine and prelinked copies would be very unfriendly to RAM usage), prelink has to ensure that by applying the relocation no information is lost and thus relocation processing can be cheaply done at startup time of non-prelinked executables. For RELA architectures this is easier, because the content of the relocation’s target memory is not needed when processing the relocation. [ Relative relocations on certain RELA architectures use relocation target’s memory, either alone or together with r_addend field. ] For REL architectures this is not the case. prelink attempts some tricks described later and if they fail, needs to convert the REL relocation section to RELA format where addend is stored in the relocation section instead of relocation target’s memory. When all shared libraries an executable (directly or indirectly) depends on are prelinked, relocations in the executable are handled similarly to relocations in shared libraries. 
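The natural symbol search scope mentioned above (the shared library itself first, then a breadth-first walk of its dependencies) can be sketched as follows. This is a minimal Python illustration with made-up library names and symbol sets; prelink itself does not implement the lookup but asks the dynamic linker to perform it.

```python
from collections import deque

def natural_scope(lib, needed):
    """Library itself first, then breadth-first over its dependencies.

    needed -- hypothetical map: library name -> list of libraries it needs
    """
    order, seen, queue = [], {lib}, deque([lib])
    while queue:
        cur = queue.popleft()
        order.append(cur)
        for dep in needed.get(cur, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return order

def lookup(symbol, scope, defines):
    """Return the first library in `scope` which defines `symbol`, or None."""
    for lib in scope:
        if symbol in defines.get(lib, ()):
            return lib
    return None


# Illustrative dependency graph and symbol definitions.
needed = {
    "libA.so": ["libB.so", "libC.so"],
    "libB.so": ["libD.so"],
    "libC.so": ["libD.so"],
}
defines = {"libC.so": {"helper"}, "libD.so": {"helper", "util"}}

scope = natural_scope("libA.so", needed)
print(scope)
print(lookup("helper", scope, defines), lookup("util", scope, defines))
```

Because libC.so precedes libD.so in the breadth-first order, a relocation in libA.so against helper binds to the definition in libC.so.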
Unfortunately, not all symbols resolve the same when looked up in a shared library’s natural symbol search scope (i.e. as it is done at the time the shared library is prelinked) and when looked up in application’s global symbol search scope. Such symbols are herein called conflicts and the relocations against those symbols conflicting relocations. Conflicts depend on the executable, all its shared libraries and their respective order. They are only computable for the shared libraries linked to the executable (libraries mentioned in DT_NEEDED dynamic tags and shared libraries they transitively need). The set of shared libraries loaded via dlopen(3) cannot be predicted by prelink, neither can the order in which this happened, nor the time when they are unloaded. When the dynamic linker prints symbol lookups done in the executable, it also prints conflicts. Prelink then takes all relocations against those symbols and builds a special RELA section with conflict fixups and stores it into the prelinked executable. Also a list of all dependent shared libraries in the order they appear in the symbol search scope, together with their checksums and times of prelinking is stored in another special section. The dynamic linker first checks if it is itself prelinked. If yes, it can avoid its preliminary relocation processing (this one is done with just the dynamic linker itself in the search scope, so that all routines in the dynamic linker can be used easily without too many limitations). When it is about to start a program, it first looks at the library list section created by prelink (if any) and checks whether they are present in symbol search scope in the same order, none have been modified since prelinking and that there aren’t any new shared libraries loaded either. If all these conditions are satisfied, prelinking can be used. In that case the dynamic linker processes the fixup section and skips all normal relocation handling. 
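The startup check on the library list can be modelled roughly like this. The sketch is in Python and rests on assumptions: the record layout (name, checksum, prelink time) follows the description above, but the function names and the exact comparison are hypothetical and not taken from the dynamic linker’s sources.

```python
# Hypothetical record layout: (soname, checksum, prelink_time).

def prelink_info_usable(recorded, loaded):
    """recorded -- library list stored in the prelinked executable
    loaded   -- libraries actually present in the search scope, in order
    """
    if len(recorded) != len(loaded):
        return False  # a library was added or removed
    # Same libraries, same order, none modified since prelinking.
    return all(rec == cur for rec, cur in zip(recorded, loaded))

def relocation_strategy(recorded, loaded):
    """Decide between the fast path and normal relocation processing."""
    if recorded is not None and prelink_info_usable(recorded, loaded):
        return "apply conflict fixups only"
    return "normal relocation processing"


recorded = [("libfoo.so.1", 0x1234, 1000), ("libc.so.6", 0xBEEF, 1000)]
upgraded = [("libfoo.so.1", 0x9999, 2000), ("libc.so.6", 0xBEEF, 1000)]

print(relocation_strategy(recorded, list(recorded)))
print(relocation_strategy(recorded, upgraded))
```

Any mismatch, whether a changed checksum, a different order or an extra library, sends the whole program down the normal relocation path.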
If one or more of the conditions are not met, the dynamic linker continues with normal relocation processing in the executable and all shared libraries.

============================================================================
4 Collecting executables and libraries which should be prelinked

Before the actual work can start, the prelink tool needs to collect the filenames of executables and libraries it is supposed to prelink. It doesn’t make any sense to prelink a shared library if no executable is linked against it, because the prelinking information will not be used anyway. Furthermore, when prelink needs to do a REL to RELA conversion of relocation sections in the shared library (see later) or when it needs to convert a SHT_NOBITS PLT section to SHT_PROGBITS, a prelinked shared library might grow in size, and so prelinking is only desirable if it will speed up startup of some program. The only change which might be useful even for shared libraries which are never linked against, only loaded using dlopen, is relocating to a unique address. This is useful if there are many relative relocations and there are pages in the shared library’s writable segment which are never written into with the exception of those relative relocations. Such shared libraries are rare, so prelink doesn’t handle these automatically; instead the administrator or developer can use prelink --reloc-only=ADDRESS to relocate it manually. Prelinking an executable requires all shared libraries it is linked against to be prelinked already.

Prelink has two main modes in which it collects filenames. One is incremental prelinking, where prelink is invoked without the -a option. In this mode, prelink queues for prelinking all executables and shared libraries given on the command line, all executables in directory trees specified on the command line, and all shared libraries those executables and shared libraries are linked against.
For the reasons mentioned earlier a shared library is queued only if a program is linked with it or the user tells the tool to do it anyway by explicitly mentioning it on the command line. The second mode is full prelinking, where the -a option is given on the command line. This in addition to incremental prelinking queues all executables found in directory trees specified in prelink.conf (which typically includes all or most directories where system executables are found). For each directory subtree in the config file the user can specify whether symbolic links to places outside of the tree are to be followed or not and whether searching should continue even across filesystem boundaries. There is also an option to blacklist some executables or directory trees so that the executables or anything in the directory trees will not be prelinked. This can be specified either on the command line or in the config file. Prelink will not attempt to change executables which use a non-standard dynamic linker for security reasons, because it actually needs to execute the dynamic linker for symbol lookup and it needs to avoid executing some random unknown executable with the permissions with which prelink is run (typically root, with the permissions at least for changing all executables and shared libraries in the system). [ Standard dynamic linker path is hardcoded in the executable for each architecture. It can be overridden from the command line, but only with one dynamic linker name (normally, multiple standard dynamic linkers are used when prelinking mixed architecture systems). ] The administrator should ensure that prelink.conf doesn’t contain world-writable directories and such directories are not given to the tool on the command line either, but the tool should be distrustful of the objects nevertheless. 
Also, prelink will not change shared libraries which are not specified directly on the command line or located in the directory trees specified on the command line or in the config file. This is so that e.g. prelink doesn’t try to change shared libraries on shared networked filesystems, or at least it is possible to configure the tool so that it doesn’t do it. For each executable and shared library it collects, prelink executes the dynamic linker to list all shared libraries it depends on, checks if it is already prelinked and whether any of its dependencies changed. Objects which are already prelinked and have no dependencies which changed don’t have to be prelinked again (with the exception when e.g. virtual address space layout code finds out it needs to assign new virtual address space slots for the shared library or one of its dependencies). Running the dynamic linker to get the symbol lookup information is a quite costly operation especially on systems with many executables and shared libraries installed, so prelink offers a faster -q mode. In all modes, prelink stores modification and change times of each shared library and executable together with all object dependencies and other information into prelink.cache file. When prelinking in -q mode, it just compares modification and change times of the executables and shared libraries (and all their dependencies). Change time is needed because prelink preserves modification time when prelinking (as well as permissions, owner and group). If the times match, it assumes the file has not changed since last prelinking. Therefore the file can be skipped if it is already prelinked and none of the dependencies changed. If any time changed or one of the dependencies changed, it invokes the dynamic linker the same way as in normal mode to find out real dependencies, whether it has been prelinked or not etc. 
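The quick mode decision described above can be sketched as follows. This is a Python illustration under assumptions: the in-memory dictionary stands in for prelink.cache (whose real format is not shown here), and the function names are hypothetical.

```python
import os

def snapshot(path):
    """Modification and change time of a file, in nanoseconds."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_ctime_ns)

def can_skip(path, deps, cache, prelinked):
    """True if `path` may be skipped in quick mode.

    cache     -- hypothetical stand-in for prelink.cache:
                 path -> (mtime_ns, ctime_ns) recorded at the last run
    prelinked -- set of paths known to be prelinked already
    """
    if path not in prelinked:
        return False
    # Only stat calls are needed; the dynamic linker is not executed.
    for p in [path, *deps]:
        if p not in cache or cache[p] != snapshot(p):
            return False
    return True
```

The change time matters because prelink restores the modification time of the files it rewrites, so the modification time alone would not reveal a replaced file.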
The collecting phase in normal mode can take a few minutes, while in quick mode it usually takes just a few seconds, as the only thing it does is issue lots of stat system calls.

============================================================================
5 Assigning virtual address space slots

Prelink has to ensure at least that for all successfully prelinked executables all shared libraries they are (transitively) linked against have non-overlapping virtual address space slots. Furthermore, they cannot overlap with the virtual address space range used by the executable itself, its brk area, typical stack location and ld.so.cache and other files mmaped by the dynamic linker in early stages of dynamic linking (before all dependencies are mmaped). If there were any overlaps, the dynamic linker (which mmaps the shared libraries at the desired location without the MAP_FIXED mmap flag, so that it is only a soft requirement) would not manage to mmap them at the assigned locations and the prelinking information would be invalidated (the dynamic linker would have to do all normal relocation handling and symbol lookups). Executables are linked against a very wide variety of shared library combinations and that has to be taken into account.

The simplest approach is to sort shared libraries by descending usage count (so that the most often used shared libraries like the dynamic linker, libc.so etc. are close to each other) and assign them consecutive slots starting at some architecture specific base address (with a page or two in between the shared libraries to allow for a limited growth of shared libraries without having to reposition them). Prelink has to find out which shared libraries will need a REL to RELA conversion of relocation sections and, for those which will need the conversion, account for the increased size of the library’s loadable segments. This is prelink behavior without -m and -R options.
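The simplest layout scheme can be sketched in a few lines. This is a Python illustration only: the base address, the page size, the usage counts and the library sizes are made-up values, and the sizes are assumed to already include any growth from a REL to RELA conversion.

```python
PAGE = 0x1000  # assumed page size

def page_align(n):
    """Round n up to the next page boundary."""
    return (n + PAGE - 1) & ~(PAGE - 1)

def assign_slots(libs, base, gap_pages=2):
    """Assign consecutive slots by descending usage count.

    libs -- hypothetical map: name -> (usage_count, size_in_bytes)
    Returns a map: name -> assigned base address.
    """
    layout = {}
    addr = base
    # Most used libraries first; ties broken by name for determinism.
    for name, (_, size) in sorted(libs.items(), key=lambda kv: (-kv[1][0], kv[0])):
        layout[name] = addr
        # Leave a couple of pages so a library can grow a little
        # without forcing its neighbours to move.
        addr += page_align(size) + gap_pages * PAGE
    return layout


libs = {
    "libfoo.so.1": (3, 0x1800),
    "libc.so.6": (120, 0x150000),
    "libm.so.6": (40, 0x23000),
}
for name, addr in assign_slots(libs, 0x41000000).items():
    print(f"{addr:#010x} {name}")
```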
The architecture specific base address is best located a few megabytes above the location where mmap with NULL first argument and without MAP_FIXED starts allocating memory areas (in Linux this is the value of TASK_UNMAPPED_BASE macro). The reason for not starting to assign addresses in prelink immediately at TASK_UNMAPPED_BASE is that ld.so.cache and other mappings by the dynamic linker will end up in the same range and could overlap with the shared libraries. [ TASK_UNMAPPED_BASE has been chosen on each platform so that there is enough virtual memory for both the brk area (between executable’s end and this memory address) and mmap area (between this address and bottom of stack). ] Also, if some application uses dlopen to load a shared library which has been prelinked*, those few megabytes above TASK_UNMAPPED_BASE increase the probability that the stack slot will be still unused (it can clash with e.g. non-prelinked shared libraries loaded by dlopen earlier** or other kinds of mmap calls with NULL first argument like malloc allocating big chunks of memory, mmaping of locale database, etc.). * [ Typically this is because some other executable is linked against that shared library directly. ] ** [ If shared libraries have first PT_LOAD segment’s virtual address zero, the kernel typically picks first empty slot above TASK_UNMAPPED_BASE big enough for the mapping.] This simplest approach is unfortunately problematic on 32-bit (or 31-bit) architectures where the total virtual address space for a process is somewhere between 2GB (S/390) and almost 4GB (Linux IA-32 4GB/4GB kernel split, AMD64 running 32-bit processes, etc.). Typical installations these days contain thousands of shared libraries and if each of them is given a unique address space slot, on average executables will have pretty sparse mapping of its shared libraries and there will be less contiguous virtual memory for application’s own use. 
[ Especially databases look these days for every byte of virtual address space on 32-bit architectures. ] Prelink has a special mode, turned on with -m option, in which it computes what shared libraries are ever loaded together in some executable (not considering dlopen). If two shared libraries are ever loaded together, prelink assigns them different virtual address space slots, but if they never appear together, it can give them overlapping addresses. For example applications using KDE toolkit link typically against many KDE shared libraries, programs written using the Gtk+ toolkit link typically against many Gtk+ shared libraries, but there are just very few programs which link against both KDE and Gtk+ shared libraries, and even if they do, they link against very small subset of those shared libraries. So all KDE shared libraries not in that subset can use overlapping addresses with all Gtk+ shared libraries but the few exceptions. This leads to considerably smaller virtual address space range used by all prelinked shared libraries, but it has its own disadvantages too. It doesn’t work too well with incremental prelinking, because then not all executables are investigated, just those which are given on prelink’s command line. Prelink also considers executables in prelink.cache, but it has no information about executables which have not been prelinked yet. If a new executable, which links against some shared libraries which never appeared together before, is prelinked later, prelink has to assign them new, non-overlapping addresses. This means that any executables, which linked against the library that has been moved and re-prelinked, need to be prelinked again. If this happened during incremental prelinking, prelink will fix up only the executables given on the command line, leaving other executables untouched. The untouched executables would not be able to benefit from prelinking anymore. 
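The idea behind the -m mode can be illustrated with a greedy sketch: two libraries must get disjoint slots only if some executable loads both. The Python code below is an assumption-laden toy, not prelink’s actual algorithm; the library names, sizes and base address are invented, and sizes are taken to be page aligned already.

```python
from itertools import combinations

def coloaded_pairs(executables):
    """Set of library pairs that appear together in some executable."""
    pairs = set()
    for libs in executables.values():
        pairs.update(combinations(sorted(set(libs)), 2))
    return pairs

def assign_slots_m(sizes, executables, base):
    """Greedy layout in which libraries never loaded together may overlap."""
    pairs = coloaded_pairs(executables)

    def together(a, b):
        return (min(a, b), max(a, b)) in pairs

    layout = {}
    for lib in sorted(sizes):
        addr = base
        moved = True
        while moved:  # bump past every co-loaded library we overlap
            moved = False
            for other, start in layout.items():
                end = start + sizes[other]
                if together(lib, other) and addr < end and start < addr + sizes[lib]:
                    addr, moved = end, True
        layout[lib] = addr
    return layout


sizes = {"libc.so": 0x4000, "libgtk.so": 0x2000, "libkde.so": 0x3000}
executables = {"kapp": ["libc.so", "libkde.so"], "gapp": ["libc.so", "libgtk.so"]}
layout = assign_slots_m(sizes, executables, 0x1000000)
print({name: hex(addr) for name, addr in layout.items()})
```

In this example libgtk.so and libkde.so share a starting address because no executable uses both. Adding a single executable that links against both forces one of them to move, which is exactly the incremental prelinking problem described above.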
Although with the above two layout schemes shared library addresses can vary slightly between different hosts running the same distribution (depending on the exact set of installed executables and libraries), especially the most often used shared libraries will have identical base addresses on different computers. This is often not desirable for security reasons, because it makes it slightly easier for various exploits to jump to routines they want. Standard Linux kernels assign always the same addresses to shared libraries loaded by the application at each run, so with these kernels prelink doesn’t make things worse. But there are kernel patches, such as Red Hat’s Exec-Shield, which randomize memory mappings on each run. If shared libraries are prelinked, they cannot be assigned different addresses on each run (prelinking information can be only used to speed up startup if they are mapped at the base addresses which was used during prelinking), which means prelinking might not be desirable on some edge servers. Prelink can assign different addresses on different hosts though, which is almost the same as assigning random addresses on each run for long running processes such as daemons. Furthermore, the administrator can force full prelinking and assignment of new random addresses every few days (if he is also willing to restart the services, so that the old shared libraries and executables don’t have to be kept in memory). To assign random addresses prelink has the -R option. This causes a random starting address somewhere in the architecture specific range in which shared libraries are assigned, and minor random reshuffling in the queue of shared libraries which need address assignment (normally it is sorted by descending usage count, with randomization shared libraries which are not very far away from each other in the sorted list can be swapped). The -R option should work orthogonally to the -m option. 
Some architectures have special further requirements on shared library address assignment. On 32-bit PowerPC, if shared libraries are located close to the executable, so that everything fits into 32MB area, PLT slots resolving to those shared libraries can use the branch relative instruction instead of more expensive sequences involving memory load and indirect branch. If shared libraries are located in the first 32MB of address space, PLT slots resolving to those shared libraries can use the branch absolute instruction (but already PLT slots in those shared libraries resolving to addresses in the executable cannot be done cheaply). This means for optimization prelink should assign addresses from a 24MB region below the executable first, assuming most of the executables are smaller than those remaining 8MB. prelink assigns these from higher to lower addresses. When this region is full, prelink starts from address 0x40000 up till the bottom of the first area. [ To leave some pages unmapped to catch NULL pointer dereferences. ] Only when all these areas are full, prelink starts picking addresses high above the executable, so that sufficient space is left in between to leave room for brk. When -R option is specified, prelink needs to honor it, but in a way which doesn’t totally kill this optimization. So it picks up a random start base within each of the 3 regions separately, splitting them into 6 regions. Another architecture which needs to be handled specially is IA-32 when using Exec-Shield. The IA-32 architecture doesn’t have a bit to disable execution for each page, only for each segment. All readable pages are normally executable: This means the stack is usually executable, as is memory allocated by malloc. This is undesirable for security reasons, exploits can then overflow a buffer on the stack to transfer control to code it creates on the stack. Only very few programs actually need an executable stack. 
For example, programs using GCC trampolines for nested functions need it, as do applications which create executable code on the stack themselves and call it. Exec-Shield works around this IA-32 architecture deficiency by using a separate code segment, which starts at address 0 and spans the address space up to its limit, the highest page which needs to be executable. The limit is changed dynamically when some page with a higher address than the limit needs to be executable (either because of mmap with the PROT_EXEC bit set, or mprotect with PROT_EXEC on an existing mapping). This kind of protection is of course only effective if the limit is as low as possible. The kernel tries to put all new mappings with PROT_EXEC set and a NULL address low: if possible into the ASCII Shield area (the first 16MB of address space), and if not, at least below the executable. If prelink detects Exec-Shield, it tries to do the same as the kernel when assigning addresses, i.e. it prefers to assign addresses in the ASCII Shield area and continues with other addresses below the program. It needs to leave the first 1MB plus 4KB of address space unallocated though, because that range is often used by programs using the vm86 system call.

============================================================================
6 Relocation of libraries

When a shared library has a base address assigned, it needs to be relocated so that the base address is equal to the first PT_LOAD segment’s p_vaddr. The effect of this operation should be bitwise identical to linking the library with that base address originally.
That is, the following scripts should produce identical output:

    $ gcc -g -shared -o libfoo.so.1.0.0 -Wl,-h,libfoo.so.1 \
        input1.o input2.o somelib.a
    $ prelink --reloc-only=0x54321000 libfoo.so.1.0.0

and:

    $ gcc -shared -Wl,--verbose 2>&1 > /dev/null \
        | sed -e '/^======/,/^======/!d' \
              -e '/^======/d;s/0\( + SIZEOF_HEADERS\)/0x54321000\1/' \
        > libfoo.so.lds
    $ gcc -Wl,-T,libfoo.so.lds -g -shared -o libfoo.so.1.0.0 \
        -Wl,-h,libfoo.so.1 input1.o input2.o somelib.a

The first script creates a normal shared library with the default base address 0 and then uses prelink’s special mode in which it just relocates a library to a given address. The second script first modifies the built-in GNU linker script for linking shared libraries, so that the base address is the one given instead of zero, and stores it into a temporary file. Then it creates a shared library using that linker script.

The relocation operation mostly involves adding the difference between the old and new base address to all ELF fields which contain values representing virtual addresses of the shared library (or, in the program header table, also physical addresses). File offsets must be left unmodified. Most places where the adjustments need to be done are clear; prelink just has to follow the ELF specification to see which fields contain virtual addresses.

One problem is with absolute symbols. Prelink has no way to find out whether an absolute symbol in a shared library is really meant as absolute, and thus must not change during relocation, or whether it is the address of some place in the shared library outside of any section or on the edge of one. For instance, symbols created in a GNU linker script outside of section directives all have the SHN_ABS section index, yet they can be a location in the library (e.g. symbolfoo = .) or they can be absolute (e.g. symbolbar = 0x12345000). This distinction is lost at link time.
But the dynamic linker doesn’t make any distinction between them when looking up symbols; all addresses have the load offset added to them during dynamic lookup. Prelink chooses to relocate any absolute symbol with a value bigger than zero; that way prelink --reloc-only gets output bitwise identical to linking directly at the different base in almost all real-world cases. Thread Local Storage symbols (those with STT_TLS type) are never relocated, as their values are relative to the start of the shared library’s thread local area.

When relocating the dynamic section there are no bits which tell whether a particular dynamic tag uses d_un.d_ptr (which needs to be adjusted) or d_un.d_val (which needs to be left as is), so prelink has to hardcode a list of well known architecture independent dynamic tags which need adjusting and has a hook for architecture specific dynamic tag adjustment. Sun came up with the DT_ADDRRNGLO to DT_ADDRRNGHI and DT_VALRNGLO to DT_VALRNGHI dynamic tag number ranges, so at least as long as these ranges are used for new dynamic tags, prelink can relocate correctly even without listing them all explicitly.

When relocating .rela.* or .rel.* sections, which is done in architecture specific code, relative relocations and, on architectures using .got.plt, also PLT relocations typically need an adjustment. The adjustment needs to be done either in the r_addend field of the ElfNN_Rela structure, in the memory pointed to by r_offset, or in both locations. On some architectures what needs adjusting is not even the same for all relative relocations: relative relocations against some sections need to have r_addend adjusted while others need to have memory adjusted. On many architectures, the first few words in the GOT are special and some of them need adjustment.

The hardest part of the adjustment is handling the debugging sections. These are non-allocated sections which typically have no corresponding relocation section associated with them.
Prelink has to match the various debuggers in which fields it adjusts and which it skips. As of this writing prelink should handle the DWARF 2 [15] standard as corrected (and extended) by the DWARF 3 draft [16], Stabs [17] with GCC extensions, and Alpha or MIPS Mdebug.

DWARF 2 debugging information involves many separate sections, each with a unique format which needs to be relocated differently. To relocate the compilation units of the .debug_info section, prelink has to parse the corresponding part of the .debug_abbrev section, adjust all values of attributes that use the DW_FORM_addr form, and adjust embedded location lists. Portions of the .debug_ranges and .debug_loc sections depend on the exact place in the .debug_info section from which they are referenced, so prelink has to keep track of their base address. The DWARF debugging format is very extensible, so prelink needs to be very conservative when it sees unknown extensions. It needs to fail prelinking, instead of silently breaking debugging information, if it sees an unknown .debug_* section, an unknown attribute form, or an unknown attribute with one of the DW_FORM_block* forms, as they can potentially embed addresses which would need adjustment.

For stabs prelink tries to match GDB behavior. For N_FUN, it needs to differentiate between function start and function address, which are both encoded with this type; the remaining types either always need relocating or never do. And similarly to DWARF 2 handling, it needs to reject unknown types.

The relocation code in prelink is a little more generic than what is described above, as it is also used by other parts of prelink when growing sections in the middle of a shared library during REL to RELA conversion. All adjustment functions are passed both the offset they should add to virtual addresses and a start address. An adjustment is only done if the old virtual address was greater than or equal to the start address.
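The adjustment rule just described, together with the symbol and dynamic tag rules earlier in this section, can be sketched as follows. This is a simplified Python model, not prelink’s code: the function names are hypothetical, the list of pointer-valued tags is deliberately incomplete, and addresses are treated as plain integers. The numeric constants are the usual values from the ELF headers.

```python
# Dynamic tag ranges introduced by Sun.
DT_VALRNGLO, DT_VALRNGHI = 0x6ffffd00, 0x6ffffdff    # d_un.d_val tags
DT_ADDRRNGLO, DT_ADDRRNGHI = 0x6ffffe00, 0x6ffffeff  # d_un.d_ptr tags

# A few well known tags that hold addresses (list is NOT exhaustive):
# DT_PLTGOT, DT_HASH, DT_STRTAB, DT_SYMTAB, DT_RELA, DT_INIT, DT_FINI,
# DT_REL, DT_JMPREL, DT_INIT_ARRAY, DT_FINI_ARRAY
KNOWN_PTR_TAGS = {3, 4, 5, 6, 7, 12, 13, 17, 23, 25, 26}

STT_TLS = 6
SHN_ABS = 0xfff1

def adjust(addr, start, delta):
    """Add delta only to addresses at or above the start address."""
    return addr + delta if addr >= start else addr

def adjust_dynamic(entries, start, delta):
    """entries: list of (d_tag, value). Only pointer-valued tags move."""
    out = []
    for tag, value in entries:
        if DT_ADDRRNGLO <= tag <= DT_ADDRRNGHI or tag in KNOWN_PTR_TAGS:
            value = adjust(value, start, delta)
        out.append((tag, value))
    return out

def adjust_symbol(st_value, st_type, st_shndx, start, delta):
    """TLS symbols never move; absolute symbols move unless they are 0."""
    if st_type == STT_TLS:
        return st_value
    if st_shndx == SHN_ABS and st_value == 0:
        return st_value
    return adjust(st_value, start, delta)
```

With start set to 0 this models plain relocation of a library to a new base; a non-zero start models growing a section in the middle of the file, where only addresses past the insertion point shift.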
============================================================================
7 REL to RELA conversion

On architectures which normally use the REL format for relocations instead of RELA (IA-32, ARM and MIPS), if certain relocation types use the memory r_offset points to during relocation, prelink has to either convert them to a different relocation type which doesn’t use the memory value, or the whole .rel.dyn section needs to be converted to RELA format. Let’s describe it on an example on IA-32 architecture:

[ The rest of this section and the listings of test1.c and test2.c for the example below are missing from this extraction. ]

$ cat > test.c <<EOF
#include <stdio.h>
extern int i, *j, *k, *foo (void), bar (void);
int main (void)
{
#ifdef PRINT_I
  printf ("%p\n", &i);
#endif
  printf ("%p %p %p %p\n", j, k, foo (), bar ());
}
EOF
$ gcc -nostdlib -shared -fpic -s -o test1.so test1.c
$ gcc -nostdlib -shared -fpic -o test2.so test2.c ./test1.so
$ gcc -o test test.c ./test2.so ./test1.so
$ ./test
0x16137c 0x16137c 0x16137c 0x16137c
$ readelf -r ./test1.so
Relocation section '.rel.dyn' at offset 0x2bc contains 2 entries:
 Offset Info Type Sym.Value Sym.Name
000012e4 00000d01 R_386_32 00001368 i
00001364 00000d06 R_386_GLOB_DAT 00001368 i
$ prelink -N ./test ./test1.so ./test2.so
$ LD_WARN= LD_TRACE_PRELINKING=1 LD_BIND_NOW=1 /lib/ld-linux.so.2 ./test1.so
 ./test1.so => ./test1.so (0x04db6000, 0x00000000)
$ LD_WARN= LD_TRACE_PRELINKING=1 LD_BIND_NOW=1 /lib/ld-linux.so.2 ./test2.so
 ./test2.so => ./test2.so (0x04dba000, 0x00000000)
 ./test1.so => ./test1.so (0x04db6000, 0x00000000)
$ LD_WARN= LD_TRACE_PRELINKING=1 LD_BIND_NOW=1 /lib/ld-linux.so.2 ./test \
  | sed 's/^[[:space:]]*/ /'
 ./test => ./test (0x08048000, 0x00000000)
 ./test2.so => ./test2.so (0x04dba000, 0x00000000)
 ./test1.so => ./test1.so (0x04db6000, 0x00000000)
 libc.so.6 => /lib/tls/libc.so.6 (0x00b22000, 0x00000000) TLS(0x1, 0x00000028)
 /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x00b0a000, 0x00000000)
$ readelf -S ./test1.so | grep '\.data\|\.got'
 [ 6] .data PROGBITS 04db72e4 0002e4 000004 00 WA 0 0 4
 [ 8] .got PROGBITS 04db7358 000358 000010 04 WA 0 0 4
$ readelf -r ./test1.so
Relocation section '.rel.dyn' at offset 0x2bc contains 2 entries:
 Offset Info Type Sym.Value Sym.Name
04db72e4 00000d06 R_386_GLOB_DAT 04db7368 i
04db7364 00000d06 R_386_GLOB_DAT 04db7368 i
$ objdump -s -j .got -j .data test1.so
test1.so: file format elf32-i386
Contents of section .data:
 4db72e4 6873db04 hs..
Contents of section .got:
 4db7358 e8120000 00000000 00000000 6873db04 ............hs..
$ readelf -r ./test | sed '/\.gnu\.conflict/,$!d'
Relocation section '.gnu.conflict' at offset 0x7ac contains 18 entries:
 Offset Info Type Sym.Value Sym.Name + Addend
04db72e4 00000001 R_386_32 04dbb37c
04db7364 00000001 R_386_32 04dbb37c
00c56874 00000001 R_386_32 fffffff0
00c56878 00000001 R_386_32 00000001
00c568bc 00000001 R_386_32 fffffff4
00c56900 00000001 R_386_32 ffffffec
00c56948 00000001 R_386_32 ffffffdc
00c5695c 00000001 R_386_32 ffffffe0
00c56980 00000001 R_386_32 fffffff8
00c56988 00000001 R_386_32 ffffffe4
00c569a4 00000001 R_386_32 ffffffd8
00c569c4 00000001 R_386_32 ffffffe8
00c569d8 00000001 R_386_32 080485b8
00b1f510 00000007 R_386_JUMP_SLOT 00b91460
00b1f514 00000007 R_386_JUMP_SLOT 00b91080
00b1f518 00000007 R_386_JUMP_SLOT 00b91750
00b1f51c 00000007 R_386_JUMP_SLOT 00b912c0
00b1f520 00000007 R_386_JUMP_SLOT 00b91200
$ ./test
0x4dbb37c 0x4dbb37c 0x4dbb37c 0x4dbb37c

Conflict example

In the example, among some conflicts caused by the dynamic linker and the C library, there is a conflict for the symbol i in the test1.so shared library. [ Particularly in the example, the 5 R_386_JUMP_SLOT fixups are PLT slots in the dynamic linker for memory allocator functions resolving to C library functions instead of the dynamic linker’s own trivial implementation.
The first 10 R_386_32 fixups at offsets 0xc56874 to 0xc569c4 are Thread Local Storage fixups in the C library and the fixup at 0xc569d8 is for the _IO_stdin_used weak undefined symbol in the C library, resolving to a symbol with the same name in the executable. ]

test1.so has just itself in its natural symbol lookup scope, as shown by the output of:

    LD_WARN= LD_TRACE_PRELINKING=1 LD_BIND_NOW=1 /lib/ld-linux.so.2 ./test1.so

So when looking up symbol i in this scope, the definition in test1.so is chosen. test1.so has two relocations against the symbol i, one R_386_32 against the .data section and one R_386_GLOB_DAT against the .got section. When prelinking the test1.so library, the dynamic linker stores the address of i (0x4db7368) into both locations (at offsets 0x4db72e4 and 0x4db7364). The global symbol search scope in the test executable contains the executable itself, the test2.so and test1.so libraries, libc.so.6 and the dynamic linker, in the listed order. When symbol i is looked up for test1.so during relocation processing of the whole executable, the address of i in test2.so is returned, as that symbol comes earlier in the global search scope. So, when none of the libraries nor the executable is prelinked, the program prints 4 identical addresses. If prelink didn’t create conflict fixups for the two relocations against the symbol i in test1.so, the prelinked executable (which bypasses normal relocation processing on startup) would print, instead of the desired

    0x4dbb37c 0x4dbb37c 0x4dbb37c 0x4dbb37c

different addresses:

    0x4db7368 0x4dbb37c 0x4db7368 0x4dbb37c

That is a functionality change that prelink cannot be permitted to make, so instead it fixes up the two locations by storing the desired value there. In this case prelink really cannot avoid that: the test1.so shared library could also be used without test2.so in some other executable’s symbol search scope.
Or there could be some executable linked with:

$ gcc -o test2 test.c ./test1.so ./test2.so

Conflict example with swapped order of libraries

where the lookup of i in both test1.so and test2.so is supposed to resolve to i in test1.so.

Now consider what happens if the executable is compiled with -DPRINT_I:

$ gcc -DPRINT_I -o test3 test.c ./test2.so ./test1.so
$ ./test3
0x804972c
0x804972c 0x804972c 0x804972c 0x804972c
$ prelink -N ./test3 ./test1.so ./test2.so
$ readelf -S ./test2.so | grep '\.data\|\.got'
 [ 6] .data PROGBITS 04dbb2f0 0002f0 000004 00 WA 0 0 4
 [ 8] .got PROGBITS 04dbb36c 00036c 000010 04 WA 0 0 4
$ readelf -r ./test2.so
Relocation section '.rel.dyn' at offset 0x2c8 contains 2 entries:
 Offset Info Type Sym.Value Sym.Name
04dbb2f0 00000d06 R_386_GLOB_DAT 04dbb37c i
04dbb378 00000d06 R_386_GLOB_DAT 04dbb37c i
$ objdump -s -j .got -j .data test2.so
test2.so: file format elf32-i386
Contents of section .data:
 4dbb2f0 7cb3db04 |...
Contents of section .got:
 4dbb36c f4120000 00000000 00000000 7cb3db04 ............|...
$ readelf -r ./test3
Relocation section '.rel.dyn' at offset 0x370 contains 4 entries:
 Offset Info Type Sym.Value Sym.Name
08049720 00000e06 R_386_GLOB_DAT 00000000 __gmon_start__
08049724 00000105 R_386_COPY 08049724 j
08049728 00000305 R_386_COPY 08049728 k
0804972c 00000405 R_386_COPY 0804972c i
Relocation section '.rel.plt' at offset 0x390 contains 4 entries:
 Offset Info Type Sym.Value Sym.Name
08049710 00000607 R_386_JUMP_SLOT 080483d8 __libc_start_main
08049714 00000707 R_386_JUMP_SLOT 080483e8 printf
08049718 00000807 R_386_JUMP_SLOT 080483f8 foo
0804971c 00000c07 R_386_JUMP_SLOT 08048408 bar
Relocation section '.gnu.conflict' at offset 0x7f0 contains 20 entries:
 Offset Info Type Sym.Value Sym.Name + Addend
04dbb2f0 00000001 R_386_32 0804972c
04dbb378 00000001 R_386_32 0804972c
04db72e4 00000001 R_386_32 0804972c
04db7364 00000001 R_386_32 0804972c
00c56874 00000001 R_386_32 fffffff0
00c56878 00000001 R_386_32 00000001
00c568bc 00000001 R_386_32 fffffff4
00c56900 00000001 R_386_32 ffffffec
00c56948 00000001 R_386_32 ffffffdc
00c5695c 00000001 R_386_32 ffffffe0
00c56980 00000001 R_386_32 fffffff8
00c56988 00000001 R_386_32 ffffffe4
00c569a4 00000001 R_386_32 ffffffd8
00c569c4 00000001 R_386_32 ffffffe8
00c569d8 00000001 R_386_32 080485f0
00b1f510 00000007 R_386_JUMP_SLOT 00b91460
00b1f514 00000007 R_386_JUMP_SLOT 00b91080
00b1f518 00000007 R_386_JUMP_SLOT 00b91750
00b1f51c 00000007 R_386_JUMP_SLOT 00b912c0
00b1f520 00000007 R_386_JUMP_SLOT 00b91200
$ ./test3
0x804972c
0x804972c 0x804972c 0x804972c 0x804972c

Conflict example with COPY relocation for conflicting symbol

Because the executable is not compiled as position independent code and the main function takes the address of the variable i, the object file for test3 contains a R_386_32 relocation against i. The linker cannot make dynamic relocations against a read-only segment in the executable, so the address of i must be constant. This is accomplished by creating a new object i in the executable’s .dynbss section and creating a dynamic R_386_COPY relocation for it. The relocation ensures that during startup the content of the i object that comes earliest in the search scope, not counting the executable, is copied to this i object in the executable. Now, unlike in the test executable, in the test3 executable the i lookups in both the test1.so and test2.so libraries result in the address of i in the executable (instead of in test2.so).
This means that two conflict fixups are needed again for test1.so (but storing 0x804972c instead of 0x4dbb37c) and two new fixups are needed for test2.so. If the executable is compiled as position independent code:

$ gcc -fpic -DPRINT_I -o test4 test.c ./test2.so ./test1.so
$ ./test4
0x4dbb37c
0x4dbb37c 0x4dbb37c 0x4dbb37c 0x4dbb37c

Conflict example with position independent code in the executable

In that case the address of i is stored in the executable’s .got section, which is writable and thus can have a dynamic relocation against it. So the linker creates a R_386_GLOB_DAT relocation against the .got section, the symbol i is undefined in the executable and no copy relocations are needed. In this case, only test1.so will need 2 fixups; test2.so will not need any.

There are various reasons for conflicts:

• Improperly linked shared libraries. If a shared library always needs symbols from some particular shared library, it should be linked against that library, usually by adding -lLIBNAME to the gcc -shared command line used when linking the shared library. This both reduces conflict fixups in prelink and makes the library easier to load using dlopen, because applications don’t have to remember that they have to load some other library first. The best place to record the dependency is in the shared library itself. Linking against the needed library also matters if that library uses symbol versioning for its symbols: not linking against it can result in a malfunctioning shared library. Prelink issues a warning for such libraries - Warning: library has undefined non-weak symbols. When linking a shared library, the -Wl,-z,defs option can be used to ensure there are no such undefined non-weak symbols. There are exceptions where undefined non-weak symbols in shared libraries are desirable. One exception is when there are multiple shared libraries providing the same functionality, and a shared library doesn’t care which one is used. An example is
libreadline.so.4, which needs some terminal handling functions that are provided by either libtermcap.so.2 or libncurses.so.5. Another exception is with plugins or other shared libraries which expect some symbols to be resolved to symbols defined in the executable.

• A library overriding functionality of some other library. One example is the C library and the POSIX thread library. Older versions of the GNU C library did not provide the cancelable entry points required by the standard in libc.so.6 itself, as cancellation is not needed for non-threaded applications. Instead the libpthread.so.0 shared library, which provides POSIX threading support, overrode those entry points with wrapper functions which provided the required functionality. Although most recent versions of the GNU C library handle cancellation even in the entry points in libc.so.6 (this was needed for cases when libc.so.6 comes before libpthread.so.0 in the symbol search scope, which used to be worked around by non-standard handling of weak symbols in the dynamic linker), because of symbol versioning the symbols had to stay in libpthread.so.0 as well as in libc.so.6. This means every program using POSIX threads on Linux will have a couple of conflict fixups because of this.

• Programs which need copy relocations. Although prelink will resolve the copy relocations at prelinking time, if any shared library has relocations against a symbol which needed a copy relocation, all such relocations will need conflict fixups. Generally, it is better not to export variables from shared libraries in their APIs and to provide accessor functions instead.

• Function pointer equality requirement for functions called from executables. When the address of some global function is taken, at least C and C++ require that this pointer is the same in the whole program.
Executables typically contain position dependent code, so when code in the executable takes the address of some function not defined in the executable itself, that address must be a link time constant. The linker accomplishes this by creating a PLT slot for the function, unless there was one already, and resolving to the address of the PLT slot. The symbol for the function is created with st_value equal to the address of the PLT slot, but st_shndx set to SHN_UNDEF. Such symbols are treated specially by the dynamic linker, in that PLT relocations resolve to the first symbol in the global search scope after the executable, while symbol lookups for all other relocation types return the address of the symbol in the executable. Unfortunately, the GNU linker doesn’t differentiate between taking the address of a function in an executable (especially one for which no dynamic relocation is possible because it is in a read-only segment) and just calling the function without ever taking its address. If it cleared the st_value field of the SHN_UNDEF function symbols when nothing in the executable takes the function’s address, several prelink conflicts could disappear (SHN_UNDEF symbols with st_value set to 0 are always treated as real undefined symbols by the dynamic linker).

• COMDAT code and data in C++. The C++ language has several places where it may need to emit some code or data without a clear unique compilation unit owning it. Examples include taking the address of an inline function, local static variables in inline functions, virtual tables for some classes (this depends on the presence of #pragma interface or #pragma implementation, the presence of a non-inline non-pure-virtual member function in the class, etc.), and RTTI info for them. Compilers and linkers handle these using various COMDAT schemes, e.g. the GNU linker’s .gnu.linkonce* special sections or SHT_GROUP.
Unfortunately, all these duplicate merging schemes work only during linking of shared libraries or executables; no duplicate removal is done across shared libraries. Shared libraries typically have relocations against their COMDAT code or data objects (otherwise, at least in most cases, they wouldn’t be emitted at all), so if there are COMDAT duplicates across shared libraries or the executable, they lead to conflict fixups. The linker could theoretically try to merge COMDAT duplicates across shared libraries if specifically requested by the user (if a COMDAT symbol is already present in one of the dependent shared libraries and is STB_WEAK, the linker could skip it). Unfortunately, this only works as long as the user has full control over the dependent shared libraries, because the COMDAT symbol could be exported from them just as a side effect of their implementation (e.g. they use some class internally). When such libraries are rebuilt even with minor changes in their implementation (unfortunately with C++ shared libraries it is usually not very clear what part is exported ABI and what is not), some of those COMDAT symbols in them could go away (e.g. because they suddenly use a different class internally and the previously used class is not referenced anywhere). When COMDAT objects are not merged across shared libraries, this causes no problems, as each library which needs the COMDAT has its own copy. But with COMDAT duplicate removal between shared libraries there could suddenly be unresolved references and the shared libraries would need to be relinked. The only place where this could work safely is when a single package includes several C++ shared libraries which depend on each other. They are then always shipped together, and when one changes, all the others need changing too.
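The way conflicts arise in the examples of this section can be modelled with a short Python sketch. It is an illustration only, using the addresses from the transcripts above: a relocation needs a conflict fixup whenever the lookup in the library’s natural scope (used when the library was prelinked) disagrees with the lookup in the executable’s global scope. The function names and the data layout are hypothetical, not prelink’s internals.

```python
def lookup(symbol, scope, definitions):
    """Return the address of the first definition of symbol in scope."""
    for obj in scope:
        addr = definitions.get(obj, {}).get(symbol)
        if addr is not None:
            return addr
    return None

def conflict_fixups(relocs, natural_scopes, global_scope, definitions):
    """relocs: list of (object, r_offset, symbol).

    Returns (r_offset, value) pairs that must be stored at startup
    because the prelinked value differs from the desired one.
    """
    fixups = []
    for obj, offset, symbol in relocs:
        prelinked = lookup(symbol, natural_scopes[obj], definitions)
        wanted = lookup(symbol, global_scope, definitions)
        if wanted != prelinked:
            fixups.append((offset, wanted))
    return fixups

# Data taken from the conflict examples in this section.
RELOCS = [
    ("test1.so", 0x4db72e4, "i"), ("test1.so", 0x4db7364, "i"),
    ("test2.so", 0x4dbb2f0, "i"), ("test2.so", 0x4dbb378, "i"),
]
NATURAL = {"test1.so": ["test1.so"], "test2.so": ["test2.so", "test1.so"]}
DEFS = {"test1.so": {"i": 0x4db7368}, "test2.so": {"i": 0x4dbb37c}}
```

For the test executable, with global scope test, test2.so, test1.so, only the two test1.so locations need fixups. Adding a definition of i in the executable, as the copy relocation in test3 does, makes all four locations conflict, matching the first four entries of the .gnu.conflict listing for test3.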
============================================================================
9 Prelink optimizations to reduce number of conflict fixups

Prelink can optimize out some conflict fixups if it can prove that the changes are not observable by the application at runtime (opening its executable and reading it doesn’t count). If there is a data object in some shared library with a symbol that is overridden by a symbol in a different shared library earlier in the global symbol lookup scope or in the executable, then that data object is likely never referenced and it shouldn’t matter what it contains. Examine the following example:

[ The listings of test1.c and test2.c are missing from this extraction. ]

$ cat > test.c <<EOF
#include <stdio.h>
extern struct A { int *a; int *b; int *c; } *y, *z;
int main (void)
{
  printf ("%p: %p %p %p\n", y, y->a, y->b, y->c);
  printf ("%p: %p %p %p\n", z, z->a, z->b, z->c);
}
EOF
$ gcc -nostdlib -shared -fpic -s -o test1.so test1.c
$ gcc -nostdlib -shared -fpic -o test2.so test2.c ./test1.so
$ gcc -o test test.c ./test2.so ./test1.so
$ ./test
0xaf3314: 0xaf33b0 0xaf33a8 0xaf33ac
0xaf3314: 0xaf33b0 0xaf33a8 0xaf33ac

C example where conflict fixups could be optimized out

In this example there are 3 conflict fixups pointing into the 12 byte long x object in the test1.so shared library (among other conflicts). Nothing in the program can look at the content of x in test1.so, simply because it has to reach it through the x symbol, which resolves to test2.so. So in this case prelink could skip those 3 conflicts.
Unfortunately it is not that easy:

[ The listing of test3.c is missing from this extraction. ]

$ cat > test4.c <<EOF
#include <stdio.h>
extern struct A { int *a; int *b; int *c; } *y, *y2, *z;
int main (void)
{
  printf ("%p: %p %p %p\n", y, y->a, y->b, y->c);
  printf ("%p: %p %p %p\n", y2, y2->a, y2->b, y2->c);
  printf ("%p: %p %p %p\n", z, z->a, z->b, z->c);
}
EOF
$ gcc -nostdlib -shared -fpic -s -o test3.so test3.c
$ gcc -nostdlib -shared -fpic -o test4.so test2.c ./test3.so
$ gcc -o test4 test4.c ./test4.so ./test3.so
$ ./test4
0x65a314: 0x65a3b0 0x65a3a8 0x65a3ac
0xbd1328: 0x65a3b0 0x65a3a8 0x65a3ac
0x65a314: 0x65a3b0 0x65a3a8 0x65a3ac

Modified C example where conflict fixups cannot be removed

In this example, there are again 3 conflict fixups pointing into the 12 byte long x object in the test3.so shared library. The fact that the variable local is located at the same 12 bytes is totally invisible to prelink, as local is a STB_LOCAL symbol which doesn’t show up in the .dynsym section. But if those 3 conflict fixups were removed, the program’s observable behavior would suddenly change (the last 3 addresses on the second line would differ from those on the first and third lines).

Fortunately, there are at least some objects where prelink can be reasonably sure they will never be referenced through some local alias. Those are various compiler generated objects with well defined meaning which prelink is able to identify in shared libraries. The most important ones are C++ virtual tables and RTTI data. They are emitted as COMDAT data by the compiler, in GCC into .gnu.linkonce.d.* sections. Data or code in these sections can be accessed only through global symbols, otherwise the linker might create unexpected results when two or more of these sections are merged together (all but one deleted). When prelink is checking for such data, it first checks whether the shared library in question is linked against libstdc++.so. If not, it is not a C++ library (or it is an incorrectly built one) and thus it makes no sense to search any further.
It looks only in the .data section, for STB_WEAK STT_OBJECT symbols whose names start with certain prefixes and where no other symbols (in the dynamic symbol table) point into the objects. [ __vt_ for GCC 2.95.x and 2.96-RH virtual tables, _ZTV for GCC 3.x virtual tables and _ZTI for GCC 3.x RTTI data. ] If these objects are unused because there is a conflict on their symbol, all conflict fixups pointing into the virtual table or RTTI structure can be discarded.

Another possible optimization is again related to C++ virtual tables. Function addresses in them are not intended for pointer comparisons. C++ code only loads them from the virtual tables and calls through the pointer. Pointers to member functions are handled differently. As pointer equivalence is the only reason why all function pointers resolve to PLT slots in the executable even when the executable doesn’t include an implementation of the function (i.e. has a SHN_UNDEF symbol with non-zero st_value pointing at the PLT slot in the executable), prelink can resolve method addresses in virtual tables to the actual method implementation. In many cases this is in the same library as the virtual table (or in one of the libraries in its natural symbol lookup scope), so a conflict fixup is unnecessary. This optimization also speeds up programs after control is transferred to the application, not just the application’s startup, although only by a few cycles per method call.

The conflict fixup reduction is quite big on some programs.
Below are statistics for the kmail program on a completely unprelinked box:

$ LD_DEBUG=statistics /usr/bin/kmail 2>&1 | sed '2,8!d;s/^ *//'
total startup time in dynamic loader: 240724867 clock cycles
time needed for relocation: 234049636 clock cycles (97.2%)
number of relocations: 34854
number of relocations from cache: 74364
number of relative relocations: 35351
time needed to load objects: 6241678 clock cycles (2.5%)
$ ls -l /usr/bin/kmail
-rwxr-xr-x 1 root root 2149084 Oct 2 12:05 /usr/bin/kmail
$ ( Xvfb :3 & ) >/dev/null 2>&1 /dev/null 2>&1 /dev/null 2>&1 &1 | sed '2,8!d;s/^ *//'
total startup time in dynamic loader: 8409504 clock cycles
time needed for relocation: 3024720 clock cycles (35.9%)
number of relocations: 0
number of relocations from cache: 8961
number of relative relocations: 0
time needed to load objects: 4897336 clock cycles (58.2%)
$ ls -l /usr/bin/kmail
-rwxr-xr-x 1 root root 2269500 Oct 2 12:05 /usr/bin/kmail
$ ( Xvfb :3 & ) >/dev/null 2>&1 /dev/null 2>&1 /dev/null 2>&1 &1 | sed '2,8!d;s/^ *//'
total startup time in dynamic loader: 9704168 clock cycles
time needed for relocation: 4734715 clock cycles (48.7%)
number of relocations: 0
number of relocations from cache: 59871
number of relative relocations: 0
time needed to load objects: 4487971 clock cycles (46.2%)
$ ls -l /usr/bin/kmail
-rwxr-xr-x 1 root root 2877360 Oct 2 12:05 /usr/bin/kmail
$ ( Xvfb :3 & ) >/dev/null 2>&1 /dev/null 2>&1 /dev/null 2>&1 test1.c <&1 \
| sed '/^===/,/^===/!d;/^===/d;s/\.rel\.dyn/. += 512; &/' > test1.lds
$ gcc -s -O2 -o test1 test1.c -Wl,-T,test1.lds
$ readelf -Sl ./test1 | sed -e "$SEDCMD" -e "$SEDCMD2"
[Nr] Name              Type        Addr     Off    Size   ES Flg Lk Inf Al
[ 0]                   NULL        00000000 000000 000000 00      0   0  0
[ 1] .interp           PROGBITS    08048114 000114 000013 00   A  0   0  1
[ 2] .note.ABI-tag     NOTE        08048128 000128 000020 00   A  0   0  4
[ 3] .hash             HASH        08048148 000148 000024 04   A  4   0  4
[ 4] .dynsym           DYNSYM      0804816c 00016c 000040 10   A  5   1  4
[ 5] .dynstr           STRTAB      080481ac 0001ac 000045 00   A  0   0  1
[ 6] .gnu.version      VERSYM      080481f2 0001f2 000008 02   A  4   0  2
[ 7] .gnu.version_r    VERNEED     080481fc 0001fc 000020 00   A  5   1  4
[ 8] .rel.dyn          REL         0804841c 00041c 000008 08   A  4   0  4
[ 9] .rel.plt          REL         08048424 000424 000008 08   A  4   b  4
[10] .init             PROGBITS    0804842c 00042c 000017 00  AX  0   0  4
...
[22] .bss              NOBITS      080496f8 0006f8 000004 00  WA  0   0  4
[23] .comment          PROGBITS    00000000 0006f8 000132 00      0   0  1
[24] .shstrtab         STRTAB      00000000 00082a 0000be 00      0   0  1
Type    Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
PHDR    0x000034 0x08048034 0x08048034 0x000e0 0x000e0 R E 0x4
INTERP  0x000114 0x08048114 0x08048114 0x00013 0x00013 R   0x1
    [Requesting program interpreter: /lib/ld-linux.so.2]
LOAD    0x000000 0x08048000 0x08048000 0x005fc 0x005fc R E 0x1000
LOAD    0x0005fc 0x080495fc 0x080495fc 0x000fc 0x00100 RW  0x1000
DYNAMIC 0x000608 0x08049608 0x08049608 0x000c8 0x000c8 RW  0x4
NOTE    0x000128 0x08048128 0x08048128 0x00020 0x00020 R   0x4
STACK   0x000000 0x00000000 0x00000000 0x00000 0x00000 RW  0x4
$ prelink -N ./test1
$ readelf -Sl ./test1 | sed -e "$SEDCMD" -e "$SEDCMD2"
[Nr] Name              Type        Addr     Off    Size   ES Flg Lk Inf Al
[ 0]                   NULL        00000000 000000 000000 00      0   0  0
[ 1] .interp           PROGBITS    08048114 000114 000013 00   A  0   0  1
[ 2] .note.ABI-tag     NOTE        08048128 000128 000020 00   A  0   0  4
[ 3] .hash             HASH        08048148 000148 000024 04   A  4   0  4
[ 4] .dynsym           DYNSYM      0804816c 00016c 000040 10   A  8   1  4
[ 5] .gnu.liblist      GNU_LIBLIST 080481ac 0001ac 000028 14   A  8   0  4
[ 6] .gnu.version      VERSYM      080481f2 0001f2 000008 02   A  4   0  2
[ 7] .gnu.version_r    VERNEED     080481fc 0001fc 000020 00   A  8   1  4
[ 8] .dynstr           STRTAB      0804821c 00021c 000058 00   A  0   0  1
[ 9] .gnu.conflict     RELA        08048274 000274 0000c0 0c   A  4   0  4
[10] .rel.dyn          REL         0804841c 00041c 000008 08   A  4   0  4
[11] .rel.plt          REL         08048424 000424 000008 08   A  4   d  4
[12] .init             PROGBITS    0804842c 00042c 000017 00  AX  0   0  4
...
[24] .bss              NOBITS      080496f8 0006f8 000004 00  WA  0   0  4
[25] .comment          PROGBITS    00000000 0006f8 000132 00      0   0  1
[26] .gnu.prelink_undo PROGBITS    00000000 00082c 0004d4 01      0   0  4
[27] .shstrtab         STRTAB      00000000 000d00 0000eb 00      0   0  1
Type    Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
PHDR    0x000034 0x08048034 0x08048034 0x000e0 0x000e0 R E 0x4
INTERP  0x000114 0x08048114 0x08048114 0x00013 0x00013 R   0x1
    [Requesting program interpreter: /lib/ld-linux.so.2]
LOAD    0x000000 0x08048000 0x08048000 0x005fc 0x005fc R E 0x1000
LOAD    0x0005fc 0x080495fc 0x080495fc 0x000fc 0x00100 RW  0x1000
DYNAMIC 0x000608 0x08049608 0x08049608 0x000c8 0x000c8 RW  0x4
NOTE    0x000128 0x08048128 0x08048128 0x00020 0x00020 R   0x4
STACK   0x000000 0x00000000 0x00000000 0x00000 0x00000 RW  0x4

Figure 4: Reshuffling of an executable with a gap between sections

In the above sample, there was enough space between sections (particularly between the end of the .gnu.version_r section and the start of .rel.dyn) that the new sections could be added there.
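The gap can be checked directly from the two section listings. The following is only a sketch of that arithmetic: the addresses and sizes are copied from the readelf output above, while the variable names and the helper end() are made up for this illustration and are not part of prelink.

```python
# Values copied from the two readelf listings above (addresses and sizes in bytes).
VERSION_R = (0x080481FC, 0x20)     # .gnu.version_r, same before and after prelinking
REL_DYN_START = 0x0804841C         # .rel.dyn, same before and after prelinking
GNU_VERSION_START = 0x080481F2     # .gnu.version, the section after the old .dynstr

# Sections as they appear after prelinking.
AFTER = {
    ".gnu.liblist": (0x080481AC, 0x28),   # placed where .dynstr used to start
    ".dynstr": (0x0804821C, 0x58),        # grown and moved into the gap
    ".gnu.conflict": (0x08048274, 0xC0),  # new, placed right behind .dynstr
}


def end(section):
    """Return the first address past a (start address, size) pair."""
    addr, size = section
    return addr + size


gap_start = end(VERSION_R)                 # where the free space begins
gap_size = REL_DYN_START - gap_start       # free bytes in front of .rel.dyn
gap_used = end(AFTER[".gnu.conflict"]) - gap_start
liblist_fits = end(AFTER[".gnu.liblist"]) <= GNU_VERSION_START

print(f"gap starts at {gap_start:#010x} and is {gap_size} bytes long")
print(f"bytes of the gap used by .dynstr and .gnu.conflict: {gap_used}")
print(f".gnu.liblist fits in front of .gnu.version: {liblist_fits}")
```

The gap size computed this way equals the 512 bytes that the `. += 512;` statement in the example's sed command places in front of .rel.dyn.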
$ SEDCMD='s/^.* \.plt.*$/.../;/\[.*\.text/,/\[.*\.got/d'
$ SEDCMD2='/Section to Segment/,$d;/^Key to/,/^Program/d;/^[A-Z]/d;/^ *$/d'
$ cat > test2.c < test3.c < test4.c <&1 \
| sed '/^===/,/^===/!d;/^===/d;s/0x08048000/0x08000000/' > test4.lds
$ gcc -s -O2 -o test4 test4.c -Wl,-T,test4.lds
$ readelf -Sl ./test4 | sed -e "$SEDCMD" -e "$SEDCMD2"
[Nr] Name              Type        Addr     Off    Size   ES Flg Lk Inf Al
[ 0]                   NULL        00000000 000000 000000 00      0   0  0
[ 1] .interp           PROGBITS    08000114 000114 000013 00   A  0   0  1
[ 2] .note.ABI-tag     NOTE        08000128 000128 000020 00   A  0   0  4
[ 3] .hash             HASH        08000148 000148 000024 04   A  4   0  4
[ 4] .dynsym           DYNSYM      0800016c 00016c 000040 10   A  5   1  4
[ 5] .dynstr           STRTAB      080001ac 0001ac 000045 00   A  0   0  1
[ 6] .gnu.version      VERSYM      080001f2 0001f2 000008 02   A  4   0  2
[ 7] .gnu.version_r    VERNEED     080001fc 0001fc 000020 00   A  5   1  4
[ 8] .rel.dyn          REL         0800021c 00021c 000008 08   A  4   0  4
[ 9] .rel.plt          REL         08000224 000224 000008 08   A  4   b  4
[10] .init             PROGBITS    0800022c 00022c 000017 00  AX  0   0  4
...
[22] .bss              NOBITS      08001500 000500 004020 00  WA  0   0 32
[23] .comment          PROGBITS    00000000 000500 000132 00      0   0  1
[24] .shstrtab         STRTAB      00000000 000632 0000be 00      0   0  1
Type    Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
PHDR    0x000034 0x08000034 0x08000034 0x000e0 0x000e0 R E 0x4
INTERP  0x000114 0x08000114 0x08000114 0x00013 0x00013 R   0x1
    [Requesting program interpreter: /lib/ld-linux.so.2]
LOAD    0x000000 0x08000000 0x08000000 0x003fc 0x003fc R E 0x1000
LOAD    0x0003fc 0x080013fc 0x080013fc 0x000fc 0x04124 RW  0x1000
DYNAMIC 0x000408 0x08001408 0x08001408 0x000c8 0x000c8 RW  0x4
NOTE    0x000128 0x08000128 0x08000128 0x00020 0x00020 R   0x4
STACK   0x000000 0x00000000 0x00000000 0x00000 0x00000 RW  0x4
$ prelink -N ./test4
$ readelf -Sl ./test4 | sed -e "$SEDCMD" -e "$SEDCMD2"
[Nr] Name              Type        Addr     Off    Size   ES Flg Lk Inf Al
[ 0]                   NULL        00000000 000000 000000 00      0   0  0
[ 1] .interp           PROGBITS    08000134 000134 000013 00   A  0   0  1
[ 2] .note.ABI-tag     NOTE        08000148 000148 000020 00   A  0   0  4
[ 3] .hash             HASH        08000168 000168 000024 04   A  4   0  4
[ 4] .dynsym           DYNSYM      0800018c 00018c 000040 10   A 22   1  4
[ 5] .gnu.version      VERSYM      080001f2 0001f2 000008 02   A  4   0  2
[ 6] .gnu.version_r    VERNEED     080001fc 0001fc 000020 00   A 22   1  4
[ 7] .rel.dyn          REL         0800021c 00021c 000008 08   A  4   0  4
[ 8] .rel.plt          REL         08000224 000224 000008 08   A  4   a  4
[ 9] .init             PROGBITS    0800022c 00022c 000017 00  AX  0   0  4
...
[21] .bss              NOBITS      08001500 0004f8 004020 00  WA  0   0 32
[22] .dynstr           STRTAB      080064f8 0004f8 000058 00   A  0   0  1
[23] .gnu.liblist      GNU_LIBLIST 08006550 000550 000028 14   A 22   0  4
[24] .gnu.conflict     RELA        08006578 000578 0000c0 0c   A  4   0  4
[25] .comment          PROGBITS    00000000 000638 000132 00      0   0  1
[26] .gnu.prelink_undo PROGBITS    00000000 00076c 0004d4 01      0   0  4
[27] .shstrtab         STRTAB      00000000 000c40 0000eb 00      0   0  1
Type    Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
PHDR    0x000034 0x08000034 0x08000034 0x000e0 0x000e0 R E 0x4
INTERP  0x000134 0x08000134 0x08000134 0x00013 0x00013 R   0x1
    [Requesting program interpreter: /lib/ld-linux.so.2]
LOAD    0x000000 0x08000000 0x08000000 0x003fc 0x003fc R E 0x1000
LOAD    0x0003fc 0x080013fc 0x080013fc 0x000fc 0x04124 RW  0x1000
LOAD    0x0004f8 0x080064f8 0x080064f8 0x00140 0x00140 RW  0x1000
DYNAMIC 0x000408 0x08001408 0x08001408 0x000c8 0x000c8 RW  0x4
NOTE    0x000148 0x08000148 0x08000148 0x00020 0x00020 R   0x4
STACK   0x000000 0x00000000 0x00000000 0x00000 0x00000 RW  0x4

Figure 7: Reshuffling of an executable with addition of a new segment

In the last example, the base address was not decreased; instead a new PT_LOAD segment was added.

R_<arch>_COPY relocations are typically against the first part of the SHT_NOBITS .bss section. To apply them, prelink first needs to change their section to SHT_PROGBITS, but as the .bss section typically occupies a much larger part of memory, it is not desirable to convert the .bss section into SHT_PROGBITS as a whole.
A section cannot be partly SHT_PROGBITS and partly SHT_NOBITS, so prelink first splits the section into two parts: .dynbss, which covers the area from the start of the .bss section up to the highest byte to which some COPY relocation is applied, and the old .bss. The first is converted to SHT_PROGBITS and its size is decreased; the latter stays SHT_NOBITS, and its start address and file offset are adjusted and its size decreased as well.

The dynamic linker handles relocations in the executable last, so prelink cannot just copy memory from the shared library in which the symbol of the COPY relocation has been looked up. There might be relocations applied to the objects by the dynamic linker in normal relocation processing, so prelink first has to process the relocations against that memory area. Relocations which don’t need conflict fixups are already applied, so prelink just needs to apply the conflict fixups against the memory area and then copy it to the newly created .dynbss section.

Here is an example which shows various things which COPY relocation handling in prelink needs to deal with:

$ cat > test1.c < test.c <
struct A { char a; struct A *b; int *c; int *d; };
int bar, *addr (void), big[8192];
extern struct A foo;
int main (void)
{
  printf ("%p: %d %p %p %p %p %p\n", &foo, foo.a, foo.b, foo.c, foo.d,
          &bar, addr ());
}
EOF
$ gcc -nostdlib -shared -fpic -s -o test1.so test1.c
$ gcc -s -o test test.c ./test1.so
$ ./test
0x80496c0: 1 0x80496c0 0x80516e0 0x4833a4 0x80516e0 0x4833a4
$ readelf -r test | sed '/\.rel\.dyn/,/\.rel\.plt/!d;/^0/!d'
080496ac 00000c06 R_386_GLOB_DAT 00000000 __gmon_start__
080496c0 00000605 R_386_COPY     080496c0 foo
$ readelf -S test | grep bss
[22] .bss    NOBITS   080496c0 0006c0 008024 00 WA 0 0 32
$ prelink -N ./test ./test1.so
$ readelf -s test | grep foo
6: 080496c0 16 OBJECT GLOBAL DEFAULT 25 foo
$ readelf -s test1.so | grep foo
15: 004a9314 16 OBJECT GLOBAL DEFAULT 6 foo
$ readelf -r test | sed '/.gnu.conflict/,/\.rel\.dyn/!d;/^0/!d'
004a9318 00000001 R_386_32        080496c0
004a931c 00000001 R_386_32        080516e0
005f9874 00000001 R_386_32        fffffff0
005f9878 00000001 R_386_32        00000001
005f98bc 00000001 R_386_32        fffffff4
005f9900 00000001 R_386_32        ffffffec
005f9948 00000001 R_386_32        ffffffdc
005f995c 00000001 R_386_32        ffffffe0
005f9980 00000001 R_386_32        fffffff8
005f9988 00000001 R_386_32        ffffffe4
005f99a4 00000001 R_386_32        ffffffd8
005f99c4 00000001 R_386_32        ffffffe8
005f99d8 00000001 R_386_32        08048584
004c2510 00000007 R_386_JUMP_SLOT 00534460
004c2514 00000007 R_386_JUMP_SLOT 00534080
004c2518 00000007 R_386_JUMP_SLOT 00534750
004c251c 00000007 R_386_JUMP_SLOT 005342c0
004c2520 00000007 R_386_JUMP_SLOT 00534200
$ objdump -s -j .dynbss test
test: file format elf32-i386
Contents of section .dynbss:
 80496c0 01000000 c0960408 e0160508 a4934a00 ..............J.
$ objdump -s -j .data test1.so
test1.so: file format elf32-i386
Contents of section .data:
 4a9314 01000000 14934a00 a8934a00 a4934a00 ......J...J...J.
$ readelf -S test | grep bss
[24] .dynbss PROGBITS 080496c0 0016c0 000010 00 WA 0 0 32
[25] .bss    NOBITS   080496d0 0016d0 008014 00 WA 0 0 32
$ sed 's/8192/1/' test.c > test2.c
$ gcc -s -o test2 test2.c ./test1.so
$ readelf -S test2 | grep bss
[22] .bss    NOBITS   080496b0 0006b0 00001c 00 WA 0 0 8
$ prelink -N ./test2 ./test1.so
$ readelf -S test2 | grep bss
[22] .dynbss PROGBITS 080496b0 0006b0 000010 00 WA 0 0 8
[23] .bss    PROGBITS 080496c0 0006c0 00000c 00 WA 0 0 8

Relocation handling of .dynbss objects

Because the test executable is not compiled as position independent code and takes the address of the foo variable, a COPY relocation is needed to avoid a dynamic relocation against the executable’s read-only PT_LOAD segment. The foo object in test1.so has one field with no relocations applied at all, one with a relocation against the variable itself, one with a relocation which needs a conflict fixup (as it is overridden by the variable in the executable) and one with a relocation which doesn’t need any fixups.
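As a cross-check of the two objdump listings above, the sixteen bytes of foo can be decoded field by field. This is only a sketch: it assumes the usual IA-32 layout of struct A (one char, three bytes of padding, then three 32-bit little-endian pointers), and the names LAYOUT and decode are made up for this illustration.

```python
import struct

# Assumed IA-32 layout of struct A { char a; struct A *b; int *c; int *d; }:
# one byte, three padding bytes, then three 32-bit little-endian pointers.
LAYOUT = struct.Struct("<B3xIII")
FIELDS = ("a", "b", "c", "d")


def decode(hex_words):
    """Decode one objdump data row (hex words only) into the four fields."""
    raw = bytes.fromhex(hex_words.replace(" ", ""))
    return dict(zip(FIELDS, LAYOUT.unpack(raw)))


# Hex words copied from the objdump output above.
in_library = decode("01000000 14934a00 a8934a00 a4934a00")     # .data of test1.so
in_executable = decode("01000000 c0960408 e0160508 a4934a00")  # .dynbss of test

changed = [name for name in FIELDS if in_library[name] != in_executable[name]]

for name in FIELDS:
    print(f"foo.{name}: test1.so {in_library[name]:#010x}  test {in_executable[name]:#010x}")
print("fields that differ:", changed)
```

The two differing values in the executable's copy are the same values that the first two .gnu.conflict entries (at 004a9318 and 004a931c) store into the library's copy of foo.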
The first and last fields already contain the right values in the prelinked test1.so, while the second and third need to be changed to the symbol addresses in the executable (as shown in the objdump output). The conflict fixups against foo in test1.so need to stay (unless it is a C++ virtual table or RTTI data, which is not the case in this testcase). In test, prelink changed .dynbss to SHT_PROGBITS and kept .bss as SHT_NOBITS, while in the slightly modified testcase (test2) the size of .bss was small enough that prelink chose to make it SHT_PROGBITS too, grow the read-write PT_LOAD segment and put the .dynstr and .gnu.conflict sections after it.

============================================================================
12 Prelink undo operation

Prelinking of shared libraries and executables is designed to be reversible, so that a prelink operation followed by an undo operation generates a file bitwise identical to the original before prelinking. For this purpose prelink stores the original ELF header, all the program headers and all the section headers in a .gnu.prelink_undo section before it starts prelinking an unprelinked executable or shared library.

When undoing the modifications, prelink first has to convert RELA back to REL if REL to RELA conversion was done during prelinking, and relocate all allocated sections above it down to adjust for the section shrink. Relocation types which were changed when trying to avoid REL to RELA conversion need to be changed back (e.g. on IA-32, it is assumed that R_386_GLOB_DAT relocations should be only those against the .got section, with R_386_32 relocations in the remaining places). On RELA architectures, the memory pointed to by the r_offset field of the relocations needs to be reinitialized to the values originally stored there by the linker. For prelink it doesn’t matter much what this value is (e.g. always 0, a copy of r_addend, etc.), as long as it is computable from the information prelink has during the undo operation.
[ Such as relocation type, r_addend value, type, binding, flags or other attributes of the relocation’s symbol, what section the relocation points into or the offset within the section it points to. ]

The GNU linker had to be changed on several architectures so that it stores such a value there, as in several places the value depended e.g. on the original addend before final link (which is not available anywhere after final link time, since the r_addend field could be adjusted during the final link). If the second word of the .got section has been modified, it needs to be reverted to the original value (zero on most architectures). In executables, sections which were moved during prelinking need to be put back and segments added while prelinking must be removed.

There are 3 different ways an undo operation can be performed:

• Undoing individual executables or shared libraries specified on the command line in place (i.e. when the undo operation is successful, the prelinked executable or library is atomically replaced with the undone object).

• With the -o option, only a single executable or shared library given on the command line is undone and stored to the file specified as the -o option’s argument.

• With the -ua options, prelink builds a list of executables in the paths written in its config file (plus directories and executables or libraries from the command line) and all shared libraries these executables depend on. All executables and libraries in the list are then unprelinked. This option is used to unprelink the whole system. It is not perfect and needs to be worked on: for example, if some executable uses a shared library which no other executable links against, and this executable (and shared library) is prelinked, then the executable is removed (e.g. uninstalled) but the shared library is kept, the shared library is not unprelinked unless it is specifically mentioned on the command line.
============================================================================
13 Verification of prelinked files

As prelink needs to modify executables and shared libraries installed on a system, it complicates system integrity verification (e.g. rpm -V, TripWire). These systems store checksums of installed files in a database and during verification compute them again and compare them to the stored values. On a prelinked system most of the executables and shared libraries would be reported as modified. Prelink offers a special mode for these systems, in which it verifies that unprelinking the executable or shared library followed by immediate prelinking (with the same base address) creates output bitwise identical to the executable or shared library being verified. Furthermore, depending on other prelink options, it either writes the unprelinked image to its standard output or computes an MD5 or SHA1 digest of this unprelinked image. Merely undoing a file and checksumming the result is not good enough, since an intruder could have modified e.g. conflict fixups or memory which relocations point at, changing the behavior of the program while the file after unprelinking would be unmodified.

During verification, both the prelink executable and the dynamic linker are used, so a proper system integrity verification first checks that the prelink executable (which is statically linked for this reason) hasn’t been modified, then uses prelink --verify to verify the dynamic linker (when verifying ld.so the dynamic linker is not executed), followed by verification of other executables and libraries.

Verification requires all dependencies of the checked object to be unmodified since the last prelinking. If some dependency has been changed or is missing, prelink will report it and return with a non-zero exit status.
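The difference between a plain undo-and-checksum and the verification mode can be illustrated with a toy model. Everything in the sketch below is hypothetical: toy_prelink, toy_undo, naive_check and verify are stand-ins invented for this illustration, a "prelinked file" is modelled as the original bytes plus some data derived from the dependencies, and none of it reflects how prelink is actually implemented.

```python
import hashlib


def toy_prelink(original, deps_state):
    """Toy prelink: keep the original image and attach dependency-derived data."""
    # The digest merely stands in for conflict fixups and similar prelink data.
    fixups = hashlib.sha1(original + deps_state).digest()
    return {"undo": original, "fixups": fixups}


def toy_undo(prelinked):
    """Toy undo: recover the original image, ignoring the prelink data."""
    return prelinked["undo"]


def naive_check(prelinked, recorded_digest):
    """Undo and checksum only; blind to changes in the prelink data."""
    return hashlib.sha1(toy_undo(prelinked)).hexdigest() == recorded_digest


def verify(prelinked, deps_state, recorded_digest):
    """Undo, prelink again, require an identical result, then checksum."""
    unprelinked = toy_undo(prelinked)
    if toy_prelink(unprelinked, deps_state) != prelinked:
        return False
    return hashlib.sha1(unprelinked).hexdigest() == recorded_digest


original = b"program image"
deps = b"state of the dependencies at prelink time"
digest = hashlib.sha1(original).hexdigest()   # what the integrity database holds

good = toy_prelink(original, deps)
tampered = dict(good, fixups=b"\x00" * 20)    # intruder alters only the fixups

print("naive check on tampered file:", naive_check(tampered, digest))
print("verify on tampered file:     ", verify(tampered, deps, digest))
print("verify on untouched file:    ", verify(good, deps, digest))
print("verify after a dependency changed:", verify(good, b"new state", digest))
```

In this toy model, as in the text above, the check only gives a meaningful answer while every dependency is unchanged since the last prelinking.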
This is because prelinking depends on their content, so if they are modified, the executable or shared library might differ from the one produced by unprelinking followed by prelinking again.

In the future, it might be possible to verify even executables or shared libraries whose dependencies have been modified, under the assumption that in such a case the prelink information will not be used. It would just need to verify that nothing other than the information used only when dependencies are up to date has changed between the executable or library on the filesystem and the file after an unprelink followed by prelink cycle. The prelink operation would need to be modified in this case, so that no information is collected from the dynamic linker, the list of dependencies is assumed to be the one stored in the executable, and it is expected to have an identical number of conflict fixups.

============================================================================
14 Measurements

There are two areas where prelink can speed things up noticeably. The primary one is certainly the startup time of big GUI applications, where the dynamic linker spends from 100ms up to a few seconds before giving control to the application. The other is when lots of small programs are started whose execution time is rather short, so that the startup time which prelink optimizes is a noticeable fraction of the total time. This is typical for shell scripting.

The first numbers are from the lmbench benchmark, version 3.0-a3. Most of the benchmarks in the lmbench suite measure kernel speed, so it doesn’t matter much whether prelink is used or not. Only in the lat_proc benchmark does prelink show up visibly. This benchmark measures 3 different things:

• fork proc, which is fork() followed by immediate exit(1) in the child and wait(0) in the parent. The results are (as expected) about the same between unprelinked and prelinked systems.

• exec proc, i.e.
fork() followed by immediate close(1) and execve() of a simple hello world program (this program is compiled and linked during the benchmark into a temporary directory and is never prelinked). The numbers are 160µs to 200µs better on prelinked systems, because there is no relocation processing needed initially in the dynamic linker and because all relative relocations in libc.so.6 can be skipped.

• sh proc, i.e. fork() followed by immediate close(1) and execlp("/bin/sh", "sh", "-c", "/tmp/hello", 0). Although the hello world program is not prelinked in this case either, the shell is, so out of the 900µs to 1000µs speedup less than 200µs can be attributed to the speedup of the hello world program, as in the exec proc benchmark, and the rest to the speedup of shell startup.

The first 4 rows are from running the benchmark on a fully unprelinked system, the last 4 rows from the same system, but fully prelinked.

LMBENCH 3.0 SUMMARY
-----------------------------------
(Alpha software, do not distribute)

Processor, Processes - times in microseconds - smaller is better
------------------------------------------------------------------------
Host OS            Mhz null null      open slct sig  sig  fork exec sh
                       call  I/O stat clos TCP  inst hndl proc proc proc
---- ------------ ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
pork Linux 2.4.22  651 0.53 0.97 6.20 8.10 41.2 1.44 4.30 276. 1497 5403
pork Linux 2.4.22  651 0.53 0.95 6.14 7.91 37.8 1.43 4.34 274. 1486 5391
pork Linux 2.4.22  651 0.56 0.94 6.18 8.09 43.4 1.41 4.30 251. 1507 5423
pork Linux 2.4.22  651 0.53 0.94 6.12 8.09 41.0 1.43 4.40 256. 1497 5385
pork Linux 2.4.22  651 0.56 0.94 5.79 7.58 39.1 1.41 4.30 271. 1319 4460
pork Linux 2.4.22  651 0.56 0.92 5.76 7.40 38.9 1.41 4.30 253. 1304 4417
pork Linux 2.4.22  651 0.56 0.95 6.20 7.83 37.7 1.41 4.37 248. 1323 4481
pork Linux 2.4.22  651 0.56 1.01 6.04 7.77 37.9 1.43 4.32 256. 1324 4457

lmbench results without and with prelinking

Below is a sample timing of a 239K long configure shell script from GCC on both an unprelinked and a prelinked system. The preparation step was the following:

$ cd;
$ cvs -d :pserver:anoncvs@subversions.gnu.org:/cvsroot/gcc login
# Empty password
$ cvs -d :pserver:anoncvs@subversions.gnu.org:/cvsroot/gcc -z3 co \
  -D20031103 gcc
$ mkdir ~/gcc/obj
$ cd ~/gcc/obj;
$ ../configure i386-redhat-linux; make configure-gcc

Preparation script for shell script tests

On an unprelinked system, the results were:

$ cd ~/gcc/obj/gcc
$ for i in 1 2; do ./config.status --recheck > /dev/null 2>&1; done
$ for i in 1 2 3 4; do time ./config.status --recheck > /dev/null 2>&1; done

real 0m4.436s
user 0m1.730s
sys 0m1.260s

real 0m4.409s
user 0m1.660s
sys 0m1.340s

real 0m4.431s
user 0m1.810s
sys 0m1.300s

real 0m4.432s
user 0m1.670s
sys 0m1.210s

Shell script test results on unprelinked system

and on a fully prelinked system:

$ cd ~/gcc/obj/gcc
$ for i in 1 2; do ./config.status --recheck > /dev/null 2>&1; done
$ for i in 1 2 3 4; do time ./config.status --recheck > /dev/null 2>&1; done

real 0m4.126s
user 0m1.590s
sys 0m1.240s

real 0m4.151s
user 0m1.620s
sys 0m1.230s

real 0m4.161s
user 0m1.600s
sys 0m1.190s

real 0m4.122s
user 0m1.570s
sys 0m1.230s

Shell script test results on prelinked system

Now for the timing of a few big GUI programs. All timings were done without an X server running and with the DISPLAY environment variable not set (so that when control is transferred to the application, it very soon finds out there is no X server it can talk to and bails out). The measurements are done by the dynamic linker in ticks on a 651MHz dual Pentium III machine, i.e. ticks have to be divided by 651000000 to get times in seconds. Each application was run 4 times and the run with the smallest total time spent in the dynamic linker was chosen.
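The conversion from ticks to seconds described above is a single division. The sketch below applies it to the "total startup time in dynamic loader" values reported in the dynamic linker statistics later in this section, and also prints the prelinked time as a percentage of the unprelinked one. The clock rate and tick counts come from the text; the names CPU_HZ, TOTALS and ticks_to_seconds are made up for this illustration.

```python
CPU_HZ = 651_000_000  # 651MHz Pentium III, as stated in the text

# (unprelinked, prelinked) total startup time in the dynamic loader, in clock
# cycles, copied from the LD_DEBUG=statistics output for each program.
TOTALS = {
    "epiphany": (67336593, 7313721),
    "evolution": (54382122, 9176140),
    "konqueror": (131985703, 5533696),
    "kword": (153684591, 6516635),
}


def ticks_to_seconds(ticks, hz=CPU_HZ):
    """Convert a clock cycle count into seconds for a CPU running at hz."""
    return ticks / hz


ratios = {}
for name, (before, after) in TOTALS.items():
    ratios[name] = after / before
    print(
        f"{name:10s} {ticks_to_seconds(before) * 1000:7.1f} ms -> "
        f"{ticks_to_seconds(after) * 1000:5.1f} ms "
        f"({100 * ratios[name]:.1f}% of the unprelinked time)"
    )
```

The percentages computed this way can be compared with the ranges quoted after the statistics: roughly 11% to 17% for the Gtk+ applications and around 4.2% for the KDE applications.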
The Epiphany WWW browser and the Evolution mail client were chosen as examples of Gtk+ applications (they typically use a great many shared libraries, but many of them are quite small, there are not many relocations or conflict fixups and most of the libraries are written in C). The Konqueror WWW browser and the KWord word processor were chosen as examples of KDE applications (they typically use somewhat fewer shared libraries, though still a lot; most of the shared libraries are written in C++, have many relocations and cause many conflict fixups, especially without the C++ conflict fixup optimizations in prelink).

On the non-prelinked system, the timings are done with lazy binding, i.e. without LD_BIND_NOW=1 set in the environment. This is how people generally run programs, but it is not an exact apples to apples comparison, since on a prelinked system there is no lazy binding except in shared libraries loaded through dlopen. So when control is passed to the application, prelinked programs should be slightly faster for a while, since non-prelinked programs have to do symbol lookups and process relocations (and on various architectures flush instruction caches) whenever they call a function they haven’t called before in a particular shared library or in the executable.
$ ldd `which epiphany-bin` | wc -l
64
$ # Unprelinked system
$ LD_DEBUG=statistics epiphany-bin 2>&1 | sed 's/^ *//'
18960: runtime linker statistics:
18960:
18960: total startup time in dynamic loader: 67336593 clock cycles
18960: time needed for relocation: 58119983 clock cycles (86.3%)
18960: number of relocations: 6999
18960: number of relocations from cache: 4770
18960: number of relative relocations: 31494
18960: time needed to load objects: 8696104 clock cycles (12.9%)
(epiphany-bin:18960): Gtk-WARNING **: cannot open display:
18960: runtime linker statistics:
18960:
18960: final number of relocations: 7692
18960: final number of relocations from cache: 4770
$ # Prelinked system
$ LD_DEBUG=statistics epiphany-bin 2>&1 | sed 's/^ *//'
25697: runtime linker statistics:
25697:
25697: total startup time in dynamic loader: 7313721 clock cycles
25697: time needed for relocation: 565680 clock cycles (7.7%)
25697: number of relocations: 0
25697: number of relocations from cache: 1205
25697: number of relative relocations: 0
25697: time needed to load objects: 6179467 clock cycles (84.4%)
(epiphany-bin:25697): Gtk-WARNING **: cannot open display:
25697: runtime linker statistics:
25697:
25697: final number of relocations: 31
25697: final number of relocations from cache: 1205

$ ldd `which evolution` | wc -l
68
$ # Unprelinked system
$ LD_DEBUG=statistics evolution 2>&1 | sed 's/^ *//'
19042: runtime linker statistics:
19042:
19042: total startup time in dynamic loader: 54382122 clock cycles
19042: time needed for relocation: 43403190 clock cycles (79.8%)
19042: number of relocations: 3452
19042: number of relocations from cache: 2885
19042: number of relative relocations: 34957
19042: time needed to load objects: 10450142 clock cycles (19.2%)
(evolution:19042): Gtk-WARNING **: cannot open display:
19042: runtime linker statistics:
19042:
19042: final number of relocations: 4075
19042: final number of relocations from cache: 2885
$ # Prelinked system
$ LD_DEBUG=statistics evolution 2>&1 | sed 's/^ *//'
25723: runtime linker statistics:
25723:
25723: total startup time in dynamic loader: 9176140 clock cycles
25723: time needed for relocation: 203783 clock cycles (2.2%)
25723: number of relocations: 0
25723: number of relocations from cache: 525
25723: number of relative relocations: 0
25723: time needed to load objects: 8405157 clock cycles (91.5%)
(evolution:25723): Gtk-WARNING **: cannot open display:
25723: runtime linker statistics:
25723:
25723: final number of relocations: 31
25723: final number of relocations from cache: 525

$ ldd `which konqueror` | wc -l
37
$ # Unprelinked system
$ LD_DEBUG=statistics konqueror 2>&1 | sed 's/^ *//'
18979: runtime linker statistics:
18979:
18979: total startup time in dynamic loader: 131985703 clock cycles
18979: time needed for relocation: 127341077 clock cycles (96.4%)
18979: number of relocations: 25473
18979: number of relocations from cache: 53594
18979: number of relative relocations: 31171
18979: time needed to load objects: 4318803 clock cycles (3.2%)
konqueror: cannot connect to X server
18979: runtime linker statistics:
18979:
18979: final number of relocations: 25759
18979: final number of relocations from cache: 53594
$ # Prelinked system
$ LD_DEBUG=statistics konqueror 2>&1 | sed 's/^ *//'
25733: runtime linker statistics:
25733:
25733: total startup time in dynamic loader: 5533696 clock cycles
25733: time needed for relocation: 1941489 clock cycles (35.0%)
25733: number of relocations: 0
25733: number of relocations from cache: 2066
25733: number of relative relocations: 0
25733: time needed to load objects: 3217736 clock cycles (58.1%)
konqueror: cannot connect to X server
25733: runtime linker statistics:
25733:
25733: final number of relocations: 0
25733: final number of relocations from cache: 2066

$ ldd `which kword` | wc -l
40
$ # Unprelinked system
$ LD_DEBUG=statistics kword 2>&1 | sed 's/^ *//'
19065: runtime linker statistics:
19065:
19065: total startup time in dynamic loader: 153684591 clock cycles
19065: time needed for relocation: 148255294 clock cycles (96.4%)
19065: number of relocations: 26231
19065: number of relocations from cache: 55833
19065: number of relative relocations: 30660
19065: time needed to load objects: 5068746 clock cycles (3.2%)
kword: cannot connect to X server
19065: runtime linker statistics:
19065:
19065: final number of relocations: 26528
19065: final number of relocations from cache: 55833
$ # Prelinked system
$ LD_DEBUG=statistics kword 2>&1 | sed 's/^ *//'
25749: runtime linker statistics:
25749:
25749: total startup time in dynamic loader: 6516635 clock cycles
25749: time needed for relocation: 2106856 clock cycles (32.3%)
25749: number of relocations: 0
25749: number of relocations from cache: 2130
25749: number of relative relocations: 0
25749: time needed to load objects: 4008585 clock cycles (61.5%)
kword: cannot connect to X server
25749: runtime linker statistics:
25749:
25749: final number of relocations: 0
25749: final number of relocations from cache: 2130

Dynamic linker statistics for unprelinked and prelinked GUI programs

For the Gtk+ applications mentioned above, the startup time spent in the dynamic linker decreased to between 11% and 17% of the original time; for the KDE applications it decreased even further, to around 4.2% of the original time.

The startup time reported by the dynamic linker is only part of the total startup time of a GUI program. Unfortunately, the total cannot be measured very accurately without patching each application separately so that it prints the current process CPU time at the point when all windows are painted and the process starts waiting for user input. The following table contains values reported by the time(1) command for each of the 4 GUI programs running under X, on both an unprelinked and a fully prelinked system. As soon as each program had painted its windows, it was killed with the application’s quit hot key.
[ Ctrl+W for Epiphany, Ctrl+Q for Evolution and Konqueror and Enter in KWord’s document type choice dialog. ]

The real time values in particular also depend on the speed of human reactions, so each measurement was repeated 10 times. All timings were done with hot caches, after running each application twice before measurement.

Table 1: GUI program start up times without and with prelinking

Type |Values (in seconds)                                        |Mean |Std dev
-----+-----------------------------------------------------------+-----+-------
     |unprelinked epiphany                                       |  µ  |  s
-----+-----------------------------------------------------------+-----+-------
real |3.053|2.84 |3.00 |2.901|3.019|2.929|2.883|2.975|2.922|3.026|2.954|0.0698
user |2.33 |2.31 |2.28 |2.32 |2.44 |2.37 |2.29 |2.35 |2.34 |2.41 |2.344|0.0508
sys  |0.2  |0.23 |0.23 |0.19 |0.19 |0.12 |0.25 |0.16 |0.14 |0.14 |0.185|0.0440
-----+-----------------------------------------------------------+-----+-------
     |prelinked epiphany                                         |  µ  |  s
-----+-----------------------------------------------------------+-----+-------
real |2.773|2.743|2.833|2.753|2.753|2.644|2.717|2.897|2.68 |2.761|2.755|0.0716
user |2.18 |2.17 |2.17 |2.12 |2.23 |2.26 |2.13 |2.17 |2.15 |2.15 |2.173|0.0430
sys  |0.13 |0.15 |0.18 |0.15 |0.11 |0.04 |0.18 |0.14 |0.1  |0.15 |0.133|0.0416
-----+-----------------------------------------------------------+-----+-------
     |unprelinked evolution                                      |  µ  |  s
-----+-----------------------------------------------------------+-----+-------
real |2.106|1.886|1.828|2.12 |1.867|1.871|2.242|1.871|1.862|2.241|1.989|0.1679
user |1.12 |1.09 |1.15 |1.19 |1.17 |1.23 |1.15 |1.11 |1.17 |1.14 |1.152|0.0408
sys  |0.1  |0.11 |0.13 |0.07 |0.1  |0.05 |0.11 |0.11 |0.09 |0.08 |0.095|0.0232
-----+-----------------------------------------------------------+-----+-------
     |prelinked evolution                                        |  µ  |  s
-----+-----------------------------------------------------------+-----+-------
real |1.684|1.621|1.686|1.72 |1.694|1.691|1.631|1.697|1.668|1.535|1.663|0.0541
user |0.92 |0.87 |0.92 |0.95 |0.79 |0.86 |0.94 |0.87 |0.89 |0.86 |0.887|0.0476
sys  |0.06 |0.1  |0.06 |0.05 |0.11 |0.08 |0.07 |0.1  |0.12 |0.07 |0.082|0.0239
-----+-----------------------------------------------------------+-----+-------
     |unprelinked kword                                          |  µ  |  s
-----+-----------------------------------------------------------+-----+-------
real |2.111|1.414|1.36 |1.356|1.259|1.383|1.28 |1.321|1.252|1.407|1.414|0.2517
user |1.04 |0.9  |0.93 |0.88 |0.89 |0.89 |0.87 |0.89 |0.9  |0.8  |0.899|0.0597
sys  |0.07 |0.04 |0.06 |0.05 |0.06 |0.1  |0.09 |0.08 |0.08 |0.12 |0.075|0.0242
-----+-----------------------------------------------------------+-----+-------
     |prelinked kword                                            |  µ  |  s
-----+-----------------------------------------------------------+-----+-------
real |1.59 |1.052|0.972|1.064|1.106|1.087|1.066|1.087|1.065|1.005|1.109|0.1735
user |0.61 |0.53 |0.58 |0.6  |0.6  |0.58 |0.59 |0.61 |0.57 |0.6  |0.587|0.0241
sys  |0.08 |0.08 |0.06 |0.06 |0.03 |0.07 |0.06 |0.03 |0.06 |0.04 |0.057|0.0183
-----+-----------------------------------------------------------+-----+-------
     |unprelinked konqueror                                      |  µ  |  s
-----+-----------------------------------------------------------+-----+-------
real |1.306|1.386|1.27 |1.243|1.227|1.286|1.262|1.322|1.345|1.332|1.298|0.0495
user |0.88 |0.86 |0.88 |0.9  |0.87 |0.83 |0.83 |0.86 |0.86 |0.89 |0.866|0.0232
sys  |0.07 |0.11 |0.12 |0.1  |0.12 |0.08 |0.13 |0.12 |0.09 |0.08 |0.102|0.0210
-----+-----------------------------------------------------------+-----+-------
     |prelinked konqueror                                        |  µ  |  s
-----+-----------------------------------------------------------+-----+-------
real |1.056|0.962|0.961|0.906|0.927|0.923|0.933|0.958|0.955|1.142|0.972|0.0722
user |0.56 |0.6  |0.56 |0.52 |0.57 |0.58 |0.5  |0.57 |0.61 |0.55 |0.562|0.0334
sys  |0.1  |0.13 |0.08 |0.15 |0.07 |0.09 |0.09 |0.09 |0.1  |0.08 |0.098|0.0244
-----+-----------------------------------------------------------+-----+-------

OpenOffice.org is probably the largest program on Linux these days, and it is mostly written in C++.
In OpenOffice.org 1.1, the main executable, soffice.bin, links directly against 34 shared libraries, but during startup it typically loads many others using dlopen. As mentioned earlier, prelink cannot speed up loading of shared libraries using dlopen, since it cannot predict in which order and which shared libraries will be loaded (and thus cannot compute conflict fixups). soffice.bin is typically started through a wrapper script and, depending on what arguments are passed to it, a different OpenOffice.org application is started. With no options, it starts just an empty window with a menu from which the applications can be started; with, say, the private:factory/swriter argument it starts a word processor, with private:factory/scalc it starts a spreadsheet, etc. When soffice.bin is already running and another copy of it is started, the new copy just instructs the already running copy to pop up a new window and exits.

In an experiment, soffice.bin was invoked 7 times against a running X server: with no arguments, with the private:factory/swriter, private:factory/scalc, private:factory/sdraw, private:factory/simpress and private:factory/smath arguments (in all these cases nothing was pressed at all), and last with the private:factory/swriter argument where the menu item New Presentation was selected and the word processor window closed. In all these cases, the /proc/`pidof soffice.bin`/maps file was captured and the application then killed. This file contains, among other things, a list of all shared libraries mmapped by the process at the point where it started waiting for user input after loading. These lists were then summarized to get, for each shared library, the number of runs out of the total 7 in which it was loaded. There were 38 shared libraries shipped as part of the OpenOffice.org package which were loaded in all 7 runs, and another 3 shared libraries included in OpenOffice.org (and also one shared library shipped in another package, libdb_cxx-4.1.so) which were loaded 6 times.
[ In all runs except the one without arguments. But when the application is started without any arguments, it cannot do any useful work, so one loads one of the applications afterward anyway. ]

There was one shared library loaded in 5 runs, but it was locale specific and thus not worth considering. Inspection of the OpenOffice.org source shows that these shared libraries are never unloaded with dlclose, so soffice.bin can be made much more prelink friendly, and thus save a substantial amount of startup time, by linking against all those 76 shared libraries instead of just the 34 it is linked against.

In the timings below, soffice1.bin is the original soffice.bin as created by the OpenOffice.org makefiles and soffice3.bin is the same executable linked dynamically against the additional 42 shared libraries. The ordering of those 42 shared libraries matters for the number of conflict fixups. Unfortunately, with large C++ shared libraries there is no obvious rule for ordering them, as sometimes it is more useful when a shared library precedes its dependency and sometimes vice versa, so a few different orderings were tried in several steps and the one with the smallest number of conflict fixups was always chosen. Still, the number of conflict fixups is quite high, and a big part of the fixups store addresses of PLT slots in the executable into various places in shared libraries.

[ This might get better when the linker is modified to handle specially those functions in executables which are called but whose address is never taken, but only testing will actually show it. ]

soffice2.bin is another experiment, where the executable itself is built from an empty source file, and all objects which were originally in the soffice.bin executable, with the exception of start files, were recompiled as position independent code and linked into a new shared library. This reduced the number of conflicts a lot and sped up startup times relative to soffice3.bin when caches are hot.
It is a little bit slower than soffice3.bin when running with cold caches (e.g. for the first time after bootup), as there is one more shared library to load.

In the timings below, the numbers for soffice1.bin cannot easily be compared with those for soffice2.bin and soffice3.bin, as soffice1.bin loads fewer than half of the needed shared libraries which the other two executables load, and the time to load the remaining shared libraries doesn't show up there. Still, when prelinked, soffice2.bin takes roughly two and a half times as long to load as soffice1.bin, and that time is still less than 7% of how long it takes to load just the initial 34 shared libraries when not prelinking.

    $ S='s/^ *//'
    $ ldd /usr/lib/openoffice/program/soffice1.bin | wc -l
    34
    $ # Unprelinked system
    $ LD_DEBUG=statistics /usr/lib/openoffice/program/soffice1.bin 2>&1 | sed "$S"
    19095: runtime linker statistics:
    19095:
    19095: total startup time in dynamic loader: 159833582 clock cycles
    19095: time needed for relocation: 155464174 clock cycles (97.2%)
    19095: number of relocations: 31136
    19095: number of relocations from cache: 31702
    19095: number of relative relocations: 18284
    19095: time needed to load objects: 3919645 clock cycles (2.4%)
    /usr/lib/openoffice/program/soffice1.bin X11 error: Can't open display:
    Set DISPLAY environment variable, use -display option
    or check permissions of your X-Server
    (See "man X" resp. "man xhost" for details)
    19095: runtime linker statistics:
    19095:
    19095: final number of relocations: 31715
    19095: final number of relocations from cache: 31702
    $ # Prelinked system
    $ LD_DEBUG=statistics /usr/lib/openoffice/program/soffice1.bin 2>&1 | sed "$S"
    25759: runtime linker statistics:
    25759:
    25759: total startup time in dynamic loader: 4252397 clock cycles
    25759: time needed for relocation: 1189840 clock cycles (27.9%)
    25759: number of relocations: 0
    25759: number of relocations from cache: 2142
    25759: number of relative relocations: 0
    25759: time needed to load objects: 2604486 clock cycles (61.2%)
    /usr/lib/openoffice/program/soffice1.bin X11 error: Can't open display:
    Set DISPLAY environment variable, use -display option
    or check permissions of your X-Server
    (See "man X" resp. "man xhost" for details)
    25759: runtime linker statistics:
    25759:
    25759: final number of relocations: 24
    25759: final number of relocations from cache: 2142

    $ ldd /usr/lib/openoffice/program/soffice2.bin | wc -l
    77
    $ # Unprelinked system
    $ LD_DEBUG=statistics /usr/lib/openoffice/program/soffice2.bin 2>&1 | sed "$S"
    19115: runtime linker statistics:
    19115:
    19115: total startup time in dynamic loader: 947793670 clock cycles
    19115: time needed for relocation: 936895741 clock cycles (98.8%)
    19115: number of relocations: 69164
    19115: number of relocations from cache: 94502
    19115: number of relative relocations: 59374
    19115: time needed to load objects: 10046486 clock cycles (1.0%)
    /usr/lib/openoffice/program/soffice2.bin X11 error: Can't open display:
    Set DISPLAY environment variable, use -display option
    or check permissions of your X-Server
    (See "man X" resp. "man xhost" for details)
    19115: runtime linker statistics:
    19115:
    19115: final number of relocations: 69966
    19115: final number of relocations from cache: 94502
    $ # Prelinked system
    $ LD_DEBUG=statistics /usr/lib/openoffice/program/soffice2.bin 2>&1 | sed "$S"
    25777: runtime linker statistics:
    25777:
    25777: total startup time in dynamic loader: 10952099 clock cycles
    25777: time needed for relocation: 3254518 clock cycles (29.7%)
    25777: number of relocations: 0
    25777: number of relocations from cache: 5309
    25777: number of relative relocations: 0
    25777: time needed to load objects: 6805013 clock cycles (62.1%)
    /usr/lib/openoffice/program/soffice2.bin X11 error: Can't open display:
    Set DISPLAY environment variable, use -display option
    or check permissions of your X-Server
    (See "man X" resp. "man xhost" for details)
    25777: runtime linker statistics:
    25777:
    25777: final number of relocations: 24
    25777: final number of relocations from cache: 5309

    $ ldd /usr/lib/openoffice/program/soffice3.bin | wc -l
    76
    $ # Unprelinked system
    $ LD_DEBUG=statistics /usr/lib/openoffice/program/soffice3.bin 2>&1 | sed "$S"
    19131: runtime linker statistics:
    19131:
    19131: total startup time in dynamic loader: 852275754 clock cycles
    19131: time needed for relocation: 840996859 clock cycles (98.6%)
    19131: number of relocations: 68362
    19131: number of relocations from cache: 89213
    19131: number of relative relocations: 55831
    19131: time needed to load objects: 10170207 clock cycles (1.1%)
    /usr/lib/openoffice/program/soffice3.bin X11 error: Can't open display:
    Set DISPLAY environment variable, use -display option
    or check permissions of your X-Server
    (See "man X" resp. "man xhost" for details)
    19131: runtime linker statistics:
    19131:
    19131: final number of relocations: 69177
    19131: final number of relocations from cache: 89213
    $ # Prelinked system
    $ LD_DEBUG=statistics /usr/lib/openoffice/program/soffice3.bin 2>&1 | sed "$S"
    25847: runtime linker statistics:
    25847:
    25847: total startup time in dynamic loader: 12277407 clock cycles
    25847: time needed for relocation: 4232915 clock cycles (34.4%)
    25847: number of relocations: 0
    25847: number of relocations from cache: 8961
    25847: number of relative relocations: 0
    25847: time needed to load objects: 6925023 clock cycles (56.4%)
    /usr/lib/openoffice/program/soffice3.bin X11 error: Can't open display:
    Set DISPLAY environment variable, use -display option
    or check permissions of your X-Server
    (See "man X" resp. "man xhost" for details)
    25847: runtime linker statistics:
    25847:
    25847: final number of relocations: 24
    25847: final number of relocations from cache: 8961

Table 2: OpenOffice.org start up times without and with prelinking

Type|Values (in seconds)                                        |Avg  |stddev
----+-----------------------------------------------------------+-----+------
unprelinked soffice1.bin private:factory/swriter                |  µ  |  s
----+-----------------------------------------------------------+-----+------
real|5.569|5.149|5.547|5.559|5.549|5.139|5.55 |5.559|5.598|5.559|5.478|0.1765
user|4.65 |4.57 |4.62 |4.64 |4.57 |4.55 |4.65 |4.49 |4.52 |4.46 |4.572|0.0680
sys |0.29 |0.24 |0.19 |0.21 |0.21 |0.21 |0.25 |0.25 |0.27 |0.26 |0.238|0.0319
----+-----------------------------------------------------------+-----+------
prelinked soffice1.bin private:factory/swriter                  |  µ  |  s
----+-----------------------------------------------------------+-----+------
real|4.946|4.899|5.291|4.879|4.879|4.898|5.299|4.901|4.887|4.901|4.978|0.1681
user|4.23 |4.27 |4.18 |4.24 |4.17 |4.22 |4.15 |4.25 |4.26 |4.31 |4.228|0.0494
sys |0.22 |0.22 |0.24 |0.26 |0.3  |0.26 |0.29 |0.17 |0.21 |0.23 |0.24 |0.0389
----+-----------------------------------------------------------+-----+------
unprelinked soffice2.bin private:factory/swriter                |  µ  |  s
----+-----------------------------------------------------------+-----+------
real|5.575|5.166|5.592|5.149|5.571|5.559|5.159|5.157|5.569|5.149|5.365|0.2201
user|4.59 |4.5  |4.57 |4.37 |4.47 |4.57 |4.56 |4.41 |4.63 |4.5  |4.517|0.0826
sys |0.24 |0.24 |0.21 |0.34 |0.27 |0.19 |0.19 |0.27 |0.19 |0.29 |0.243|0.0501
----+-----------------------------------------------------------+-----+------
prelinked soffice2.bin private:factory/swriter                  |  µ  |  s
----+-----------------------------------------------------------+-----+------
real|3.69 |3.66 |3.658|3.661|3.639|3.638|3.649|3.659|3.65 |3.659|3.656|0.0146
user|2.93 |2.88 |2.88 |2.9  |2.84 |2.63 |2.89 |2.85 |2.77 |2.83 |2.84 |0.0860
sys |0.22 |0.18 |0.23 |0.2  |0.18 |0.29 |0.22 |0.23 |0.24 |0.22 |0.221|0.0318
----+-----------------------------------------------------------+-----+------
unprelinked soffice3.bin private:factory/swriter                |  µ  |  s
----+-----------------------------------------------------------+-----+------
real|5.031|5.02 |5.009|5.028|5.019|5.019|5.019|5.052|5.426|5.029|5.065|0.1273
user|4.31 |4.35 |4.34 |4.3  |4.38 |4.29 |4.45 |4.37 |4.38 |4.44 |4.361|0.0547
sys |0.27 |0.25 |0.26 |0.27 |0.27 |0.31 |0.18 |0.17 |0.16 |0.15 |0.229|0.0576
----+-----------------------------------------------------------+-----+------
prelinked soffice3.bin private:factory/swriter                  |  µ  |  s
----+-----------------------------------------------------------+-----+------
real|3.705|3.669|3.659|3.669|3.66 |3.659|3.659|3.661|3.668|3.649|3.666|0.0151
user|2.86 |2.88 |2.85 |2.84 |2.83 |2.86 |2.84 |2.91 |2.86 |2.8  |2.853|0.0295
sys |0.26 |0.19 |0.27 |0.25 |0.24 |0.23 |0.28 |0.21 |0.21 |0.27 |0.241|0.0303
----+-----------------------------------------------------------+-----+------

============================================================================

15 Similar tools on other ELF using Operating Systems

Something
similar to prelink is available on other ELF platforms. On Irix there is QUICKSTART and on Solaris there is crle.

Of these two, SGI QUICKSTART is much closer to prelink. The rqs program relocates libraries to (if possible) unique virtual address space slots. The base address is either specified on the command line with the -l option, or rqs uses a so_locations registry with the -c or -u options and finds a not yet occupied slot. This is similar to how prelink lays out libraries without the -m option.

QUICKSTART uses the same data structure for library lists (ElfNN_Lib) as prelink, but uses more fields in it (prelink doesn't use the l_version and l_flags fields at the moment) and uses different dynamic tags and a different section type for it. Another difference is that QUICKSTART makes all liblist sections SHF_ALLOC, whether in shared libraries or executables. prelink only needs the liblist section in the executable to be allocated; liblist sections in shared libraries are not allocated and are used at prelink time only.

The biggest difference between QUICKSTART and prelink is in how conflicts are encoded. SGI stores them in a very compact format, as an array of .dynsym section indexes of the conflicting symbols. There is no publicly available information about what exactly the SGI dynamic linker does when resolving the conflicts, so the following is just a guess. Given that the conflicts can be stored in a shared library or executable different from the shared library with the relocations against the conflicting symbol, and different from the shared library which the symbol was originally resolved to, there doesn't seem to be an obvious way to handle the conflicts very cheaply. The dynamic linker probably collects a list of all conflicting symbol names, computes the ELF hash of each such symbol and walks the hash buckets for this hash in all shared libraries, looking for the symbol. Every time it finds the symbol, all relocations against it need to be redone.
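The kind of lookup guessed at above can be illustrated with a minimal C sketch. It is not SGI's implementation, which is not publicly documented. elf_hash is the standard System V ELF hash function from the generic ABI; DemoLib and demo_lookup are hypothetical names, and the structure is a simplified stand-in for one library's .hash section and symbol names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Standard System V ELF hash function, as given in the generic ABI. */
uint32_t elf_hash(const char *name)
{
    uint32_t h = 0, g;

    assert(name != NULL);
    for (const unsigned char *p = (const unsigned char *)name; *p; p++) {
        h = (h << 4) + *p;
        g = h & 0xf0000000u;
        if (g)
            h ^= g >> 24;
        h &= ~g;
    }
    return h;
}

/* Hypothetical, simplified view of one library's hash table. */
typedef struct {
    uint32_t nbucket;          /* number of hash buckets */
    const uint32_t *bucket;    /* first symbol index of each bucket */
    const uint32_t *chain;     /* next symbol index; 0 ends a chain */
    const char *const *names;  /* symbol names; index 0 is the null symbol */
} DemoLib;

/* Walk the bucket chain for name; return its symbol index, or 0 if absent. */
uint32_t demo_lookup(const DemoLib *lib, const char *name)
{
    assert(lib != NULL && lib->nbucket != 0);
    uint32_t i = lib->bucket[elf_hash(name) % lib->nbucket];

    while (i != 0 && strcmp(lib->names[i], name) != 0)
        i = lib->chain[i];
    return i;
}
```

A dynamic linker following the guessed scheme would call something like demo_lookup once per conflicting symbol for every library in the search scope, which is the startup cost that conflict fixups stored as ready-made relocations avoid.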
Unlike QUICKSTART, prelink stores conflicts as an array of ElfNN_Rela structures, with one entry for each relocation against a conflicting symbol in some shared library. This guarantees that there are no symbol lookups during program startup (provided that shared libraries have not been changed after prelinking), while QUICKSTART will do some symbol lookups if there are any conflicts. QUICKSTART puts conflict sections into the executable and every shared library where rqs determines conflicts, while prelink stores them in the executable only (but the array is typically much bigger). Disk space requirements for prelinked executables are certainly bigger than for requickstarted executables, but which one has bigger runtime memory requirements is unclear. If prelinking can be used, all .rela* and .rel* sections in the executable and all shared libraries are skipped, so they will not need to be paged in during the whole program's life (with the exception of the first and last pages of the relocation sections, which can be paged in because of other sections on the same page), but the whole .gnu.conflict section needs to be paged in (read-only) and processed. With QUICKSTART, probably all the (much smaller) conflict sections need to be paged in, and likely also, for each conflict, the whole relocation sections of each library the conflict needs to be applied against.

In the QUICKSTART documentation, SGI says that conflicts are very costly and that developers should avoid them. Unfortunately, this is sometimes quite hard, especially with C++ shared libraries. It is unclear whether rqs does any optimizations to trim down the number of conflicts.

Sun took a completely different approach. The dynamic linker provides a dldump (const char *ipath, const char *opath, int flags); function. ipath is supposed to be a path to an ELF object already loaded in the current process.
This function creates a new ELF object at opath, which is like the ipath object, but relocated to the base address at which it has actually been mapped in the current process, and with some relocations (specified in the flags bitmask) applied as they have been resolved in the current process. Relocations which have been applied are overwritten in the relocation sections with R_*_NONE relocations. The crle executable, in addition to other functions not related to startup times, uses the dldump function when given some specific options to dump all shared libraries a particular executable uses (and the executable itself) into a new directory, with selected relocation classes already applied. The main disadvantage of this approach is that such alternate shared libraries are, at least for most relocation classes, not shareable across different programs at all (and for those classes where they could be shareable to some degree, many relocations are left for the dynamic linker, so the speed gains will be small). Another disadvantage is that all relocation sections need to be paged into memory just to find out that most of the relocations are R_*_NONE.

============================================================================

16 ELF extensions for prelink

Prelink needs a few ELF extensions for its data structures in ELF objects.
For the list of dependencies at the time of prelinking, a new section type SHT_GNU_LIBLIST is defined:

    #define SHT_GNU_LIBLIST 0x6ffffff7 /* Prelink library list */

    typedef struct
    {
      Elf32_Word l_name;       /* Name (string table index) */
      Elf32_Word l_time_stamp; /* Timestamp */
      Elf32_Word l_checksum;   /* Checksum */
      Elf32_Word l_version;    /* Unused, should be zero */
      Elf32_Word l_flags;      /* Unused, should be zero */
    } Elf32_Lib;

    typedef struct
    {
      Elf64_Word l_name;       /* Name (string table index) */
      Elf64_Word l_time_stamp; /* Timestamp */
      Elf64_Word l_checksum;   /* Checksum */
      Elf64_Word l_version;    /* Unused, should be zero */
      Elf64_Word l_flags;      /* Unused, should be zero */
    } Elf64_Lib;

    New structures and section type constants used by prelink

Prelink introduces a few new special sections:

Table 3: Special sections introduced by prelink

Name               | Type            | Attributes
-------------------+-----------------+-----------
                   | In shared libraries
-------------------+-----------------+-----------
.gnu.liblist       | SHT_GNU_LIBLIST | 0
.gnu.libstr        | SHT_STRTAB      | 0
.gnu.prelink_undo  | SHT_PROGBITS    | 0
-------------------+-----------------+-----------
                   | In executables
-------------------+-----------------+-----------
.gnu.liblist       | SHT_GNU_LIBLIST | SHF_ALLOC
.gnu.conflict      | SHT_RELA        | SHF_ALLOC
.gnu.prelink_undo  | SHT_PROGBITS    | 0

• .gnu.liblist
  This section contains one ElfNN_Lib structure for each shared library which the object has been prelinked against, in the order in which they appear in the symbol search scope. The section's sh_link value should contain the section index of .gnu.libstr for shared libraries and the section index of .dynstr for executables. The l_name field contains the dependent library's name as an index into the section pointed to by the sh_link field. l_time_stamp and l_checksum should contain copies of the DT_GNU_PRELINKED and DT_CHECKSUM values, respectively, of the dependent library.

• .gnu.conflict
  This section contains one ElfNN_Rela structure for each needed prelink conflict fixup. The r_offset field contains the absolute address at which the fixup needs to be applied, and r_addend the value that needs to be stored at that location. ELFNN_R_SYM of the r_info field should be zero, and ELFNN_R_TYPE of the r_info field should be an architecture specific relocation type, which should be handled the same way as for .rela.* sections on that architecture. For the EM_ALPHA machine, all types with R_ALPHA_JMP_SLOT in the lowest 8 bits of ELF64_R_TYPE should be handled as R_ALPHA_JMP_SLOT relocations; the upper 24 bits contain the index in the original .rela.plt section of the R_ALPHA_JMP_SLOT relocation the fixup was created for.

• .gnu.libstr
  This section contains strings for the .gnu.liblist section in shared libraries, where the .gnu.liblist section is not allocated.

• .gnu.prelink_undo
  This section contains prelink private data used for the prelink --undo operation. This data includes the original ElfNN_Ehdr of the object before prelinking and all its original ElfNN_Phdr and ElfNN_Shdr headers.

Prelink also defines 6 new dynamic tags:

    #define DT_GNU_PRELINKED  0x6ffffdf5 /* Prelinking timestamp */
    #define DT_GNU_CONFLICTSZ 0x6ffffdf6 /* Size of conflict section */
    #define DT_GNU_LIBLISTSZ  0x6ffffdf7 /* Size of library list */
    #define DT_CHECKSUM       0x6ffffdf8 /* Library checksum */
    #define DT_GNU_CONFLICT   0x6ffffef8 /* Start of conflict section */
    #define DT_GNU_LIBLIST    0x6ffffef9 /* Library list */

    Prelink dynamic tags

The DT_GNU_PRELINKED and DT_CHECKSUM dynamic tags must be present in prelinked shared libraries. The d_un.d_val field of DT_GNU_PRELINKED should contain the time when the library was prelinked (in seconds since January 1st, 1970, 00:00 UTC). The d_un.d_val field of DT_CHECKSUM should contain the CRC32 checksum of all sections which have at least one of the SHF_ALLOC, SHF_WRITE or SHF_EXECINSTR bits set and whose type is not SHT_NOBITS, in the order they appear in the shared library's section header table, with the DT_GNU_PRELINKED and DT_CHECKSUM d_un.d_val values set to 0 for the duration of the checksum computation.
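The .gnu.conflict semantics described above (store r_addend at the absolute address r_offset, with a zero symbol index) can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the dynamic linker's code: Rela64 is a local stand-in for Elf64_Rela, DEMO_R_WORD64 is a hypothetical relocation type that stores a 64-bit word (real types are architecture specific), and apply_conflicts works on a simulated memory image rather than on the live address space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in for Elf64_Rela; real code would use <elf.h>. */
typedef struct {
    uint64_t r_offset;  /* absolute address where the fixup is applied */
    uint64_t r_info;    /* symbol index (high 32 bits), type (low 32 bits) */
    int64_t  r_addend;  /* value to be stored at r_offset */
} Rela64;

#define R_SYM(info)  ((uint32_t)((info) >> 32))
#define R_TYPE(info) ((uint32_t)((info) & 0xffffffffu))

/* Hypothetical relocation type: store r_addend as a 64-bit word. */
#define DEMO_R_WORD64 1u

/*
 * Apply conflict fixups to a simulated memory image whose first byte
 * corresponds to virtual address image_base.  No symbol lookup is
 * needed: every entry already carries the final value.
 * Returns 0 on success, -1 on a malformed or out-of-range entry.
 */
int apply_conflicts(unsigned char *image, size_t image_size,
                    uint64_t image_base,
                    const Rela64 *conflicts, size_t count)
{
    assert(image != NULL && (conflicts != NULL || count == 0));

    for (size_t i = 0; i < count; i++) {
        const Rela64 *r = &conflicts[i];

        /* The symbol index must be zero in conflict fixups. */
        if (R_SYM(r->r_info) != 0 || R_TYPE(r->r_info) != DEMO_R_WORD64)
            return -1;
        if (r->r_offset < image_base)
            return -1;

        uint64_t off = r->r_offset - image_base;
        if (off > image_size || image_size - off < sizeof(uint64_t))
            return -1;

        uint64_t value = (uint64_t)r->r_addend;
        memcpy(image + off, &value, sizeof value);
    }
    return 0;
}
```

The loop does a fixed amount of work per entry, which is why a large .gnu.conflict array can still be cheaper at startup than a smaller list of symbols that each need a lookup.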
The DT_GNU_LIBLIST and DT_GNU_LIBLISTSZ dynamic tags must be present in all prelinked executables. The d_un.d_ptr value of the DT_GNU_LIBLIST dynamic tag contains the virtual address of the .gnu.liblist section in the executable and d_un.d_val of DT_GNU_LIBLISTSZ tag contains its size in bytes.

DT_GNU_CONFLICT and DT_GNU_CONFLICTSZ dynamic tags may be present in prelinked executables. d_un.d_ptr of DT_GNU_CONFLICT dynamic tag contains the virtual address of .gnu.conflict section in the executable (if present) and d_un.d_val of DT_GNU_CONFLICTSZ tag contains its size in bytes.

References

[1] System V Application Binary Interface, Edition 4.1.
    http://www.caldera.com/developers/devspecs/gabi41.pdf
[2] System V Application Binary Interface, Intel 386 Architecture Processor Supplement.
    http://www.caldera.com/developers/devspecs/abi386-4.pdf
[3] System V Application Binary Interface, AMD64 Architecture Processor Supplement.
    http://www.x86-64.org/cgi-bin/cvsweb.cgi/x86-64-ABI/
[4] System V Application Binary Interface, Intel Itanium Architecture Processor Supplement, Intel Corporation, 2001.
    http://refspecs.freestandards.org/elf/IA64-SysV-psABI.pdf
[5] Steve Zucker, Kari Karhi, System V Application Binary Interface, PowerPC Architecture Processor Supplement, SunSoft, IBM, 1995.
    http://refspecs.freestandards.org/elf/elfspec_ppc.pdf
[6] System V Application Binary Interface, PowerPC64 Architecture Processor Supplement.
    ftp://ftp.linuxppc64.org/pub/people/amodra/PPC-elf64abi.txt.gz
[7] System V Application Binary Interface, ARM Architecture Processor Supplement.
    http://www.arm.com/support/566FHT/$File/ARMELF.pdf
[8] SPARC Compliance Definition, Version 2.4.1, SPARC International, Inc., 1999.
    http://www.sparc.com/standards/SCD.2.4.1.ps.Z
[9] Ulrich Drepper, How To Write Shared Libraries, Red Hat, Inc., 2003.
    http://people.redhat.com/drepper/dsohowto.pdf
[10] Linker And Library Guide, Sun Microsystems, 2002.
    http://docs.sun.com/db/doc/816-1386
[11] John R. Levine, Linkers and Loaders, 1999.
    http://www.gzlinux.org/docs/category/dev/c/linkerandloader.pdf
[12] Ulrich Drepper, ELF Handling For Thread-Local Storage, Red Hat, Inc., 2003.
    http://people.redhat.com/drepper/tls.pdf
[13] Alan Modra, PowerPC Specific Thread Local Storage ABI, 2003.
    ftp://ftp.linuxppc64.org/pub/people/amodra/ppc32tls.txt.gz
[14] Alan Modra, PowerPC64 Specific Thread Local Storage ABI, 2003.
    ftp://ftp.linuxppc64.org/pub/people/amodra/ppc64tls.txt.gz
[15] DWARF Debugging Information Format Version 2.
    http://www.eagercon.com/dwarf/dwarf-2.0.0.pdf
[16] DWARF Debugging Information Format Version 3, Draft, 2001.
    http://reality.sgiweb.org/davea/dwarf3-draft8-011125.pdf
[17] The "stabs" debugging information format.
    http://sources.redhat.com/cgi-bin/cvsweb.cgi/src/gdb/doc/stabs.texinfo?cvsroot=src

2003-11-03 First draft.