We seem to have inconsistent bias behavior for the build-id in ET_EXEC, and the problem seems to come from missing debuginfo. Note that prelink is often another culprit of bias issues, but I did "prelink -u" for these first.

For my first point of comparison, consider ET_DYN /usr/bin/stap, whether or not I have systemtap-debuginfo installed:

$ stap -e 'probe process.plt {next}' -c /usr/bin/stap \
    -p3 -vv --poison-cache |& grep build-id
Found build-id in /usr/bin/stap, length 20, start at 0x284

For relocatable binaries, it's reasonable that we'd get a plain file offset.

Now ET_EXEC /usr/local/bin/stap, with debuginfo baked in:

$ stap -e 'probe process.plt {next}' -c /usr/local/bin/stap \
    -p3 -vv --poison-cache |& grep build-id
Found build-id in /usr/local/bin/stap, length 20, start at 0x400284

It's not relocatable, and now we have an absolute address, ok.

Now ET_EXEC /usr/bin/ls without coreutils-debuginfo:

$ stap -e 'probe process.plt {next}' -c /usr/bin/ls \
    -p3 -vv --poison-cache |& grep build-id
Found build-id in /usr/bin/ls, length 20, start at 0x284

So that's inconsistent -- not relocatable, but it's a file offset.

Now ET_EXEC /usr/bin/ls *with* coreutils-debuginfo:

$ stap -e 'probe process.plt {next}' -c /usr/bin/ls \
    -p3 -vv --poison-cache |& grep build-id
Found build-id in /usr/bin/ls, length 20, start at 0x400284

We got the absolute address back!

That "Found build-id" line is in translate.cxx dump_build_id(). In a debugger I can see that dwfl_module_build_id() is giving 0x400284 either way, but when debuginfo is missing, the dwfl_module_relocate_address() call kills the absolute bias.
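To make the two reported values concrete, here is a minimal sketch of the address adjustment as I understand it. The struct and function names are hypothetical stand-ins, not the libdwfl API, and the 0x400000 base is an assumption inferred from the difference between 0x400284 and 0x284 above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ELF e_type values as defined by the ELF specification. */
enum { ET_EXEC = 2, ET_DYN = 3 };

/* Hypothetical, simplified stand-in for a module record; only the two
   fields needed for this illustration are modeled. */
struct model_module {
    int e_type;
    uint64_t low_addr;   /* assumed load base of the module */
};

/* Simplified model of the adjustment: a module treated as ET_DYN gets
   addresses relative to low_addr, anything else stays absolute. */
static uint64_t model_relocate_address(const struct model_module *mod,
                                       uint64_t addr)
{
    assert(mod != NULL);
    if (mod->e_type == ET_DYN)
        return addr - mod->low_addr;
    return addr;
}
```

Under this model, the same input 0x400284 comes back unchanged while the module is treated as ET_EXEC and comes back as 0x284 once it is treated as ET_DYN, which matches the two "start at" values reported for /usr/bin/ls.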
Here's another example, probing function("_start") because that will resolve from the symbol table either way. You can see this with "main" too, but it will be resolving from debuginfo when available, so it's a very different path.

With coreutils-debuginfo:

$ stap -e 'probe process.function("_start") {next}' -c /usr/bin/ls -p2
# probes
process("/usr/bin/ls").function("_start") /* pc=.absolute+0x4e3c */ /* <- process("/usr/bin/ls").function("_start") */

Without coreutils-debuginfo:

$ stap -e 'probe process.function("_start") {next}' -c /usr/bin/ls -p2
# probes
process("/usr/bin/ls").function("_start") /* pc=.dynamic+0x4e3c */ /* <- process("/usr/bin/ls").function("_start") */

This explains why we aren't bitten by the build-id more often. For inode-uprobes, we always ultimately use a file-offset "address", but .absolute/.dynamic affects how we get task_finder callbacks. For .absolute, we use a process callback and fake a 0 "relocation", so having an absolute build-id address from there works fine. For .dynamic, we use an mmap callback where we know the relocation, so having a relative build-id address also works.

But process.plt is always giving me .absolute, which fails if the build-id address was relative. Unless it happens to follow a function probe, in which case it becomes .dynamic too. :/

So maybe process.plt just needs to trigger something in dwfl to make it always follow suit?

(Honestly, I'd rather get rid of the ".absolute" concept, and convert everything to ".dynamic" with relative addresses, but that may be more invasive.)
Consider:

$ ./stap -e 'probe process.plt("strstr"),process.function("_start") {next}' -c /usr/bin/ls --poison-cache -p2
# probes
process("/usr/bin/ls").statement(0x402c00) /* pc=.absolute+0x2c00 */ /* <- process("/usr/bin/ls").plt("strstr").statement(0x402c00) */
process("/usr/bin/ls").function("_start") /* pc=.dynamic+0x4e3c */ /* <- process("/usr/bin/ls").plt("strstr"),process("/usr/bin/ls").function("_start") */

In one run, we changed our mind from .absolute to .dynamic!?!

We make this decision in dwflpp::relocate_address, which looks at dwfl_module_relocations(). That function will return 0 if mod->e_type is ET_EXEC, or 1 if mod->e_type is ET_DYN. And sure enough, the e_type is changing in the middle of this run. A hardware watchpoint tells me where:

libdwfl/dwfl_module_getdwarf.c
134│   mod->e_type = ehdr->e_type;
135│
136│   /* Relocatable Linux kernels are ET_EXEC but act like ET_DYN. */
137│   if (mod->e_type == ET_EXEC && file->vaddr != mod->low_addr)
138├>    mod->e_type = ET_DYN;

(gdb) bt
#0  0x000000370481dc4b in open_elf (file=file@entry=0x2185be0, mod=<optimized out>, mod=<optimized out>) at dwfl_module_getdwarf.c:138
#1  0x000000370481e4b1 in find_aux_sym (aux_strshndx=<synthetic pointer>, aux_xndxscn=<synthetic pointer>, aux_symscn=<synthetic pointer>, mod=0x2185b60) at dwfl_module_getdwarf.c:907
#2  find_symtab (mod=mod@entry=0x2185b60) at dwfl_module_getdwarf.c:1022
#3  0x000000370481ee8e in dwfl_module_getsymtab (mod=0x2185b60) at dwfl_module_getdwarf.c:1259
#4  0x00000000004e4c24 in symbol_table::get_from_elf (this=0x2188fb0) at ../tapsets.cxx:7806

So when it opened the aux minisymtab (.gnu_debugdata), this triggered a kernel heuristic that really should not apply to this case. FWIW, file->vaddr = 0x400020, and mod->low_addr = 0x400000.
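The flip can be reproduced in isolation with a small model of the quoted lines 134-138. This is only a sketch: the structs and function names are hypothetical stand-ins for the libdwfl internals, and the main file's vaddr of 0x400000 is my assumption (only the aux file's 0x400020 and the low_addr of 0x400000 were observed).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ELF e_type values as defined by the ELF specification. */
enum { ET_EXEC = 2, ET_DYN = 3 };

/* Hypothetical stand-ins; only the fields mentioned above are modeled. */
struct model_file   { uint64_t vaddr; int ehdr_e_type; };
struct model_module { uint64_t low_addr; int e_type; };

/* Models the quoted snippet: every opened file (main, debug, or aux)
   overwrites the module's e_type and re-runs the kernel heuristic. */
static void model_open_elf(struct model_module *mod,
                           const struct model_file *file)
{
    assert(mod != NULL && file != NULL);
    mod->e_type = file->ehdr_e_type;
    /* Relocatable Linux kernels are ET_EXEC but act like ET_DYN. */
    if (mod->e_type == ET_EXEC && file->vaddr != mod->low_addr)
        mod->e_type = ET_DYN;
}

/* Models the described result of dwfl_module_relocations():
   0 for ET_EXEC, 1 for ET_DYN. */
static int model_relocations(const struct model_module *mod)
{
    assert(mod != NULL);
    return mod->e_type == ET_DYN ? 1 : 0;
}
```

Opening a main file with vaddr 0x400000 leaves the model at ET_EXEC (0 relocations), and then opening an aux file with vaddr 0x400020 flips it to ET_DYN (1 relocation), which is the mid-run change seen above.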
Should be fixed by elfutils commit 65cefbd0793c0f9e90a326d7bebf0a47c93294ad

Author: Josh Stone <jistone@redhat.com>
Date:   Tue Mar 11 10:19:28 2014 -0700

    libdwfl: dwfl_module_getdwarf.c (open_elf) only (re)set mod->e_type once.

    As noted in https://sourceware.org/bugzilla/show_bug.cgi?id=16676#c2
    for systemtap, the heuristic used by open_elf to set the kernel
    Dwfl_Module type to ET_DYN, even if the underlying ELF file e_type was
    set to ET_EXEC, could trigger erroneously for non-kernel/non-main
    (debug or aux) files. Make sure we only set the e_type of the module
    once when processing the main file (when the phdrs can be trusted).
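For readers who don't want to dig up the patch, the idea in the commit message can be sketched roughly as below. This is an illustration of "only set e_type when processing the main file", not the actual elfutils code; the structs, the is_main and main_opened flags, and the function name are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ELF e_type values as defined by the ELF specification. */
enum { ET_EXEC = 2, ET_DYN = 3 };

/* Hypothetical stand-ins for the libdwfl structures. */
struct model_file   { uint64_t vaddr; int ehdr_e_type; bool is_main; };
struct model_module { uint64_t low_addr; int e_type; bool main_opened; };

/* Sketch of the fixed behavior: e_type is derived from the main file
   only, so debug and aux files can no longer trip the kernel heuristic. */
static void model_open_elf_fixed(struct model_module *mod,
                                 const struct model_file *file)
{
    assert(mod != NULL && file != NULL);
    if (file->is_main) {
        mod->e_type = file->ehdr_e_type;
        /* Relocatable Linux kernels are ET_EXEC but act like ET_DYN. */
        if (mod->e_type == ET_EXEC && file->vaddr != mod->low_addr)
            mod->e_type = ET_DYN;
        mod->main_opened = true;
    } else {
        /* Assumed invariant: the main file is always opened first. */
        assert(mod->main_opened);
    }
}
```

With this shape, opening the aux file with vaddr 0x400020 after the main file leaves the module at ET_EXEC, while a main file whose vaddr really differs from low_addr (the relocatable-kernel case) still becomes ET_DYN.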
I confirmed that with elfutils-0.158-2.fc21, ET_EXEC stays ".absolute" in all cases.