build id computation

Wed Nov 19 01:53:00 GMT 2008

> For computing the build id following things are used:
> 
> - the ELF header, without e_phoff and e_shoff
> - all segments content
> - all sections content

It also includes the phdrs and the shdrs, without sh_offset.
(bfd/elfcode.h:elf_checksum_contents, rpm/tools/debugedit.c:handle_build_id)

The purpose of the build ID is to uniquely identify the binary created by a
build so that its ID only matches that of a semantically identical binary.

Using only the allocated sections is exactly wrong.  "Semantics" includes
the contents of all ELF sections, not only SHF_ALLOC ones.  Consider:

	$ echo 'main(){} /* war is peace */' > a.c
	$ echo 'main(){} /* kumbaya */' > b.c
	$ gcc -o a -g a.c
	$ gcc -o b -g b.c

The allocated sections of a and b are identical, as are their stripped
versions.  But the important human meaning of the two builds is that they
came from two different sources, written with two very different motivations.

With the .build-id/ directory convention and distro -debuginfo.rpm setup,
it's even possible to do:
	yum install /usr/lib/debug/.build-id/xx/xxxxx...
(e.g. scripted from "eu-readelf -n exe" or "eu-strip -n -e exe" or
"eu-unstrip -n --core core") and go from ID alone all the way to source
sitting in a tree for the package build that produced your binary (or the
binary that produced your core, even if you don't know what that is!).
That leads to learning not only what the program does, but why it's there,
who put it there, and everything about that whole package.  (Someday one
could organize a communal public registry for build IDs of all binaries
that are published or sold, pointing back to their originators.  Anyone
interested in implementing such a site, please contact me offline.)
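
As a rough illustration of the .build-id/ directory convention, here is a
minimal Python sketch that turns a hex build ID (as printed by the tools
mentioned above) into the path form used in the yum command.  The function
name and the "suffix" parameter are my own inventions for illustration; the
".debug" suffix for the separate debuginfo file is how I understand the
Fedora layout, so treat it as an assumption.

```python
def build_id_path(build_id_hex, root="/usr/lib/debug", suffix=""):
    """Map a hex build ID to <root>/.build-id/xx/xxxxx...<suffix>.

    Hypothetical helper: the first two hex digits name the directory,
    the remaining digits name the entry inside it.
    """
    bid = build_id_hex.strip().lower()
    if len(bid) < 4 or len(bid) % 2 or any(c not in "0123456789abcdef" for c in bid):
        raise ValueError("not a hex build ID: %r" % build_id_hex)
    return "%s/.build-id/%s/%s%s" % (root, bid[:2], bid[2:], suffix)


if __name__ == "__main__":
    # Made-up ID, only to show the shape of the result.
    print(build_id_path("ABcd1234"))
    print(build_id_path("abcd1234", suffix=".debug"))
```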

(Incidentally, note that there is a (recent?) bug in ld's computation that
makes it emit identical IDs for the test case above.  At least in Fedora 9's
binutils-2.18.50.0.6-6.fc9.x86_64, that is; I haven't tried building the CVS
trunk lately.  See https://bugzilla.redhat.com/show_bug.cgi?id=472152.
But they should be different because of differing .symtab/.debug_* contents,
and they do end up so in the rpmbuild/debugedit recomputation, and I presume
they also do with gold.)

The only reason to use a checksum-based --build-id method is for repeatable
builds.  If you don't care about that, --build-id=uuid is fine (unique random
bits on every build).  If you repeat the same build with all the same tools
and all the same constituents (sources, libraries, etc.) such that before
--build-id the resulting binary would have been identical from the first
build to the second, then the now-default --build-id scheme doesn't want to
cause that binary to change on every iteration.

I decided to omit the few bits that have no semantic meaning in ELF at all
(e_phoff, e_shoff, sh_offset) from the checksum.  The rationale is that the
linker could change (or e.g. different cross-compile hosts could behave
differently) in trivial ways that would make output files that aren't
byte-for-byte identical but are identical as far as the semantics of ELF are
concerned.  It's not right to ignore e_phnum, e_shnum, or e_shstrndx, because
these do have semantic effects (section indices matter).  I would not really
object to having the checksum e.g. use a canonicalized .shstrtab and
corresponding sh_name values, since rearranging or de-duplicating the section
name strings does not have a semantic effect, but doing that would really be
overkill in the otherwise fairly simple checksum operation.  I don't think
there is anything wrong with the method Ian described, i.e. not omitting any
bits of the output file (just compensating for the build ID bits themselves).
My motivation for ignoring those bits was mild, and can be attributed to
either pedantry or whim, as one chooses to see it.  All that really matters
is the repeatability, meaning that a given tool (in all its cross-host
possibilities) consistently produces the same ID for the same binary.
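
To make the idea concrete, here is a minimal Python sketch of a checksum
that hashes the ELF header, phdrs, shdrs and section contents while zeroing
e_phoff, e_shoff and sh_offset first.  It is not the code in bfd or
debugedit: it assumes ELF64 little-endian input, picks SHA-1 arbitrarily,
does not hash segment contents separately, and does not compensate for the
build ID note bits themselves.  The toy_image helper is hypothetical and
builds a deliberately tiny one-section image, not a valid executable.

```python
import hashlib
import struct

# ELF64 little-endian record layouts.
EHDR = struct.Struct("<16sHHIQQQIHHHHHH")  # Elf64_Ehdr, 64 bytes
PHDR = struct.Struct("<IIQQQQQQ")          # Elf64_Phdr, 56 bytes
SHDR = struct.Struct("<IIQQQQIIQQ")        # Elf64_Shdr, 64 bytes
SHT_NOBITS = 8


def semantic_checksum(image):
    """Hash an ELF64-LE image, ignoring e_phoff, e_shoff and sh_offset."""
    eh = list(EHDR.unpack_from(image, 0))
    phoff, shoff = eh[5], eh[6]
    phnum, shnum = eh[10], eh[12]
    eh[5] = eh[6] = 0                      # omit e_phoff and e_shoff
    h = hashlib.sha1()
    h.update(EHDR.pack(*eh))
    for i in range(phnum):                 # phdrs are hashed unmodified
        start = phoff + i * PHDR.size
        h.update(image[start:start + PHDR.size])
    for i in range(shnum):
        sh = list(SHDR.unpack_from(image, shoff + i * SHDR.size))
        offset, size = sh[4], sh[5]
        sh[4] = 0                          # omit sh_offset
        h.update(SHDR.pack(*sh))
        if sh[1] != SHT_NOBITS:            # NOBITS occupies no file space
            h.update(image[offset:offset + size])
    return h.hexdigest()


def toy_image(payload, pad):
    """Hypothetical helper: header, `pad` filler bytes, payload, one shdr."""
    data_off = EHDR.size + pad
    shoff = data_off + len(payload)
    ident = b"\x7fELF\x02\x01\x01" + bytes(9)
    ehdr = EHDR.pack(ident, 2, 62, 1, 0, 0, shoff, 0,
                     EHDR.size, PHDR.size, 0, SHDR.size, 1, 0)
    shdr = SHDR.pack(0, 1, 0, 0, data_off, len(payload), 0, 0, 1, 0)
    return ehdr + bytes(pad) + payload + shdr


if __name__ == "__main__":
    a = toy_image(b"same contents", pad=0)
    b = toy_image(b"same contents", pad=8)   # same semantics, shifted layout
    print(a != b, semantic_checksum(a) == semantic_checksum(b))
```

The two toy images differ byte-for-byte only in file offsets, so they hash
the same here, while any change to section contents or to fields such as
e_shnum changes the result.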

The reason rpmbuild's debugedit recomputes the build ID is to preserve
repeatability at the granularity of the whole RPM build.  If you do two
different rpmbuild runs in identical environments but with different
_builddir settings, debugedit rewrites the different source directory names
in the DWARF info so that they become identical.  But the original build ID
bits produced by ld are different, because they were computed from DWARF data
containing the different build directory names.  So, debugedit recomputes the
build ID based on the contents of the actual binary that will be in the RPM.
Hence, two runs that produce identical DWARF also have identical build IDs.
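
A toy model of that sequence, in Python, may help.  Everything here is a
stand-in: the "binary" is just a byte string carrying a directory name, the
rewrite is a plain substring replacement rather than real DWARF editing,
and the directory names are made up.

```python
import hashlib


def toy_build_id(binary):
    # Stand-in for a checksum-based build ID over the file contents.
    return hashlib.sha1(binary).hexdigest()


def rewrite_builddir(binary, builddir, destdir):
    # Stand-in for debugedit rewriting source directory names in DWARF.
    return binary.replace(builddir, destdir)


# Two runs in identical environments except for _builddir (made-up paths).
run1 = b"comp_dir=/home/alice/rpmbuild/BUILD/foo-1.0"
run2 = b"comp_dir=/tmp/mock/BUILD/foo-1.0"
dest = b"/usr/src/debug"

out1 = rewrite_builddir(run1, b"/home/alice/rpmbuild/BUILD", dest)
out2 = rewrite_builddir(run2, b"/tmp/mock/BUILD", dest)

if __name__ == "__main__":
    print(toy_build_id(run1) == toy_build_id(run2))  # IDs as ld computed them
    print(toy_build_id(out1) == toy_build_id(out2))  # IDs after recomputation
```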

If you are doing further post-processing of the debuginfo that is intended to
preserve its semantics, then I think this should explicitly NOT change the
build ID.  (As opposed to e.g. rewriting the source directory names as in
debugedit, which is a semantic change to the debuginfo.)  This is a prime
reason why I wanted to make very explicit that once a build ID is "baked"
into a binary, it should be conceived of only as a unique ID and never as a
checksum.  It identifies the build that produced the binary, not the results
of any later transformations made after the fact.  You might e.g. reduce or
translate the DWARF data into a compact form, or later explode that back into
the original DWARF data, or into semantically equivalent DWARF data.  All
that distillation or reprocessing after the fact is just that: after the
fact.  The original build determined the unique ID once and that's exactly
what we want the build ID to mean.  Significantly, after the strip-to-file
step the build ID in the stripped binary is fixed, but we can still do many
proper transformations on the DWARF data in the separate .debug file that's
matched up by its copy of the identical build ID note section (even years
after the stripped binaries have been shipped and installed).
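
For anyone who wants to do that matching by hand, here is a minimal Python
sketch that pulls the build ID out of the raw bytes of an ELF note section
or segment.  It only reads the ID and compares it; nothing is recomputed,
which is the point.  The function name is mine, the caller is assumed to
have already located the note bytes, and 4-byte note alignment is assumed.

```python
import struct

NT_GNU_BUILD_ID = 3  # note type used with owner name "GNU"


def find_build_id(notes, little_endian=True):
    """Return the build ID as hex from raw note bytes, or None if absent."""
    fmt = "<III" if little_endian else ">III"
    pos = 0
    while pos + 12 <= len(notes):
        namesz, descsz, ntype = struct.unpack_from(fmt, notes, pos)
        pos += 12
        name = notes[pos:pos + namesz]
        pos += (namesz + 3) & ~3           # name is padded to 4 bytes
        desc = notes[pos:pos + descsz]
        pos += (descsz + 3) & ~3           # so is the descriptor
        if ntype == NT_GNU_BUILD_ID and name == b"GNU\0":
            return desc.hex()
    return None


if __name__ == "__main__":
    # Hand-built note with a made-up 4-byte ID, standing in for the note
    # carried by both the stripped binary and its .debug file.
    note = struct.pack("<III", 4, 4, NT_GNU_BUILD_ID) + b"GNU\0" + bytes.fromhex("deadbeef")
    stripped_id = find_build_id(note)
    debug_id = find_build_id(note)
    print(stripped_id, stripped_id == debug_id)
```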

(Incidentally, please contact me offline about any debuginfo post-processing
you are working on or thinking of; I'm interested to know the details.)

Thanks,
Roland