This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
IFUNC resolver scheduling (bugs 21041, 20019)
- From: Florian Weimer <fweimer at redhat dot com>
- To: GNU C Library <libc-alpha at sourceware dot org>
- Cc: "Carlos O'Donell" <carlos at redhat dot com>
- Date: Wed, 25 Jan 2017 14:57:43 +0100
- Subject: IFUNC resolver scheduling (bugs 21041, 20019)
I tried to fix most of the issues involving IFUNC resolvers and their
access to unrelocated symbols.
My first attempt (on branch fw/bug21041) records all relocations against
IFUNC resolvers during elf_machine_rela processing, and adds another
pass to apply them.
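A minimal sketch of the two-pass idea. It assumes a simplified relocation record, not glibc's real `ElfW(Rela)` or `struct link_map`. The first pass applies ordinary relocations immediately and records IFUNC relocations, and the second pass calls their resolvers. All names (`struct toy_reloc`, `delay_list`, `first_pass`, `second_pass`) are hypothetical. `malloc` is used only for brevity; the message explains why the real code needs a dedicated allocator. Records are appended at the tail, so the delayed relocations keep their original order.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified relocation (not glibc's ElfW(Rela)).  */
typedef uintptr_t (*toy_resolver) (void);

struct toy_reloc
{
  uintptr_t *target;       /* Where the relocated value is stored.  */
  uintptr_t value;         /* Value for ordinary relocations.  */
  toy_resolver resolver;   /* Non-NULL for IFUNC relocations.  */
};

/* Singly linked delay list; tail insertion preserves order.  */
struct delayed
{
  const struct toy_reloc *reloc;
  struct delayed *next;
};

struct delay_list
{
  struct delayed *head;
  struct delayed **tail;
};

static void
delay_list_init (struct delay_list *l)
{
  l->head = NULL;
  l->tail = &l->head;
}

/* First pass: apply ordinary relocations and record IFUNC ones.
   Returns 0 on success, -1 on allocation failure.  */
static int
first_pass (const struct toy_reloc *relocs, size_t n, struct delay_list *l)
{
  for (size_t i = 0; i < n; ++i)
    {
      if (relocs[i].resolver == NULL)
        {
          *relocs[i].target = relocs[i].value;
          continue;
        }
      struct delayed *d = malloc (sizeof *d);
      if (d == NULL)
        return -1;
      d->reloc = &relocs[i];
      d->next = NULL;
      *l->tail = d;
      l->tail = &d->next;
    }
  return 0;
}

/* Second pass: run the resolvers in recorded order and free the records.  */
static void
second_pass (struct delay_list *l)
{
  struct delayed *d = l->head;
  while (d != NULL)
    {
      struct delayed *next = d->next;
      *d->reloc->target = d->reloc->resolver ();
      free (d);
      d = next;
    }
  delay_list_init (l);
}
```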
The general approach is to execute IFUNC resolvers
deepest-dependency-first (based on the object in which the IFUNC
resolver resides, not the relocation which references it). Most of the
time, this ensures that IFUNC resolvers do not encounter unrelocated
symbols because IFUNC resolvers from their dependencies have already
executed, patching in the missing relocation. Symbol interposition can
still bypass that, though.
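One way to get a dependency-first order is a post-order depth-first walk of the dependency graph. An object is emitted only after all of its dependencies, so its IFUNC resolvers would run after theirs. This is a sketch using a hypothetical `struct toy_obj` with an explicit dependency array, not glibc's `link_map` or its sorting code. As noted above, symbol interposition can still bypass this order.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object: a name plus its direct dependencies.  */
struct toy_obj
{
  const char *name;
  struct toy_obj **deps;
  size_t ndeps;
  bool visited;
};

/* Append OBJ to ORDER after all of its dependencies (post-order DFS).
   The visited flag is set on entry, so dependency cycles terminate,
   although objects in a cycle cannot all precede each other.  */
static void
visit (struct toy_obj *obj, struct toy_obj **order, size_t *count)
{
  if (obj->visited)
    return;
  obj->visited = true;
  for (size_t i = 0; i < obj->ndeps; ++i)
    visit (obj->deps[i], order, count);
  order[(*count)++] = obj;
}

/* Compute the order in which the objects' IFUNC resolvers would run.
   ORDER must have room for every reachable object.  Returns the count.  */
static size_t
resolver_order (struct toy_obj *main_obj, struct toy_obj **order)
{
  size_t count = 0;
  visit (main_obj, order, &count);
  return count;
}
```

For a program that depends on `libfoo.so` and `libc.so.6`, where `libfoo.so` also depends on `libc.so.6`, the walk yields libc first, then libfoo, then the program.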
The branch is currently slightly buggy because it reverses the order of
relocation processing within the same object, and it does not delay copy
relocations, so those could copy unrelocated data. (Support for copy
relocations whose source address is determined by an IFUNC resolver is
missing as well, but this is a separate matter and quite easy to fix.)
Support for DT_TEXTREL is currently missing, too. This needs another
preparatory patch like the one for delayed RELRO activation. (For
dlopen, the RELRO activation delay does a bit too much work because it
mprotects everything again, not just the newly allocated objects.)
One problem is that recording the relocations requires temporary memory
allocations during relocation. I have an mmap-based pool allocator for
that (currently very specific to the task at hand, but I will have to
generalize it a bit). The allocations are required so that we do not
have to walk through large parts of the relocation list in the second
pass. The number of delayed relocations is significant: around 100 for
simple binaries, and more than 1,000 for complex ones. The current
delayed relocation record contains six words, so we are talking about
several KiB even for small binaries. I'm not convinced it makes sense
to add a large static stack allocation (like scratch_buffer does) to
avoid calling mmap because the stack memory remains dirty and might not
be reused by some programs. Adding at least one mmap and munmap call to
the start-up sequence is likely visible in benchmarks, though.
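For scale, assume 8-byte words. A six-word record is then 48 bytes, so 100 delayed relocations take 4,800 bytes (about 4.7 KiB) and 1,000 take 48,000 bytes (about 47 KiB). The sketch below is a generic mmap-backed bump-pointer pool, not the allocator on the branch. The 64 KiB chunk size and all names (`toy_pool`, `pool_alloc`, `pool_free_all`, `struct delayed_record`) are assumptions. Chunks are linked so that everything can be released with `munmap` once relocation is complete.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical six-word delayed relocation record.  */
struct delayed_record
{
  uintptr_t words[6];
};

/* Each mapping starts with a header linking it to the previous one.  */
struct pool_chunk
{
  struct pool_chunk *prev;
};

struct toy_pool
{
  struct pool_chunk *current;
  char *next;                  /* Bump pointer.  */
  char *end;
};

#define POOL_CHUNK_SIZE ((size_t) 64 * 1024)
#define POOL_HEADER ((sizeof (struct pool_chunk) + 15) & ~(size_t) 15)

/* Return SIZE bytes with 16-byte alignment, or NULL on failure.  */
static void *
pool_alloc (struct toy_pool *pool, size_t size)
{
  if (size > POOL_CHUNK_SIZE - POOL_HEADER)
    return NULL;                       /* Larger than one chunk payload.  */
  size = (size + 15) & ~(size_t) 15;
  if (pool->current == NULL || (size_t) (pool->end - pool->next) < size)
    {
      /* Start a new chunk; the tail of the old one is simply wasted.  */
      void *p = mmap (NULL, POOL_CHUNK_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (p == MAP_FAILED)
        return NULL;
      struct pool_chunk *chunk = p;
      chunk->prev = pool->current;
      pool->current = chunk;
      pool->next = (char *) p + POOL_HEADER;
      pool->end = (char *) p + POOL_CHUNK_SIZE;
    }
  void *result = pool->next;
  pool->next += size;
  return result;
}

/* Unmap every chunk; individual records are never freed.  */
static void
pool_free_all (struct toy_pool *pool)
{
  struct pool_chunk *chunk = pool->current;
  while (chunk != NULL)
    {
      struct pool_chunk *prev = chunk->prev;
      munmap (chunk, POOL_CHUNK_SIZE);
      chunk = prev;
    }
  pool->current = NULL;
  pool->next = pool->end = NULL;
}
```

With these assumptions, a small binary needs one `mmap`/`munmap` pair. The number of system calls grows only with the number of 64 KiB chunks, not with the number of records.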
The combreloc feature sorts relocations of the same type together. The
static linker could sort relocations involving IFUNCs together, and we
could recognize this in the dynamic linker and record spans of
relocations instead of individual relocations. This would reduce the
number of delayed relocation records we'd have to allocate.
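A sketch of recording spans: each maximal run of consecutive IFUNC relocations becomes one (start, count) record. The `is_ifunc` flag array is a hypothetical stand-in for inspecting each relocation's type and target symbol. `struct reloc_span` and `collect_spans` are likewise illustrative names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical span record: a run of consecutive IFUNC relocations.  */
struct reloc_span
{
  size_t start;
  size_t count;
};

/* Collect maximal runs of IFUNC relocations into SPANS (capacity MAX).
   Returns the number of spans, or (size_t) -1 if MAX is too small.  */
static size_t
collect_spans (const bool *is_ifunc, size_t n,
               struct reloc_span *spans, size_t max)
{
  size_t nspans = 0;
  size_t i = 0;
  while (i < n)
    {
      if (!is_ifunc[i])
        {
          ++i;
          continue;
        }
      size_t start = i;
      while (i < n && is_ifunc[i])
        ++i;
      if (nspans == max)
        return (size_t) -1;
      spans[nspans].start = start;
      spans[nspans].count = i - start;
      ++nspans;
    }
  return nspans;
}
```

If the static linker placed all IFUNC relocations of an object next to each other, this would produce a single record per object, whether there are 100 or 1,000 such relocations.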
However, the current dynamic linker architecture makes this difficult:
We do not exploit the combreloc optimization very well; we dispatch on
the next relocation type (with a large switch statement) even though it
is extremely likely that the type is the same as the current type. If
we change that, recognizing spans would become easier, too.
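The sketch below shows the loop shape the paragraph suggests. It finds the run of relocations that share one type and evaluates the type switch once per run, not once per relocation. The relocation types (`TOY_R_RELATIVE`, `TOY_R_GLOB_DAT`), `struct toy_rel`, and `dispatch_count` are hypothetical and are not real ELF `R_*` constants or glibc code.

```c
#include <stddef.h>
#include <stdint.h>

enum toy_type { TOY_R_RELATIVE, TOY_R_GLOB_DAT };

struct toy_rel
{
  enum toy_type type;
  uintptr_t *target;
  uintptr_t value;
};

/* Counts how often the type switch is evaluated, for illustration.  */
static size_t dispatch_count;

/* Process RELOCS in runs of equal type; one switch per run.  */
static void
apply_in_runs (const struct toy_rel *relocs, size_t n, uintptr_t base)
{
  size_t i = 0;
  while (i < n)
    {
      /* Find the end of the run that shares relocs[i].type.  */
      size_t end = i + 1;
      while (end < n && relocs[end].type == relocs[i].type)
        ++end;
      ++dispatch_count;
      switch (relocs[i].type)
        {
        case TOY_R_RELATIVE:
          for (size_t j = i; j < end; ++j)
            *relocs[j].target = base + relocs[j].value;
          break;
        case TOY_R_GLOB_DAT:
          for (size_t j = i; j < end; ++j)
            *relocs[j].target = relocs[j].value;
          break;
        }
      i = end;
    }
}
```

Finding the end of a run still compares each type once, but that comparison is cheap, and the per-type inner loops can be specialized. The run boundaries are also the spans of IFUNC relocations discussed above, which is why this restructuring would make recognizing them easier.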
Apart from the performance impact and complexity, there is another issue
I worry about: What happens if an IFUNC-based relocation is subject to a
copy relocation? Assuming we delay IFUNC relocations and copy
relocations and execute them in that order, the data produced by the
copy relocation is correct because the IFUNC relocation has been
performed before the copy is made. But what happens if the
copy-relocated symbol is accessed by the IFUNC resolver? I think it
will touch the object at its final location, and that has not been
relocated yet. I suppose this is similar to the dependency bypass due
to function interposition. Maybe it's sufficient to document this; it
is certainly easier to avoid than the current restriction on accessing
*any* relocated data. This looks like a circular dependency which is
impossible to resolve in general (and rather expensive to detect).
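A toy model of the scenario, not dynamic linker code. Plain assignments stand in for the relocations, and all names are hypothetical. It illustrates both halves of the argument. When delayed IFUNC relocations run before delayed copy relocations, the copy receives the resolved value. A resolver that reads the copy-relocated object, however, sees its final location before the copy has been made.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical data object containing a value set by an IFUNC
   relocation.  */
struct toy_data
{
  uintptr_t fnptr;
};

static struct toy_data lib_object;   /* Copy-relocation source (library).  */
static struct toy_data prog_object;  /* Final location (main program).  */

/* What the resolver observed when it read the copy-relocated object.  */
static uintptr_t seen_by_resolver;

/* A resolver that (unwisely) reads the copy-relocated object, which
   now lives at its final location in the main program.  */
static uintptr_t
toy_resolver (void)
{
  seen_by_resolver = prog_object.fnptr;
  return 0x1234;
}

/* Delayed processing: IFUNC relocations first, then copy relocations.  */
static void
run_delayed (void)
{
  lib_object.fnptr = toy_resolver ();                       /* IFUNC.  */
  memcpy (&prog_object, &lib_object, sizeof prog_object);   /* Copy.  */
}
```

After `run_delayed`, `prog_object.fnptr` holds the resolved value, but `seen_by_resolver` is still zero. That is the unrelocated state the resolver would observe.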
On the positive side, nothing in this affects lazy binding and _dl_fixup.
In retrospect, adding IFUNCs as a user-visible feature looks like a
mistake, but now we have to live with them.
If no one has a better idea, I'll generalize my mmap-based
bump-pointer/pool allocator, use that to record copy relocations as
well, and maybe try to squeeze out a few bytes here and there.
Comments?
Thanks,
Florian