hwcaps subdirectory selection in the dynamic loader

Florian Weimer fweimer@redhat.com
Fri May 8 18:26:44 GMT 2020


As part of my work on bug 23249, I looked at how the dynamic loader
finds and selects alternative implementations of shared objects based on
hardware capabilities (hwcaps).  This message intends to capture my
understanding of this feature.

The implementation is spread across elf/dl-hwcaps.c, dl-procinfo.h, and
elf/dl-load.c, plus elf/ldconfig.c and elf/dl-cache.c for ld.so.cache.
On typical targets, the kernel provides hardware capability bits via the
AT_HWCAP auxiliary vector entry, and a platform string via AT_PLATFORM.

# Non-cache lookups

For non-cache (LD_LIBRARY_PATH) lookups, the dynamic loader needs to
guess pathnames.  It does not use readdir.  The supported hwcap bits
(usually supplied by the kernel via AT_HWCAP) are filtered with the
compile-time mask HWCAP_IMPORTANT.  Each bit corresponds to a
subdirectory name, as returned by _dl_hwcap_string.  Two fake hwcap bits
and corresponding subdirectories are added by the loader: the TLS bit with
the directory name "tls", and the platform bit, with the AT_PLATFORM
string provided by the kernel as the directory name.  The dynamic loader
then computes the power set of those directory names.  The full paths
are constructed by concatenating the subdirectory names of the set bits,
starting with "tls", the AT_PLATFORM directory, and then the active real
hwcap bits, going from more significant to less significant bits.  The
power set is enumerated starting with all bits set, and then proceeds to
remove bits according to an integer decrementing pattern.
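
The enumeration order can be modelled in a few lines.  This is only a
sketch of the behaviour described above, not the code in
elf/dl-hwcaps.c; the function name hwcap_search_subdirs is made up for
illustration, and the caller is assumed to pass the directory names
already ordered as "tls", platform, then real hwcap bits from more
significant to less significant.

```python
def hwcap_search_subdirs(names):
    """Enumerate hwcap subdirectory paths in lookup order.

    NAMES holds one directory name per active bit, most significant
    first: the fake "tls" bit, the fake platform bit, then the real
    hwcap bits.  (Hypothetical helper, not a glibc function.)
    """
    n = len(names)
    paths = []
    # Count down from "all bits set" to 1.  Mask 0 is the plain search
    # path entry itself, which is tried last and is not a subdirectory.
    for mask in range((1 << n) - 1, 0, -1):
        # Bit n-1 of the mask selects names[0], bit 0 selects names[-1].
        parts = [name for i, name in enumerate(names)
                 if mask & (1 << (n - 1 - i))]
        paths.append("/".join(parts))
    return paths
```

For n directory names this yields 2^n - 1 subdirectories per search
path entry, so every additional bit roughly doubles the number of
lookups.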

(Please ignore the NEED_DL_SYSINFO_DSO part in elf/dl-hwcaps.c because
it is no longer used in practice since the nosegneg removal on i686.)

There is no sysdeps override for this search path construction.  An
architecture can only affect how the hwcap bits are computed, to which
strings individual bits correspond, and what the platform subdirectory is
called.  The two fake bits (TLS and platform) and the power-set
construction always apply.

I'm using s390x as an example now because the situation is fairly simple
compared to other architectures and I have it around for testing.  I
think it's broadly representative of what other architectures do.

On a zEC12 machine with the zarch, ldisp, eimm, dfp bits (but none of the
vx and later bits), the search paths look like this:

  tls/zEC12/dfp/eimm/ldisp/zarch
  tls/zEC12/dfp/eimm/ldisp
  tls/zEC12/dfp/eimm/zarch
  tls/zEC12/dfp/eimm
  tls/zEC12/dfp/ldisp/zarch
  tls/zEC12/dfp/ldisp
  tls/zEC12/dfp/zarch
  tls/zEC12/dfp
  tls/zEC12/eimm/ldisp/zarch
  tls/zEC12/eimm/ldisp
  tls/zEC12/eimm/zarch
  tls/zEC12/eimm
  tls/zEC12/ldisp/zarch
  tls/zEC12/ldisp
  tls/zEC12/zarch
  tls/zEC12
  tls/dfp/eimm/ldisp/zarch
  tls/dfp/eimm/ldisp
  tls/dfp/eimm/zarch
  tls/dfp/eimm
  tls/dfp/ldisp/zarch
  tls/dfp/ldisp
  tls/dfp/zarch
  tls/dfp
  tls/eimm/ldisp/zarch
  tls/eimm/ldisp
  tls/eimm/zarch
  tls/eimm
  tls/ldisp/zarch
  tls/ldisp
  tls/zarch
  tls
  zEC12/dfp/eimm/ldisp/zarch
  zEC12/dfp/eimm/ldisp
  zEC12/dfp/eimm/zarch
  zEC12/dfp/eimm
  zEC12/dfp/ldisp/zarch
  zEC12/dfp/ldisp
  zEC12/dfp/zarch
  zEC12/dfp
  zEC12/eimm/ldisp/zarch
  zEC12/eimm/ldisp
  zEC12/eimm/zarch
  zEC12/eimm
  zEC12/ldisp/zarch
  zEC12/ldisp
  zEC12/zarch
  zEC12
  dfp/eimm/ldisp/zarch
  dfp/eimm/ldisp
  dfp/eimm/zarch
  dfp/eimm
  dfp/ldisp/zarch
  dfp/ldisp
  dfp/zarch
  dfp
  eimm/ldisp/zarch
  eimm/ldisp
  eimm/zarch
  eimm
  ldisp/zarch
  ldisp
  zarch

And finally the actual search path entry is searched.  On a z13 machine,
there would be one more bit (vx), and the platform directory has a
different name, "z13".  So the first path is
tls/z13/vx/dfp/eimm/ldisp/zarch, and there are twice as many lookups.

This scheme allows a library developer to require any combination of the
HWCAP_IMPORTANT bits for an optimized object, by placing it in the
appropriate subdirectory.  But it does not scale well as more bits are
added.  There is some path blacklisting in elf/dl-load.c, so this is not
as bad as it looks here, but the first lookup in a library search path
entry will consult all the directories (i.e., there is no blacklisting
of say the tls/ subtree if the tls subdirectory does not exist).

# Cache lookups

ldconfig uses a completely different way to locate objects in hwcaps
subdirectories.  To build the cache, it lists directories, and if in
those directories, it encounters a name that corresponds to a hwcap
directory name or a (hard-coded) platform name, it queues this
subdirectory for later listing, descending further in the tree along
these paths.  This means that paths like those quoted above are also
supported by ldconfig, except that it is more lenient and does not
enforce any particular order on hwcap names.

Only the second cache format (involving struct file_entry_new) can
represent libraries in hwcaps subdirectories.  There is a single
uint64_t field which identifies the implied hardware capabilities.
Regular hwcap bits are represented as themselves (after converting from
the subdirectory name to the bit value), and all the bits are OR-ed
together.  If a platform directory is encountered in the path, a number
is computed using _dl_string_platform from its name, and this number is
then used as a fake bit index (outside of the supported real hwcap bits,
see _DL_FIRST_PLATFORM) to compute another bitmask that is OR-ed into
the hwcap field in the cache.
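
To make the encoding concrete, here is a rough model of how the hwcap
field could be derived from a subdirectory path.  All the tables and
constants are invented for illustration (the real bit positions come
from dl-procinfo.h and differ per architecture), and I am assuming that
the platform number is _DL_FIRST_PLATFORM plus an index into a platform
list.

```python
# Hypothetical tables; real values are architecture-specific.
HWCAP_BIT_INDEX = {"zarch": 0, "ldisp": 1, "dfp": 2}  # name -> bit index
PLATFORMS = ["zEC12", "z13"]      # stand-in for the hard-coded name list
FIRST_PLATFORM = 48               # stand-in for _DL_FIRST_PLATFORM


def cache_hwcap_field(subdir_path):
    """Compute the 64-bit hwcap field implied by a hwcap subdirectory path."""
    field = 0
    for component in subdir_path.split("/"):
        if component in HWCAP_BIT_INDEX:
            # Regular hwcap bits are represented as themselves.
            field |= 1 << HWCAP_BIT_INDEX[component]
        elif component in PLATFORMS:
            # Platform names become a fake bit above the real hwcap bits.
            field |= 1 << (FIRST_PLATFORM + PLATFORMS.index(component))
        else:
            raise ValueError("not a hwcap or platform name: " + component)
    return field
```

Because the bits are simply OR-ed together, "dfp/zarch" and "zarch/dfp"
produce the same value in this model, which matches ldconfig not
enforcing any particular order on hwcap names.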

ldconfig tries to sort entries for the same soname according to some
heuristic (see the compare function in elf/cache.c): hwcap entries with
more bits generally come first.

At run time, the dynamic loader finds all matching path entries for a
soname in the cache, and then picks the first entry that matches the
hwcap and platform requirements (see HWCAP_CHECK in elf/dl-cache.c).
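
A simplified model of that run-time check might look as follows.  This
is my reading of the logic, not a transcription of HWCAP_CHECK: the
names CacheEntry, entry_matches and pick_entry are made up, and the
platform mask (pseudo-bits 48..51) is a hypothetical layout.

```python
from dataclasses import dataclass

# Hypothetical layout: real hwcap bits below bit 48, platform pseudo-bits
# in bits 48..51.  The actual constants are architecture-specific.
PLATFORM_MASK = 0xF << 48


@dataclass
class CacheEntry:
    path: str
    hwcap: int  # value of the 64-bit hwcap field stored in the cache


def entry_matches(entry, supported_hwcap, platform_bits):
    """Return True if the running system satisfies the entry's hwcap field."""
    real_bits = entry.hwcap & ~PLATFORM_MASK
    if real_bits & ~supported_hwcap:
        return False          # entry needs a hwcap bit the CPU lacks
    required_platform = entry.hwcap & PLATFORM_MASK
    if required_platform and required_platform != platform_bits:
        return False          # entry is tied to a different platform
    return True


def pick_entry(entries, supported_hwcap, platform_bits):
    """Return the path of the first matching entry, in cache order."""
    for entry in entries:
        if entry_matches(entry, supported_hwcap, platform_bits):
            return entry.path
    return None
```

The important property is that the loader does no ranking of its own
here: whatever order ldconfig wrote into the cache decides which
matching entry wins.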

# Discussion

I think there are a couple of problems with this approach.  One subtle
problem involves the AT_PLATFORM encoding in the cache file (bug 25938).
But I think there are other issues.

The LD_LIBRARY_PATH/non-cache case is rather wasteful in terms of system
calls, even with the blacklisting in place.

The heuristic for choosing the implementation is not very obvious.  Of
course, with bitmasks of opaque CPU features, there is no generic
winner.  For example, on s390x z13, a library in a subdirectory
ldisp/zarch would be preferred over one in vx because the former has
more matching hwcap bits and comes earlier in the ld.so.cache sort order
(but not the LD_LIBRARY_PATH order).  This is counter-intuitive because
vx (the z13 vector capability) should imply the other capabilities—the
library was just placed into the wrong directory.
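
The difference between the two orders can be reproduced with a small
model.  The masks below are positions in the z13 search order from the
earlier example, not real AT_HWCAP values, and the tie-break in
cache_order (larger mask first) is an assumption on my part.

```python
# Directory names in z13 search order, most significant first.
NAMES = ["tls", "z13", "vx", "dfp", "eimm", "ldisp", "zarch"]


def mask_of(subdir):
    """Map a path such as "ldisp/zarch" to its bitmask in search order."""
    n = len(NAMES)
    return sum(1 << (n - 1 - NAMES.index(c)) for c in subdir.split("/"))


def cache_order(subdirs):
    """Model of the ld.so.cache sort: more hwcap bits first."""
    return sorted(subdirs,
                  key=lambda s: (bin(mask_of(s)).count("1"), mask_of(s)),
                  reverse=True)


def path_order(subdirs):
    """Model of the LD_LIBRARY_PATH order: decrementing bitmask."""
    return sorted(subdirs, key=mask_of, reverse=True)
```

With the candidates "vx" and "ldisp/zarch", cache_order puts
"ldisp/zarch" first (two bits against one), while path_order puts "vx"
first (its single bit is more significant).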

The most tempting choice for such optimizations is the platform
directory for architectures that have it ("zEC12" in the example above).
But the problem is that if the system administrator upgrades the machine
to z13, the directory name would change to "z13", and the optimized code
would no longer be loaded!  (Presumably, the zEC12-optimized code is
still better than the generic code on z13.  The same issue would apply
to z13-optimized code vs z14-optimized code.)  This would be a reason
not to use AT_PLATFORM from the kernel even on s390x.

There is another reason to distrust AT_PLATFORM: virtualization.  If
AT_PLATFORM is set by some sort of machine ID (as on s390x), then it
might not match the actual hwcap bits available to the guest because
they are subject to separate knobs.

The complexity of the trade-offs here suggests to me that we (the GNU
toolchain as a whole) should try to pre-define names for collections of
hwcap flags, so that we can get a monotonic progression of features
under a clearly defined name.  This will allow programmers to optimize
for subsequent microarchitecture revisions.  So instead of "x86_64" we
would have pseudo-capabilities like "x86-200", "x86-201", "x86-202" and
so on, more or less mirroring the "zEC12", "z13" &c platform directories
on s390x, even though the kernel does not provide such platform names on
x86-64.  Even on platforms that provide an AT_PLATFORM name, in most
cases, it would make sense to use *earlier* platform names as a fallback
(so that a z15 system would also use z14- and z13-optimized libraries if
available).  This would mean that the dynamic loader would need to know
more about these relationships.
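
As a minimal sketch of what "knowing about these relationships" could
mean: the loader would carry an ordered list of level names and try the
current level first, then every earlier one.  The LEVELS table and the
function name are hypothetical; nothing like this exists in the loader
today.

```python
# Hypothetical ordered table of platform levels, oldest first.
LEVELS = ["zEC12", "z13", "z14", "z15"]


def fallback_dirs(current_level):
    """Return directory names to try, best match first.

    A system at CURRENT_LEVEL can also use libraries built for any
    earlier level, so walk the table backwards from the current entry.
    Unknown levels yield no hwcap directories at all.
    """
    if current_level not in LEVELS:
        return []
    idx = LEVELS.index(current_level)
    return LEVELS[idx::-1]
```

On a z15 system this gives z15, z14, z13, zEC12, which is a linear
number of lookups instead of a power set.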

The current hwcap construction is not really suited to that.
ld.so.cache is better matched than the LD_LIBRARY_PATH search with its
mandatory power set construction.  Even aggressive tree pruning will
still see it make at least one system call per search path entry and
hwcap.  So I don't think we can use this mechanism for future changes.

The way we store hwcap bits in ld.so.cache is also not ideal.  It would
be nice if ldconfig could be hwcap-agnostic, not having to care at all
about the correspondence between subdirectory name and hwcap bit (or
AT_PLATFORM pseudo-hwcap bit).  I think I have a way to encode that
while still maintaining ld.so.cache backwards compatibility (basically,
set the currently unused bit 62 on those new hwcap entries, so that
older loaders ignore them because of a missed hwcap requirement).
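
The compatibility argument can be illustrated like this.  The model
assumes that an existing loader rejects any cache entry whose hwcap
field has a bit set outside the mask it knows how to satisfy; the
function name and the example mask are made up.

```python
NEW_SCHEME_BIT = 1 << 62   # the currently unused bit proposed above


def old_loader_accepts(entry_hwcap, known_mask):
    """Model of an existing loader: reject entries with unknown bits.

    KNOWN_MASK stands for every bit the old loader can satisfy (real
    hwcap bits plus the platform and TLS pseudo-bits).  Bit 62 is never
    part of it, so new-style entries are skipped automatically.
    """
    return (entry_hwcap & ~known_mask) == 0
```

An entry written as NEW_SCHEME_BIT combined with any other payload is
therefore invisible to old loaders, while a new loader could test for
the bit first and interpret the rest of the entry differently.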

If we put new hwcap subdirectories under a *single* subdirectory (say
"glibc-hwcaps"), then we could prune paths more aggressively, and use
the new scheme in parallel to the old without much impact on performance
until these subdirectories are actually used.  ldconfig could also treat
the presence of a glibc-hwcaps subdirectory as an instruction to
descend into each subdirectory of the glibc-hwcaps directory, but not
further, and store the names of those subdirectories in ld.so.cache, so
that the loader can match them at run time.
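
A sketch of what such a scan could look like on the ldconfig side
follows.  It is a proposal-level illustration only: the function name
is invented, and the cache format for storing the names is left out
entirely.

```python
import os


def scan_glibc_hwcaps(libdir):
    """List (subdirectory name, file name) pairs under libdir/glibc-hwcaps.

    Only the immediate subdirectories of glibc-hwcaps are examined; the
    scan does not descend any further.  The names are recorded as plain
    strings, with no mapping to hwcap bits.
    """
    root = os.path.join(libdir, "glibc-hwcaps")
    found = []
    if not os.path.isdir(root):
        return found   # nothing to do if the directory is absent
    for name in sorted(os.listdir(root)):
        subdir = os.path.join(root, name)
        if not os.path.isdir(subdir):
            continue
        for fname in sorted(os.listdir(subdir)):
            # Regular files only; nested directories are ignored.
            if os.path.isfile(os.path.join(subdir, fname)):
                found.append((name, fname))
    return found
```

The early return is the pruning mentioned above: a library directory
without a glibc-hwcaps subdirectory costs a single failed lookup.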

In any case, I do not see a way to make good progress on bug 23249 (the
"haswell" platform subdirectory issue on various x86-64 variants)
without tackling some of these issues.

Thoughts?

Thanks,
Florian


