hwcaps subdirectory selection in the dynamic loader

Stefan Liebler stli@linux.ibm.com
Tue May 12 15:23:07 GMT 2020

On 5/8/20 8:26 PM, Florian Weimer via Libc-alpha wrote:
> As part of my work on bug 23249, I looked at how the dynamic loader
> finds and selects alternative implementations of shared objects based on
> hardware capabilities (hwcaps).  This message intends to capture my
> understanding of this feature.
> The implementation largely happens via elf/dl-hwcaps.c, dl-procinfo.h,
> elf/dl-load.c, and elf/ldconfig.c, elf/dl-cache.c for ld.so.cache.  On
> typical targets, the kernel provides hardware capability bits via
> the AT_HWCAP auxiliary vector entry, and a platform string AT_PLATFORM.
> # Non-cache lookups
> For non-cache (LD_LIBRARY_PATH) lookups, the dynamic loader needs to
> guess pathnames.  It does not use readdir.  The supported hwcap bits
> (usually supplied by the kernel via AT_HWCAP) are filtered with the
> compile-time mask HWCAP_IMPORTANT.  Each bit corresponds to a
> subdirectory name, as returned by _dl_hwcap_string.  Two fake hwcap bits
> and corresponding subdirectories are added by the loader: the TLS bit with
> the directory name "tls", and the platform bit, with the AT_PLATFORM
> string provided by the kernel as the directory name.  The dynamic loader
> then computes the power set of those directory names.  The full paths
> are constructed by concatenating the subdirectory names of the set bits,
> starting with "tls", the AT_PLATFORM directory, and then the active real
> hwcap bits, going from more significant to less significant bits.  The
> power set is enumerated starting with all bits set, and then proceeds to
> remove bits according to an integer decrementing pattern.
> (Please ignore the NEED_DL_SYSINFO_DSO part in elf/dl-hwcaps.c because
> it is no longer used in practice since the nosegneg removal on i686.)
> There is no sysdeps override for this search path construction.  An
> architecture can only affect how the hwcap bits are computed, to which
> strings individual bits correspond, and what the platform subdirectory is
> called.  The two fake bits (TLS and platform) and the power-set
> construction always apply.
> I'm using s390x as an example now because the situation is fairly simple
> compared to other architectures and I have it around for testing.  I
> think it's broadly representative of what other architectures do.
> On a zEC12 machine with the zarch, ldisp, eimm, dfp bits (but none of the
> vx and later bits), the search paths look like this:
>    tls/zEC12/dfp/eimm/ldisp/zarch
>    tls/zEC12/dfp/eimm/ldisp
>    tls/zEC12/dfp/eimm/zarch
>    tls/zEC12/dfp/eimm
>    tls/zEC12/dfp/ldisp/zarch
>    tls/zEC12/dfp/ldisp
>    tls/zEC12/dfp/zarch
>    tls/zEC12/dfp
>    tls/zEC12/eimm/ldisp/zarch
>    tls/zEC12/eimm/ldisp
>    tls/zEC12/eimm/zarch
>    tls/zEC12/eimm
>    tls/zEC12/ldisp/zarch
>    tls/zEC12/ldisp
>    tls/zEC12/zarch
>    tls/zEC12
>    tls/dfp/eimm/ldisp/zarch
>    tls/dfp/eimm/ldisp
>    tls/dfp/eimm/zarch
>    tls/dfp/eimm
>    tls/dfp/ldisp/zarch
>    tls/dfp/ldisp
>    tls/dfp/zarch
>    tls/dfp
>    tls/eimm/ldisp/zarch
>    tls/eimm/ldisp
>    tls/eimm/zarch
>    tls/eimm
>    tls/ldisp/zarch
>    tls/ldisp
>    tls/zarch
>    tls
>    zEC12/dfp/eimm/ldisp/zarch
>    zEC12/dfp/eimm/ldisp
>    zEC12/dfp/eimm/zarch
>    zEC12/dfp/eimm
>    zEC12/dfp/ldisp/zarch
>    zEC12/dfp/ldisp
>    zEC12/dfp/zarch
>    zEC12/dfp
>    zEC12/eimm/ldisp/zarch
>    zEC12/eimm/ldisp
>    zEC12/eimm/zarch
>    zEC12/eimm
>    zEC12/ldisp/zarch
>    zEC12/ldisp
>    zEC12/zarch
>    zEC12
>    dfp/eimm/ldisp/zarch
>    dfp/eimm/ldisp
>    dfp/eimm/zarch
>    dfp/eimm
>    dfp/ldisp/zarch
>    dfp/ldisp
>    dfp/zarch
>    dfp
>    eimm/ldisp/zarch
>    eimm/ldisp
>    eimm/zarch
>    eimm
>    ldisp/zarch
>    ldisp
>    zarch
> And finally the actual search path entry is searched.  On a z13 machine,
> there would be one more bit (vx), and the platform directory has a
> different name, "z13".  So the first path is
> tls/z13/vx/dfp/eimm/ldisp/zarch, and there are twice as many lookups.
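
For illustration, the enumeration order quoted above can be reproduced with a short Python sketch. This is not the code in elf/dl-hwcaps.c; the function name hwcap_search_dirs is made up for this example, and its input is simply the list of directory names ordered from the most to the least significant bit.

```python
def hwcap_search_dirs(names):
    """Enumerate the power set of `names` in decrementing-integer order.

    `names` is ordered from most to least significant bit, for example
    ["tls", platform, hwcap names...].  The empty set (the plain search
    path entry itself) is not included in the result.
    """
    n = len(names)
    dirs = []
    # Start with all bits set and count down to 1.
    for mask in range((1 << n) - 1, 0, -1):
        parts = [name for i, name in enumerate(names)
                 if mask & (1 << (n - 1 - i))]
        dirs.append("/".join(parts))
    return dirs


zec12 = hwcap_search_dirs(["tls", "zEC12", "dfp", "eimm", "ldisp", "zarch"])
z13 = hwcap_search_dirs(["tls", "z13", "vx", "dfp", "eimm", "ldisp", "zarch"])

print(zec12[0])                # tls/zEC12/dfp/eimm/ldisp/zarch
print(zec12[-1])               # zarch
print(len(zec12), len(z13))    # 63 127
```

Counting the plain search path entry as well, this gives 64 lookups on zEC12 and 128 on z13.
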
> This scheme allows a library developer to require any combination of the
> HWCAP_IMPORTANT bits for an optimized object, by placing it in the
> appropriate subdirectory.  But it does not scale well as more bits are
> added.  There is some path blacklisting in elf/dl-load.c, so this is not
> as bad as it looks here, but the first lookup in a library search path
> entry will consult all the directories (i.e., there is no blacklisting
> of say the tls/ subtree if the tls subdirectory does not exist).
Would it be possible to blacklist the remaining tls/... paths in this case?
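
One way to picture such pruning is sketched below in Python. It is purely hypothetical and does not describe what elf/dl-load.c does today: probe_with_pruning, candidate_dirs and the exists callback (standing in for a stat-like system call) are names invented for this example. Each path prefix is probed at most once, and a missing prefix rules out everything below it.

```python
def candidate_dirs(names):
    """Power set of `names` in the loader's decrementing order (no empty set)."""
    n = len(names)
    return ["/".join(name for i, name in enumerate(names)
                     if mask & (1 << (n - 1 - i)))
            for mask in range((1 << n) - 1, 0, -1)]


def probe_with_pruning(candidates, exists):
    """Return (existing candidate directories, number of `exists` calls).

    Results of `exists` are memoised per prefix, so once "tls" is known
    to be missing, no path below tls/ costs another call.
    """
    known = {}
    calls = 0
    found = []
    for path in candidates:
        parts = path.split("/")
        usable = True
        for depth in range(1, len(parts) + 1):
            prefix = "/".join(parts[:depth])
            if prefix not in known:
                known[prefix] = exists(prefix)
                calls += 1
            if not known[prefix]:
                usable = False
                break
        if usable:
            found.append(path)
    return found, calls


candidates = candidate_dirs(["tls", "zEC12", "dfp", "eimm", "ldisp", "zarch"])
on_disk = {"zEC12"}  # assumption: only the platform directory exists
found, calls = probe_with_pruning(candidates, lambda p: p in on_disk)
print(found, calls, len(candidates))
```

With only a zEC12 directory present, this model needs 10 probes instead of 63. Since every prefix of a candidate is itself a candidate, the number of probes never exceeds the unpruned count in this model.
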
> # Cache lookups
> ldconfig uses a completely different way to locate objects in hwcaps
> subdirectories.  To build the cache, it lists directories, and if in
> those directories, it encounters a name that corresponds to a hwcap
> directory name or a (hard-coded) platform name, it queues this
> subdirectory for later listing, descending further in the tree along
> these paths.  This means that paths like those quoted above are also
> supported by ldconfig, except that it is more lenient and does not
> enforce any particular order on hwcap names.
> Only the second cache format (involving struct file_entry_new) can
> represent libraries in hwcaps subdirectories.  There is a single
> uint64_t field which identifies the implied hardware capabilities.
> Regular hwcap bits are represented as themselves (after converting from
> the subdirectory name to the bit value), and all the bits are OR-ed
> together.  If a platform directory is encountered in the path, a number
> is computed using _dl_string_platform from its name, and this number is
> then used as a fake bit index (outside of the supported real hwcap bits,
> see _DL_FIRST_PLATFORM) to compute another bitmask that is OR-ed into
> the hwcap field in the cache.
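
A minimal Python sketch of this encoding follows. The tables are assumptions made for the example: the bit positions are illustrative, and PLATFORM_BIT is only a stand-in for what _dl_string_platform would return (an index at or above _DL_FIRST_PLATFORM), not values copied from the s390x headers.

```python
# Illustrative tables; the concrete numbers are assumptions for this sketch.
HWCAP_BIT = {"zarch": 1, "ldisp": 4, "eimm": 5, "dfp": 6, "vx": 11}
PLATFORM_BIT = {"zEC12": 53, "z13": 54}  # hypothetical fake bit indices


def path_hwcap(subdir_path):
    """OR together one bit per hwcap or platform directory in the path."""
    value = 0
    for name in subdir_path.split("/"):
        if name in HWCAP_BIT:
            value |= 1 << HWCAP_BIT[name]
        elif name in PLATFORM_BIT:
            value |= 1 << PLATFORM_BIT[name]
        else:
            raise ValueError("not a hwcap or platform directory: " + name)
    return value


a = path_hwcap("zEC12/dfp/zarch")
b = path_hwcap("zarch/dfp/zEC12")
print(hex(a), a == b)
```

Because the bits are simply OR-ed, the order of the directory names does not affect the stored value, which matches the leniency described above.
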
> ldconfig tries to sort entries for the same soname according to some
> heuristic (see the compare function in elf/cache.c): hwcap entries with
> more bits generally come first.
> At run time, the dynamic loader finds all matching path entries for a
> soname in the cache, and then picks the first entry that matches the
> hwcap and platform requirements (see HWCAP_CHECK in elf/dl-cache.c).
> # Discussion
> I think there are a couple of problems with this approach.  One subtle
> problem involves the AT_PLATFORM encoding in the cache file (bug 25938).
> But I think there are other issues.
> The LD_LIBRARY_PATH/non-cache case is rather wasteful in terms of system
> calls, even with the blacklisting in place.
Yes, that's true. And I assume that most of the paths are never used.

> The heuristic for choosing the implementation is not very obvious.  Of
> course, with bitmasks of opaque CPU features, there is no generic
> winner.  For example, on s390x z13, a library in a subdirectory
> ldisp/zarch would be preferred over one in vx because the former has
> more matching hwcap bits and comes earlier in the ld.so.cache sort order
> (but not the LD_LIBRARY_PATH order).  This is counter-intuitive because
> vx (the z13 vector capability) should imply the other capabilities; the
> library was just placed into the wrong directory.
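
The ldisp/zarch versus vx case can be replayed with a small Python model. It is a simplification, not the compare function in elf/cache.c or HWCAP_CHECK itself: the bit positions are illustrative, the library name libfoo.so.1 is made up, and the sort key (number of bits first, then the numeric value) is an assumption about the ordering.

```python
# Illustrative bit positions; assumptions for this sketch.
ZARCH, LDISP, EIMM, DFP, VX = 1 << 1, 1 << 4, 1 << 5, 1 << 6, 1 << 11


def popcount(value):
    return bin(value).count("1")


def sort_entries(entries):
    """Simplified cache order: entries with more hwcap bits come first."""
    return sorted(entries, key=lambda e: (popcount(e[1]), e[1]), reverse=True)


def pick(entries, available):
    """Return the first sorted entry whose required bits are all available."""
    for path, required in sort_entries(entries):
        if (required & ~available) == 0:
            return path
    return None


entries = [
    ("libfoo.so.1", 0),
    ("vx/libfoo.so.1", VX),
    ("ldisp/zarch/libfoo.so.1", LDISP | ZARCH),
]
z13_caps = ZARCH | LDISP | EIMM | DFP | VX
chosen_on_z13 = pick(entries, z13_caps)
print(chosen_on_z13)  # ldisp/zarch/libfoo.so.1
```

In this model the two-bit entry wins on z13 even though the vx build is presumably the better one.
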
> The most tempting choice for such optimizations is the platform
> directory for architectures that have it ("zEC12" in the example above).
> But the problem is that if the system administrator upgrades the machine
> to z13, the directory name would change to "z13", and the optimized code
> would no longer be loaded!  (Presumably, the zEC12-optimized code is
> still better than the generic code on z13.  The same issue would apply
> to z13-optimized code vs z14-optimized code.)  This would be a reason
> not to use AT_PLATFORM from the kernel even on s390x.
> There is another reason to distrust AT_PLATFORM: virtualization.  If
> AT_PLATFORM is set by some sort of machine ID (as on s390x), then it
> might not match the actual hwcap bits available to the guest because
> they are subject to separate knobs.
> The complexity of the trade-offs here suggests to me that we (the GNU
> toolchain as a whole) should try to pre-define names for collections of
> hwcap flags, so that we can get a monotonic progression of features
> under a clearly defined name.  This will allow programmers to optimize
> for subsequent microarchitecture revisions.  So instead of "x86_64" we
> would have pseudo-capabilities like "x86-200", "x86-201", "x86-202" and
> so on, more or less mirroring the "zEC12", "z13" &c platform directories
> on s390x, even though the kernel does not provide such platform names on
> x86-64.  Even on platforms that provide an AT_PLATFORM name, in most
> cases, it would make sense to use *earlier* platform names as a fallback
> (so that a z15 system would also use z14- and z13-optimized libraries if
> available).  This would mean that the dynamic loader would need to know
> more about these relationships.
Even on s390x, it's not really simple. E.g. platform "z13" does not
automatically mean that hwcap "vx" is available (e.g. if you are running
as a zVM guest where an old zVM version does not support vx). But if hwcap
"vx" is available, the platform is at least "z13".

As far as I know, the kernel currently provides "z900" as AT_PLATFORM 
for new unknown machines instead of the latest known platform string, 
e.g. "z15". But there could be hwcap flags for newer machines than 
"z900" (e.g. hwcap "vx"). Would the loader also recognize this and test 
z13 and all the former platforms?
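
To make the question concrete, here is a hypothetical Python sketch of a loader that raises the platform level based on hwcap bits and then falls back to earlier platforms. Nothing like this exists in the loader as far as this thread shows: platform_fallback_dirs and both tables are invented for the example, and PLATFORM_ORDER lists only platform names mentioned in this thread.

```python
# Oldest to newest; only names that appear in this thread (assumption:
# a real table would contain more entries).
PLATFORM_ORDER = ["z900", "zEC12", "z13", "z14", "z15"]

# Hypothetical table: a hwcap that implies at least the given platform.
IMPLIED_BY_HWCAP = {"vx": "z13"}


def platform_fallback_dirs(at_platform, hwcaps):
    """Return platform directories to try, newest first."""
    level = PLATFORM_ORDER.index(at_platform)
    for cap in hwcaps:
        implied = IMPLIED_BY_HWCAP.get(cap)
        if implied is not None:
            level = max(level, PLATFORM_ORDER.index(implied))
    return list(reversed(PLATFORM_ORDER[:level + 1]))


# Unknown new machine: the kernel reports "z900", but hwcap "vx" is set.
unknown_machine = platform_fallback_dirs("z900", {"vx", "dfp"})
print(unknown_machine)  # ['z13', 'zEC12', 'z900']
```

The sketch does not settle the opposite case, where the platform is "z13" but "vx" is missing; a library in a z13 directory that relies on vx would still be selected there.
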
> The current hwcap construction is not really suited to that.
> ld.so.cache is better matched than the LD_LIBRARY_PATH search with its
> mandatory power set construction.  Even aggressive tree pruning will
> still see it make at least one system call per search path entry and
> hwcap.  So I don't think we can use this mechanism for future changes.
> The way we store hwcap bits in ld.so.cache is also not ideal.  It would
> be nice if ldconfig could be hwcap-agnostic, not having to care at all
> about the correspondence between subdirectory name and hwcap bit (or
> AT_PLATFORM pseudo-hwcap bit).  I think I have a way to encode that
> while still maintaining ld.so.cache backwards compatibility (basically,
> set the currently unused bit 62 on those new hwcap entries, so that
> older loaders ignore them because of a missed hwcap requirement).
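
The bit 62 idea can be illustrated with a simplified Python model. The check below is an assumption that reduces the existing loader to "every required bit must be available"; mark_new_style and the payload carried by a new-style entry are hypothetical, since the proposal does not spell them out.

```python
EXTENSION_BIT = 1 << 62  # the "currently unused bit 62" from the proposal


def old_loader_accepts(entry_hwcap, available_hwcap):
    """Simplified model of an existing loader's hwcap check."""
    return (entry_hwcap & ~available_hwcap) == 0


def mark_new_style(payload):
    """Hypothetical helper: tag a cache entry as using the new scheme."""
    return payload | EXTENSION_BIT


ZARCH, DFP = 1 << 1, 1 << 6          # illustrative bit positions
available = ZARCH | DFP              # never contains bit 62
legacy_entry = DFP
new_entry = mark_new_style(DFP)

print(old_loader_accepts(legacy_entry, available))  # True
print(old_loader_accepts(new_entry, available))     # False
```

Because no machine reports bit 62 as available, an older loader treats the new entry as an unmet requirement and skips it.
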
> If we put new hwcap subdirectories under a *single* subdirectory (say
> "glibc-hwcaps"), then we could prune paths more aggressively, and use
> the new scheme in parallel to the old without much impact on performance
> until these subdirectories are actually used.  ldconfig could also treat
> the presence of a glibc-hwcaps subdirectory as an instruction to
> descend into each subdirectory of the glibc-hwcaps directory, but not
> further, and store the names of those subdirectories in ld.so.cache, so
> that the loader can match them at run time.
Does this mean that the LD_LIBRARY_PATH/non-cache case would first try all
directories inside glibc-hwcaps and, if no suitable library is found,
fall back to the current approach?

Are "new hwcaps" also allowed in the current approach or are those only 
allowed in the "glibc-hwcaps" directory?

Is nesting the "new hwcaps" allowed in the "glibc-hwcaps" directory and, if
yes, which heuristic is used for choosing the library? Compare your
example above: "ldisp/zarch" vs "vx".

"store the names of those subdirectories in ld.so.cache, so
that the loader can match them at run time.": This means if a library is 
placed in a new subdirectory without calling ldconfig again, this 
library is not found?
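
For the ldconfig side of the quoted proposal, a minimal Python sketch of the one-level scan might look as follows. It rests on assumptions: scan_glibc_hwcaps is an invented name, the z13 and z14 directory names and the library names are made up, and nested directories are ignored only because the proposal says "but not further".

```python
import os
import tempfile


def scan_glibc_hwcaps(libdir):
    """Map each direct subdirectory of glibc-hwcaps to the files in it."""
    root = os.path.join(libdir, "glibc-hwcaps")
    result = {}
    if not os.path.isdir(root):
        return result  # scheme unused in this directory
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if entry.is_dir():
            # One level only: directories below `entry` are not visited.
            result[entry.name] = sorted(
                f.name for f in os.scandir(entry.path) if f.is_file())
    return result


with tempfile.TemporaryDirectory() as libdir:
    for rel in ("glibc-hwcaps/z13/libfoo.so.1",
                "glibc-hwcaps/z14/libfoo.so.1",
                "glibc-hwcaps/z14/nested/libbar.so.1"):
        path = os.path.join(libdir, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    cache = scan_glibc_hwcaps(libdir)
    empty = scan_glibc_hwcaps(os.path.join(libdir, "glibc-hwcaps", "z13"))

print(cache)  # {'z13': ['libfoo.so.1'], 'z14': ['libfoo.so.1']}
```

In this model a library dropped into a new subdirectory only becomes visible to cache lookups after the scan is repeated.
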
> In any case, I do not see a way to make good progress on bug 23249 (the
> "haswell" platform subdirectory issue on various x86-64 variants)
> without tackling some of these issues.
> Thoughts?
> Thanks,
> Florian
