This is the mail archive of the
mailing list for the binutils project.
Re: DT_GNU_HASH: ~ 50% dynamic linking improvement
- From: Michael Meeks <michael dot meeks at novell dot com>
- To: Jakub Jelinek <jakub at redhat dot com>
- Cc: binutils at sources dot redhat dot com, libc-alpha at sources dot redhat dot com, Ulrich Drepper <drepper at redhat dot com>
- Date: Thu, 29 Jun 2006 19:27:06 +0100
- Subject: Re: DT_GNU_HASH: ~ 50% dynamic linking improvement
- References: <20060628170900.GX3823@sunsite.mff.cuni.cz>
- Reply-to: michael dot meeks at novell dot com
On Wed, 2006-06-28 at 19:09 +0200, Jakub Jelinek wrote:
> The following patches introduce an optional ELF hash section
Dudie; you rock ! - thanks for taking this on board; this is really
nice :-) and here was I thinking I was going to have to write a length
paper to try to convince Ulrich :-) You made my month (really).
> The initial design was done by Ulrich Drepper and was discussed with
> Michael Meeks a few months ago as well. But nothing came off of these
> discussions and the proposed approach is quite different.
This looks rather like -zdynsort + -zhash-vals prototype; but I'm too
pleased to quibble :-)
Anyhow - I've been thinking about various other optimisations, many of
which we can easily & neatly fit into this nice framework, I've numbered
them to help chewing them over:
+ I want to see a multiply per lookup.
+ ie. since the L2 cache misses are what hammer us -
if [sic] we want to extend the .chain -Bdirect
support in future - being able to put that in
the same record is really important.
+ a 'record size' field gives us that with ~no
bloat as of now.
+ [ and I really think some form of direct linking
is the future - but we can discuss that later
+ ie. having an extensibility story that is better
than 'just add a new section [ somewhere miles
away in memory/not-in-cache ] is not so fun.
2. chain walking:
+ the 'not having a .chain <-> .dynsym offset
equivalence is an interesting innovation over my
approach; having said that I'm not sure why
you did that. Most chains are rather short - and
adding a set of bytes that are almost always
length 1/2/3 seems strange; may I suggest a
better approach ?
+ Since we can sort (almost all) of .dynsym -
there is no problem ensuring that ~all symbols
that we don't want in .hash occur at the end
of the .dynsym table [ cf. the next UNDEF point ].
+ Furthermore - by inspection it can be seen that
1/2^32 ~=== 1/2^30 
+ using this theorem, we can easily steal
1/2 bits from the top of the stored hash
and use them as chain terminators.
+ ie. lets have:
.hash[symhash % nbuckets] -> .chain
.chain -> [ (not-end) | hashval & 0x3fff ]
.chain -> [ (not-end) | hashval & 0x3fff ]
.chain -> [ (is-end) | hashval & 0x3fff ]
+ that saves us several bytes per chain,
and cf. above, no measurable loss in
performance. [ of course chain
<-> .dynsym here as before ]
+ in fact, for extensibility a per-shlib hash
comparison 'mask' field would make for a rather
better behaviour - so we can snarf bits from
here in future if we come up with brighter ideas.
3. What we hash:
+ it is incredible to me that UNDEF symbols are
included in the symbol hash; I see no legitimate
use for that [ eg. 'free' is an expensive hit in the
hash for virtually every app ;-]
+ with 2 bits, a 'chain terminator' and a
'defined terminator' if we don't bin the undef's
from the hash completely we can at least sort
them to the end of the chain and often avoid
comparing them - [ currently they would be
filtered out only in check_match after a .dynsym
check for "if ((sym->st_value == 0 /* No value. */...
|| (type_class & (sym->st_shndx == SHN_UNDEF)))"
+ This may seem an insignificant win: most
'normal' libs I've seen only import 1/3rd
of their symbols
+ OTOH - C++ 'plugins' [ which we can't
load RTLD_LOCAL because of vague symbols
tend to have map files / visibility markup
to reduce their exposed symbol count
+ as an example 'writer' lives in 'libsw' using
objdump -T libsw680li.so | grep UND | wc -l:
+ ie. for every 'defined' symbol that someone
might legitimately want to look up we have
~2 other symbols that they just -cannot- want
+ ie. in some cases by dropping UNDEF symbols, we can
have our new / larger .chain section and still save
space [ on disk & in cache ].
So - some comments on your patch. I notice you don't sort .dynstr as I
did ;-) that simplifies life clearly; of course - I have devious reasons
for wanting that. Wrt. -Bdirect if we sort our relocations right and use
-Bdirect, we can turn linking into a linear walk of both the library
list, .hash, .chain, .dynsym, and .dynstr [ if we play our cards
right ;-]; but that's for another day [ I'd just love
to make linking completely disappear from the L2 cache miss profile
(when using C) - I think it can be done ;-].
Reading the patch - I really like it :-) my calculation logic to lookup
the hash stuff in a separate section was *really* ugly and evil, and
would have required some re-factoring to make it at all pleasant, your
chain walking code is sweet and pleasant to the eye etc. :-)
Anyhow - we can build some other rather nice optimisations on top of
this I think; and if we can do some of the above we can save space as
well for a pure gnu-hash system I think. My only concern really is with
Thanks so much for doing this - of course, as cache sizes increase this
may appear to be less useful; but the faster we can make the linker,
the more we can split up OO.o into more libs, lazy load more,
componentize more and (hopefully) save size/time on startup & at
 - so your paragraph on the hash:
"530000 unique dynamic symbols ... Dan Bernstein's hash ... fewest
collisions (only 29) ... The standard SysV ELF hash function gave bad
results ... 1726 collisions"
Is interesting - and I'm sure the hash you propose is great - but of
course, with the hash comparison the reduction in the number of L2 cache
misses is so great that the 'quality' of the hash at this level is
rather uninteresting wrt. the optimisation [ IHMO etc. ].
ie. 29/530000 ~= 1726/530000, well nearly anyway ;-) focusing on a
0.005% hit vs. 0.3% is interesting but not worth fighting over. More
interestingly can we pick a few bits to steal and only degrade that by a
factor of 4 or so ? ie. 0.005% vs. 0.02% ;-)
 - the Core Duo here with it's 2Mb cache is amazingly faster for OO.o
linking, but clearly at smaller sizes there is this appalling cliff to
drop-off wrt. linking performance.
firstname.lastname@example.org <><, Pseudo Engineer, itinerant idiot