This is the mail archive of the mailing list for the binutils project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: DT_GNU_HASH: ~ 50% dynamic linking improvement

Hi Jakub,

On Wed, 2006-06-28 at 19:09 +0200, Jakub Jelinek wrote:
> The following patches introduce an optional ELF hash section

	Dudie; you rock ! - thanks for taking this on board; this is really
nice :-) and here was I thinking I was going to have to write a length
paper to try to convince Ulrich :-) You made my month (really).

> The initial design was done by Ulrich Drepper and was discussed with
> Michael Meeks a few months ago as well.  But nothing came off of these
> discussions and the proposed approach is quite different.

	This looks rather like -zdynsort + -zhash-vals prototype; but I'm too
pleased to quibble :-)

	Anyhow - I've been thinking about various other optimisations, many of
which we can easily & neatly fit into this nice framework, I've numbered
them to help chewing them over:

	1. Extensibility
		+ I want to see a multiply per lookup.
		+ ie. since the L2 cache misses are what hammer us -
		  if [sic] we want to extend the .chain -Bdirect
		  support in future - being able to put that in
		  the same record is really important.
		+ a 'record size' field gives us that with ~no
		  bloat as of now.
		+ [ and I really think some form of direct linking
		  is the future - but we can discuss that later 
		  clearly ]
		+ ie. having an extensibility story that is better
		  than 'just add a new section [ somewhere miles
		  away in memory/not-in-cache ] is not so fun.

	2. chain walking:
		+ the 'not having a .chain <-> .dynsym offset 
		  equivalence is an interesting innovation over my
		  approach; having said that I'm not sure why
		  you did that. Most chains are rather short - and
		  adding a set of bytes that are almost always
		  length 1/2/3 seems strange; may I suggest a
		  better approach ?
		+ Since we can sort (almost all) of .dynsym -
		  there is no problem ensuring that ~all symbols
		  that we don't want in .hash occur at the end
		  of the .dynsym table [ cf. the next UNDEF point ].
		+ Furthermore - by inspection it can be seen that
		  1/2^32 ~=== 1/2^30 [1]
			+ using this theorem, we can easily steal
			  1/2 bits from the top of the stored hash
			  and use them as chain terminators.
			+ ie. lets have:

.hash[symhash % nbuckets] -> .chain[5]

.chain[5] -> [ (not-end) | hashval[5] & 0x3fff ]
.chain[6] -> [ (not-end) | hashval[6] & 0x3fff ]
.chain[7] -> [ (is-end)  | hashval[7] & 0x3fff ]

			+ that saves us several bytes per chain,
			  and cf. above, no measurable loss in
			  performance. [ of course chain[5] 
			  <-> .dynsym[5] here as before ]

		+ in fact, for extensibility a per-shlib hash
		  comparison 'mask' field would make for a rather
		  better behaviour - so we can snarf bits from
		  here in future if we come up with brighter ideas.

	3. What we hash:
		+ it is incredible to me that UNDEF symbols are
		  included in the symbol hash; I see no legitimate
		  use for that [ eg. 'free' is an expensive hit in the
		  hash for virtually every app ;-]
		+ with 2 bits, a 'chain terminator' and a
		  'defined terminator' if we don't bin the undef's
		  from the hash completely we can at least sort
		  them to the end of the chain and often avoid
		  comparing them - [ currently they would be
		  filtered out only in check_match after a .dynsym
		  check for "if ((sym->st_value == 0 /* No value.  */...
		      || (type_class & (sym->st_shndx == SHN_UNDEF)))"
		+ This may seem an insignificant win: most
		  'normal' libs I've seen only import 1/3rd
		  of their symbols
			+ OTOH - C++ 'plugins' [ which we can't
			  load RTLD_LOCAL because of vague symbols
			  tend to have map files / visibility markup 
			  to reduce their exposed symbol count
			  substantially, eg.
			+ as an example 'writer' lives in 'libsw' using
			  objdump -T  | grep UND | wc -l:
				Undefined: 6465
				Defined:   3580
			+ ie. for every 'defined' symbol that someone 
			  might legitimately want to look up we have
			  ~2 other symbols that they just -cannot- want
			  to compare.
		+ ie. in some cases by dropping UNDEF symbols, we can
		  have our new / larger .chain section and still save 
		  space [ on disk & in cache ].

	So - some comments on your patch. I notice you don't sort .dynstr as I
did ;-) that simplifies life clearly; of course - I have devious reasons
for wanting that. Wrt. -Bdirect if we sort our relocations right and use
-Bdirect, we can turn linking into a linear walk of both the library
list, .hash, .chain, .dynsym, and .dynstr [ if we play our cards
right ;-]; but that's for another day [ I'd just love
to make linking completely disappear from the L2 cache miss profile
(when using C) - I think it can be done ;-].

	Reading the patch - I really like it :-) my calculation logic to lookup
the hash stuff in a separate section was *really* ugly and evil, and
would have required some re-factoring to make it at all pleasant, your
chain walking code is sweet and pleasant to the eye etc. :-)

	Anyhow - we can build some other rather nice optimisations on top of
this I think; and if we can do some of the above we can save space as
well for a pure gnu-hash system I think. My only concern really is with

	Thanks so much for doing this - of course, as cache sizes increase this
may appear to be less useful[2]; but the faster we can make the linker,
the more we can split up OO.o into more libs, lazy load more,
componentize more and (hopefully) save size/time on startup & at



[1] - so your paragraph on the hash:

	"530000 unique dynamic symbols ... Dan Bernstein's hash ... fewest
collisions (only 29) ... The standard SysV ELF hash function gave bad
results ... 1726 collisions"

	Is interesting - and I'm sure the hash you propose is great - but of
course, with the hash comparison the reduction in the number of L2 cache
misses is so great that the 'quality' of the hash at this level is
rather uninteresting wrt. the optimisation [ IHMO etc. ].

	ie. 29/530000 ~= 1726/530000, well nearly anyway ;-) focusing on a
0.005% hit vs. 0.3% is interesting but not worth fighting over. More
interestingly can we pick a few bits to steal and only degrade that by a
factor of 4 or so ? ie. 0.005% vs. 0.02% ;-)

[2] - the Core Duo here with it's 2Mb cache is amazingly faster for OO.o
linking, but clearly at smaller sizes there is this appalling cliff to
drop-off wrt. linking performance.

--  <><, Pseudo Engineer, itinerant idiot

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]