Dynamic linking optimizations

John Richard Moser nigelenki@comcast.net
Tue Mar 3 07:30:00 GMT 2009


The dynamic linking process, as I understand it, has some basic 
performance considerations.  Particularly, you may have many symbols 
with similar prefixes to look up, such as the following:

libFunkyStuff_main_loop
libFunkyStuff_order_data
libFunkyStuff_order_ops
libGraphics_garbage
sndlib_close_sound_device
sndlib_open_sound_device
sndlib_play_sound

To handle this, we do a number of things:

1.  Generate a simple chaining hash table, drop collisions into buckets.

2.  Point from that hash table into a symbol table.

3.  Somehow associate the symbol with the offset in the file


So on lookup we have the following task:

1.  Generate a hash for the symbol

2.  Look up the proper entry in the hash table for a DSO.

3.  Character-by-character compare for each entry in the bucket.

4.  If we get a match, we've found it.  Else check the next DSO.


For the lookup, (1) can be handled by generating a hash or by 
pre-computing hashes and storing them (DT_GNU_HASH).

So the simple question I want to ask:  why in the hell would you do this?

If you flat-out compare against everything, with no hash table, you make 
duplicate comparisons.  For $NUM_STRINGS strings sharing the same prefix 
of $PFX_LEN characters, looking up every such string in a linear scan 
would cause you to walk that exact prefix $NUM_STRINGS * ($NUM_STRINGS + 
1) / 2 times.

Okay, so hashing gives you a way to mitigate this.  If you have 2048 
symbols, and 1024 buckets in the hash table, and your hash algorithm is 
perfect for the given case, then you'll traverse each prefix at most TWO 
times per lookup (i.e. two similarly-prefixed strings hit the same 
bucket) and at least ONE time per lookup.  This is much better.

In practice, you can still end up with 40 similarly-prefixed strings in 
the same bucket, just by pure bad luck and just the right (or wrong) 
hash table size.  Adjust the size arbitrarily and your luck changes 
and you're good (welcome to -Wl,-O1).

Again:  why in the hell would you do this?


Let's start playing with theory (nobody cares, show us the code!  I 
don't have any).  I say something blindingly obvious, and you tell me 
why this is an obviously bad approach.  If I can A) prove you wrong; or 
B) change the approach to erase the bad stuff, then I have a better 
solution.  If not, then the obvious has happened:  a bunch of people who 
are a hell of a lot smarter than me came up with something much better 
than I could this morning over coffee, and I'll quietly drop the idea.


Let's look at the symbol list given above, and work on the naive case 
without the hash table.  Now say we want to look for sndlib_play_sound() 
... we'll do the following comparisons:

l{skip}
l{skip}
l{skip}
l{skip}
sndlib_c{skip}
sndlib_o{skip}
sndlib_play_sound\0

That's 38 character comparisons, 6 adjunct operations ("Skip this
symbol, it's not the one we want").

Now here's an idea you've seen before, because I posted it before with 
little to no thought on how to implement it.  This time I've come up 
with something for that, but more on that later.  Let's arrange this in 
a Patricia Trie.

  ---lib--FunkyStuff_--main_loop
  |    |            |-order_--data
  |    |                    |-ops
  |    |-Graphics_garbage
  |-sndlib_--close_sound_device
           |-open_sound_device
           |-play_sound

Starting at the root, we'd need ...

l{skip}
s{follow}
ndlib_{branch}
c{skip}
o{skip}
p{follow}
lay_sound\0

Now let's say that each of these branches is an array of structs:

struct trie_branch {
	char		key;    /* Next character in string */
	unsigned int	offset; /* distance to jump to continue */
};

And that each time we branch, we see another struct:

struct trie_branch_header {
	unsigned char	branches; /* how many branches */
	/* What?  You think it's more complicated than that? */
};

This means "skip" is effectively character comparison and increment an
index:

for (i = 0; i < num_branches; i++) {
	if (branch_list[i].key == symbol_name[index]) {
		index++;
		follow_branch(branch_list[i].offset);
		break;
	}
}

Following of course has an adjunct operation attached to it:  you have 
to compute an address from an address offset (pointer arithmetic). 
Amortizing 'skip' to the character comparison that it is gives us the 
below behavior:

ls{follow}
ndlib_{branch}
cop{follow}
lay_sound\0

21 character comparisons, 3 operations in between.  (There is no corner 
case that creates more comparisons.  It may be possible to construct a 
pathological trie that branches at every single character, by generating 
every possible pair at every level, i.e. aa* ab* ac* -> aaa* aab* 
aac* ...; but you would need ($valid_chars)^($length_of_string) symbols 
to pull that off: over 200,000 symbols for strings of 3 characters drawn 
from [a-zA-Z0-9].)

Now here's the kicker:  How do you lay this out as a flat data 
structure?  Sure this is a beautiful idea, you'll never need a hash 
table, each time you enter the DSO to do a symbol lookup against its 
symbol table you'll only compare the same prefix ONE or ZERO times; but 
how do you get this on disk and in a format appropriate for loading by 
the dynamic linker?  How do you deal with cache behavior using this garbage?

If the symbols are best kept in alphabetical order, then the basic 
answer is simple:  sort the branches alphabetically (i.e. 
sndlib{c,o,p}, not sndlib{p,c,o}).  When converting this complex 
structure to a flat bitstream, follow it by naive recursion:  branch 
from the root out to the FIRST branch, and the FIRST branch off that, 
and the FIRST branch off that, etc.; then back off ONE level and follow 
the SECOND, then THIRD; and so on.  For our example:

  ---lib--FunkyStuff_--main_loop
  |    |            |-order_--data
  |    |                    |-ops
  |    |-Graphics_garbage
  |-sndlib_--close_sound_device
           |-open_sound_device
           |-play_sound

Write Branches: {l,s}
Write 'l':  ib
Write Branches: {F,G}
Write 'F':  unkyStuff_
Write Branches:  {m,o}
Write 'm':  ain_loop
Write 'o':  rder_
Write Branches:  {d,o}
Write 'd':  ata
Write 'o':  ps
Write 'G':  raphics_garbage
Write 's':  ndlib_
Write Branches:  {c,o,p}
Write 'c':  lose_sound_device
Write 'o':  pen_sound_device
Write 'p':  lay_sound

So from the root...

{hdr:2}\0{'l',$addr_to_l0}{'s',$addr_to_s0}{hdr:2}ib\0{'F',$addr_to_F0}{'G',$addr...}

and so on.  Breaking down the above, {hdr:2} means this header:

struct trie_branch_header {
	unsigned char	branches; /* how many branches */
	/* What?  You think it's more complicated than that? */
};

with a value that indicates '2' branches.  In practice a node could have 
anywhere from 1 to 256 branches, and an unsigned char only holds 0 
through 255, so we'd probably store ($NUM_BRANCHES - 1), i.e. store 0 
for 1, 1 for 2, and so on.  Implementation detail.

Following this header is the string for this portion.  For the root, 
that's the empty string, which is just a lone null terminator '\0' 
(i.e. the node's string is the single byte \0).

Following this is {'l',$addr_to_l0}{'s',$addr_to_s0}, which represents 
the two branches off the root, towards "lib" and "sndlib_" respectively. 
These are:

struct trie_branch {
	char		key;    /* Next character in string */
	unsigned int	offset; /* distance to jump to continue */
};

And "offset" represents the offset from the beginning of the whole trie. 
Since we know how many branches are available to take, there's no way to 
run off the end of this, so we can order these things however we like...

Following that is {hdr:2}ib\0{'F',$addr_to_F0}{'G',$addr...}.  This is 
where the branch for 'l' off the root goes.  It's "\0" "l" "ib\0" i.e. 
"lib" so far.  It has 2 branches, 'F' and 'G' heading to libFunkyStuff 
and libGraphics symbols.

You get the idea.

Reading it back would be trivial.  Further, if you search for symbols in 
alphabetical order and write out in the order stated, you'll always do 
the bulk of the work near the beginning of the data structure in the 
beginning, and inch further along as you continue.  It wouldn't be a 
linear read, but it'd be concentrated in hot areas as you moved through 
the lookup process.  Theoretically, when you finished with one branch, 
you'd be reading just before the start of where the next branch would 
jump to, and you may just wind up following the next branch right into 
something the CPU's precache algorithms pulled into cache already.

Non-trivially, you could read ordered *and* *keep* *track* *of* *the* 
*dynamic* *linker's* *state*.  In other words, if I'm reading around the 
libFunkyStuff_* branch, I can push each level I travel deeper onto a 
stack.  When I'm no longer matching libFunkyStuff_ (determined through a 
heuristic), I can pop off that stack, step back, and not have to repeat 
the comparison from the root down.  My lookup for libGraphics_garbage 
can start looking at the branch for 'G' (knowing I just came from 'F' 
and it's sorted), avoiding a few character comparisons... or, if there's 
a hell of a lot under there (huge C++ namespaces with thousands of 
symbols), a LOT of character comparisons.

Of course that's non-trivial and I have no idea how you'd do that or 
what the performance costs are.  Simplicity is key here.

Does anyone have any valid reason why this wouldn't work?  Am I a raging 
nut?  Please explain any flaws in my logic, because if they're there 
I've obviously ALREADY failed to see them and telling me to think about 
it a little harder won't work (I first came up with this idea when Meeks 
proposed DynSort).
