Dynamic linking optimizations
John Richard Moser
nigelenki@comcast.net
Tue Mar 3 07:30:00 GMT 2009
The dynamic linking process, as I understand it, has some basic
performance considerations. Particularly, you may have many symbols
with similar prefixes to look up, such as the following:
libFunkyStuff_main_loop
libFunkyStuff_order_data
libFunkyStuff_order_ops
libGraphics_garbage
sndlib_close_sound_device
sndlib_open_sound_device
sndlib_play_sound
To handle this, we do a number of things:
1. Generate a simple chaining hash table, drop collisions into buckets.
2. Point from that hash table into a symbol table.
3. Somehow associate the symbol with the offset in the file
So on lookup we have the following task:
1. Generate a hash for the symbol
2. Look up the proper entry in the hash table for a DSO.
3. Character-by-character compare for each entry in the bucket.
4. If we get a match, we've found it. Else check the next DSO.
For the lookup, (1) can be handled by generating a hash or by
pre-computing hashes and storing them (DT_GNU_HASH).
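The four lookup steps can be sketched in C. The `elf_hash` function below is the classic one from the ELF gABI; the table layout is a simplified stand-in for the real `DT_HASH` section (struct and field names are mine):

```c
#include <stdint.h>
#include <string.h>

/* Classic ELF (SysV) hash function, as specified in the ELF gABI. */
static uint32_t elf_hash(const char *name)
{
    uint32_t h = 0, g;
    while (*name) {
        h = (h << 4) + (uint8_t)*name++;
        g = h & 0xf0000000;
        if (g)
            h ^= g >> 24;
        h &= ~g;
    }
    return h;
}

/* Simplified stand-in for a DSO's hash section: nbucket head indices
 * into chain[], one chain entry per symbol (index 0 ends a chain). */
struct dso_hash {
    uint32_t nbucket;
    const uint32_t *bucket;    /* bucket[h % nbucket] -> symbol index   */
    const uint32_t *chain;     /* chain[i] -> next symbol in the bucket */
    const char *const *names;  /* names[i] -> symbol i's name           */
};

/* Steps 1-4: hash, pick a bucket, strcmp down the chain. Returns the
 * symbol index, or 0 (like STN_UNDEF) if this DSO lacks the symbol. */
static uint32_t lookup(const struct dso_hash *d, const char *sym)
{
    uint32_t i = d->bucket[elf_hash(sym) % d->nbucket];
    while (i != 0 && strcmp(d->names[i], sym) != 0)
        i = d->chain[i];
    return i;
}
```

Every `strcmp` in that chain walk is where the duplicate prefix comparisons below come from.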
So the simple question I want to ask: why in the hell would you do this?
If you have to flat out compare everything, no hash table, then you make
duplicate comparisons. For $NUM_STRINGS strings prefixed with the same
prefix of $PFX_LEN characters, looking up every such string would cause
you to traverse that exact prefix $NUM_STRINGS * ($NUM_STRINGS + 1) / 2
times.
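As a quick sanity check on that count (a throwaway sketch, nothing linker-specific): looking up the k-th of the identically-prefixed strings in a flat sorted list touches the shared prefix k times, once per candidate compared before the match, and summing over all lookups gives the triangular number.

```c
/* Sum 1 + 2 + ... + n: total shared-prefix traversals when each of n
 * identically-prefixed strings is looked up by a flat linear scan. */
static unsigned prefix_traversals(unsigned n)
{
    unsigned total = 0;
    for (unsigned k = 1; k <= n; k++)
        total += k;              /* k candidates share the prefix */
    return total;
}
```

For the seven example symbols above that's already 28 traversals; for 1024 it's over half a million.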
Okay, so hashing gives you a way to mitigate this. If you have 2048
symbols, and 1024 buckets in the hash table, and your hash algorithm is
perfect for the given case, then you'll traverse each prefix at most TWO
times per lookup (i.e. two similarly-prefixed strings hit the same
bucket) and at least ONE time per lookup. This is much better.
In practice, you can still end up with 40 similarly-prefixed strings in
the same bucket, just by pure bad luck and just the right (or wrong)
hash table size. Adjust the size, your luck changes, and you're good
(welcome to -Wl,-O1).
Again: why in the hell would you do this?
Let's start playing with theory (nobody cares, show us the code! I
don't have any). I say something blindingly obvious, and you tell me
why this is an obviously bad approach. If I can A) prove you wrong; or
B) change the approach to erase the bad stuff, then I have a better
solution. If not, then the obvious has happened: A bunch of people who
are a hell of a lot smarter than me came up with something much better
than I could this morning over coffee, and I'll be unplugging my
speakers before they start emitting penis-shaped sound waves.
Let's look at the symbol list given above, and work on the naive case
without the hash table. Now say we want to look for sndlib_play_sound()
... we'll do the following comparisons:
l{skip}
l{skip}
l{skip}
l{skip}
sndlib_c{skip}
sndlib_o{skip}
sndlib_play_sound\0
That's 38 character comparisons, 6 adjunct operations ("Skip this
symbol, it's not the one we want").
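To double-check that 38, here's a throwaway counter for the naive scan (the function name is mine; it just replays the comparison trace above):

```c
/* Count character comparisons for a naive linear symbol scan: each
 * candidate is compared byte by byte until a mismatch (one comparison
 * at the mismatching byte) or until both strings end together. */
static unsigned naive_scan_cost(const char *const *names, unsigned n,
                                const char *target)
{
    unsigned cost = 0;
    for (unsigned i = 0; i < n; i++) {
        const char *a = names[i], *b = target;
        for (;;) {
            cost++;                     /* one character comparison */
            if (*a != *b || *a == '\0')
                break;                  /* mismatch, or matched '\0' */
            a++, b++;
        }
    }
    return cost;
}
```

Run it over the seven example symbols with target sndlib_play_sound and it comes out to exactly 38.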
Now here's an idea you've seen before, because I posted it before with
little to no thought on how to implement it. This time I've come up
with something for that, but more on that later. Let's arrange this in
a Patricia Trie.
---lib--FunkyStuff_--main_loop
|     |            |-order_--data
|     |                    |-ops
|     |-Graphics_garbage
|-sndlib_--close_sound_device
         |-open_sound_device
         |-play_sound
Starting at the root, we'd need ...
l{skip}
s{follow}
ndlib_{branch}
c{skip}
o{skip}
p{follow}
lay_sound\0
Now let's say that each of these branches is an array of structs:
struct trie_branch {
    char key;          /* Next character in string */
    uint32_t offset;   /* distance to jump to continue */
};
And that each time we branch, we see another struct:
struct trie_branch_header {
    unsigned char branches;  /* how many branches */
    /* What? You think it's more complicated than that? */
};
This means "skip" is effectively a character comparison plus an index
increment:

for (i = 0; i < num_branches; i++) {
    if (branch_list[i].key == symbol_name[index]) {
        follow_branch(branch_list[i].offset);  /* follow the offset, not the key */
        break;
    }
}
Following of course has an adjunct operation attached to it: you have
to compute an address from an address offset (pointer arithmetic).
Amortizing 'skip' to the character comparison that it is gives us the
below behavior:
ls{follow}
ndlib_{branch}
cop{follow}
lay_sound\0
21 character comparisons, 3 operations in between. (There is no corner
case that creates more comparisons. You could intentionally create a
horrible trie that branches at every character, by including every
possible pair at every level, i.e. aa* ab* ac* -> aaa* aab* aac* ...,
but you would need ($valid_chars)^($length_of_string) symbols to pull
that off: over 200,000 symbols for strings of 3 characters given
[a-zA-Z0-9], for example.)
Now here's the kicker: How do you lay this out as a flat data
structure? Sure this is a beautiful idea, you'll never need a hash
table, each time you enter the DSO to do a symbol lookup against its
symbol table you'll only compare the same prefix ONE or ZERO times; but
how do you get this on disk and in a format appropriate for loading by
the dynamic linker? How do you deal with cache behavior using this garbage?
If the symbols are most useful in alphabetical order, then the basic
answer is simple: sort the branches alphabetically (i.e.
sndlib_{c,o,p}, not sndlib_{p,c,o}). When converting this complex
structure to a flat bitstream, follow it by naive recursion: branch from
the root out to the FIRST branch, and the FIRST branch off that, and the
FIRST branch off that, etc; then back off ONE level and follow the
SECOND, then THIRD; and so on. For our example:
---lib--FunkyStuff_--main_loop
|     |            |-order_--data
|     |                    |-ops
|     |-Graphics_garbage
|-sndlib_--close_sound_device
         |-open_sound_device
         |-play_sound
Write Branches: {l,s}
Write 'l': ib
Write Branches: {F,G}
Write 'F': unkyStuff_
Write Branches: {m,o}
Write 'm': ain_loop
Write 'o': rder_
Write Branches: {d,o}
Write 'd': ata
Write 'o': ps
Write 'G': raphics_garbage
Write 's': ndlib_
Write Branches: {c,o,p}
Write 'c': lose_sound_device
Write 'o': pen_sound_device
Write 'p': lay_sound
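That write trace can be sketched as naive recursion over an in-memory trie. This is a hedged sketch, not a format proposal: the struct and names are mine, and it stores the raw branch count (so a leaf can say 0) rather than the count-minus-one trick discussed below:

```c
#include <stdint.h>
#include <string.h>

/* In-memory trie node for the serializer sketch. `key` is the branch
 * character stored in the parent's table; `rest` is the remainder of
 * this node's label (so 'l' + "ib" spells "lib"). */
struct node {
    char key;
    const char *rest;
    unsigned nbranch;
    const struct node *const *child;   /* children, sorted by key */
};

/* Naive pre-order dump, exactly as traced above: header byte, label
 * fragment + '\0', then one {key, 4-byte offset} pair per branch,
 * then the FIRST child in full, then the SECOND, and so on. Offsets
 * are from the start of the trie. Returns the position one past the
 * last byte written. */
static size_t serialize(const struct node *n, uint8_t *out, size_t pos)
{
    out[pos++] = (uint8_t)n->nbranch;        /* raw count; 0 = leaf */
    size_t len = strlen(n->rest);
    memcpy(out + pos, n->rest, len + 1);     /* fragment + '\0' */
    pos += len + 1;
    size_t table = pos;                      /* branch table lives here */
    pos += n->nbranch * 5;                   /* 1-byte key + 4-byte offset */
    for (unsigned i = 0; i < n->nbranch; i++) {
        const struct node *c = n->child[i];
        uint32_t off = (uint32_t)pos;        /* child lands right here */
        out[table + i * 5] = (uint8_t)c->key;
        memcpy(out + table + i * 5 + 1, &off, sizeof off);
        pos = serialize(c, out, pos);        /* first branch first */
    }
    return pos;
}
```

A real tool would want a bounds-checked buffer and an explicit endianness for the offsets; this only shows the write order.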
So from the root...
{hdr:2}\0{'l',$addr_to_l0}{'s',$addr_to_s0}{hdr:2}ib\0{'F',$addr_to_F0}{'G',$addr...}
and so on. Breaking down the above, {hdr:2} means this header:
struct trie_branch_header {
    unsigned char branches;  /* how many branches */
    /* What? You think it's more complicated than that? */
};
with a value that indicates '2' branches. In practice a branch count of
1 to 256 has to fit in one byte, so we'd probably store ($NUM_BRANCHES -
1), i.e. store 0 for 1, 1 for 2, and so on. Implementation detail.
Following this header is the string for this portion. For the root,
that's the empty string: no data, just the null terminator '\0' (i.e.
the whole thing is one \0 byte).
Following this is {'l',$addr_to_l0}{'s',$addr_to_s0}, which represents
the two branches off the root, towards "lib" and "sndlib_" respectively.
These are:
struct trie_branch {
    char key;          /* Next character in string */
    uint32_t offset;   /* distance to jump to continue */
};
And "offset" represents the offset from the beginning of the whole trie.
Since we know how many branches are available to take, there's no way to
run off the end of this, so we can order these things however we like...
Following that is {hdr:2}ib\0{'F',$addr_to_F0}{'G',$addr...}. This is
where the branch for 'l' off the root goes. It's "\0" "l" "ib\0" i.e.
"lib" so far. It has 2 branches, 'F' and 'G' heading to libFunkyStuff
and libGraphics symbols.
You get the idea.
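To make the walk concrete, here's a hedged sketch of a lookup over that flat stream (layout as described above, names mine; note that in this sketch a symbol can only end at a leaf, so a real format would need an end-of-symbol flag for symbols that are prefixes of other symbols):

```c
#include <stdint.h>
#include <string.h>

/* Walk the flat trie for one symbol. Node layout: a count byte (raw
 * count here, 0 for a leaf), the label fragment with its '\0', then
 * count {key, 4-byte offset} pairs; offsets are from the start of the
 * trie. Returns 1 if the symbol is present, 0 otherwise. */
static int trie_lookup(const uint8_t *trie, const char *sym)
{
    const uint8_t *p = trie;
    for (;;) {
        unsigned nbranch = *p++;
        while (*p) {                     /* consume the label fragment */
            if ((char)*p != *sym)
                return 0;                /* prefix diverged */
            p++, sym++;
        }
        p++;                             /* step over the fragment's '\0' */
        if (*sym == '\0')
            return nbranch == 0;         /* matched all the way to a leaf */
        unsigned i;
        for (i = 0; i < nbranch; i++)    /* "skip": scan the branch keys */
            if ((char)p[i * 5] == *sym)
                break;
        if (i == nbranch)
            return 0;                    /* no branch for this character */
        uint32_t off;
        memcpy(&off, p + i * 5 + 1, sizeof off);
        p = trie + off;                  /* "follow": jump to the child */
        sym++;
    }
}
```

Each prefix character gets compared exactly once on the way down, which is the whole point.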
Reading it back would be trivial. Further, if you search for symbols in
alphabetical order and write out in the order stated, you'll always do
the bulk of the work near the beginning of the data structure, inching
further along as you continue. It wouldn't be a linear read, but it'd
be concentrated in hot areas as you moved through the lookup process.
Theoretically, when you finished with one branch, you'd be reading just
before the start of where the next branch would jump to, and you may
just wind up following the next branch right into something the CPU's
prefetch logic pulled into cache already.
Non-trivially, you could read in order *and* *keep* *track* *of* *the*
*dynamic* *linker's* *state*. In other words, while I'm reading around
the libFunkyStuff_* branch, I can push each level I travel deeper onto
a stack. When I'm no longer matching libFunkyStuff_ (determined through
heuristic logic), I can pop off that stack, step back, and not have to
repeat the comparison from the root down. My lookup for
libGraphics_garbage can then start at the branch for 'G' (knowing I
just came from 'F' and it's sorted), avoiding a few character
comparisons... or, if there's a hell of a lot under there (huge C++
namespaces with thousands of symbols), a LOT of character comparisons.
Of course that's non-trivial and I have no idea how you'd do that or
what the performance costs are. Simplicity is key here.
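For what it's worth, one cheap shape for that state (entirely hypothetical, names mine): a stack of (node offset, matched-length) frames kept between sorted lookups. On the next lookup, pop frames until the recorded prefix of the previous symbol still matches the new one, then resume from that node instead of the root.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical resume stack: one frame per trie level entered,
 * recording where that node lives and how many characters of the
 * symbol were already matched on the way in. */
struct trie_frame {
    uint32_t node_off;   /* offset of the node from the trie start */
    uint32_t matched;    /* characters of the symbol consumed so far */
};

struct trie_cursor {
    struct trie_frame stack[64];
    unsigned depth;
    char prev[256];      /* previous symbol looked up */
};

/* Pop frames until the stashed prefix of `prev` still matches `sym`;
 * return the frame to resume from (the root frame if nothing matches). */
static struct trie_frame resume_point(struct trie_cursor *c, const char *sym)
{
    while (c->depth > 0) {
        struct trie_frame f = c->stack[c->depth - 1];
        if (strncmp(c->prev, sym, f.matched) == 0)
            return f;    /* shared prefix survives: resume here */
        c->depth--;      /* prefix diverged above this level: pop */
    }
    return (struct trie_frame){ 0, 0 };   /* fall back to the root */
}
```

Whether the bookkeeping ever beats just re-walking a cache-hot trie root is exactly the open question.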
Does anyone have any valid reason why this wouldn't work? Am I a raging
nut? Please explain any flaws in my logic, because if they're there
I've obviously ALREADY failed to see them and telling me to think about
it a little harder won't work (I first came up with this idea when Meeks
proposed DynSort).