Patricia trie symbol tables?
John Moser
john.r.moser@gmail.com
Tue Jun 27 08:52:00 GMT 2006
On Sun, 2006-06-25 at 22:45 -0400, John Moser wrote:
> I was looking at some of the work Michael Meeks was doing, and thinking
(Now let's see if replying to a 'sent' message in evolution breaks
threading...)
Thinking on the patricia trie thing more, still haven't figured out the
ELF header or the linker or whatnot, so I'm just throwing design ideas
out. I came up with a few thoughts:
- We can get the actual string data from .dynstr, by giving start
address and length; but this is really a bad idea for locality (Plus it
forces us to add relocations..)
- We should avoid more relocations; so the actual format of patricia
tries should be made to be rather relative to self if this is cheap
- Locality is a moot point once the symbol's string is resolved; so we
don't necessarily need to store symbol addresses in the trie for
performance, instead just give the index in .dynsym
I'm making the following assumptions:
- .dynstr stores the string names of symbols (we don't care about this)
- .dynsym stores the addresses of those symbols (we need this)
- I haven't a clue how or where the hash table works, but if I say this
ambiguously enough everyone else will fill in the blanks.
So ... how about this?
- The .gnu.dynradix section contains the patricia tries
- Hashes lead to the patricia tries, in the same way that they lead to
hash chains normally
- The patricia tries are linear blobs of data, with "pointers" to
"children" when they branch as a number of bytes to skip in the data
+ We do NOT want these to be literal pointers, because then they have
to be relocated! (this is work)
- The text is stored in the patricia trie itself
+ This uses more space than referencing .dynstr, but we avoid a ton
of L2 cache misses
+ The amount of extra space is less than a copy of .dynstr because
the common prefixes on symbol strings are combined into single
strings
- Symbols resolve to an index in .dynsym, which is used to find the
actual address
+ We care more about not relocating than we do about one L2 cache
miss at the end of a symbol look-up; so no literal address here.
This is an ugly hack meant to allow a linker aware of it to move
its ass, I'm not worried a few silly hacks as long as the code is
readable and the process is reliable.
As for the tree itself I'm thinking along the lines of...
next node index:
type size desc
BYTE 1 Character that starts the next node
WORD 2 offset to skip forward from terminating \0 of this
node's string fragment to enter this branch
node format:
STRING X A \0 terminated string, the part of the string we're
searching for that's exemplified in this node
DWORD 4 32-bit value, .dynsym index of the symbol (NULL if there
is no such symbol)
BYTE 1 An unsigned 8-bit value telling how many branches can be
taken from this node (== Y)
NEXTND 3Y A set of Y next node indexes
so let's say "zelda" and "zelba". Looks like:
|-----+12--------|
|-----+19-----------------|
zel\0{0}{2}b{12}d{19}a\0{1}{0}a\0{2}{0}
More literally:
zel\0\0\0\0\0\x02b\0\x0cd\x13a\0\0\0\0\x01\0a\0\0\0\0\x02\0
Parsing this for 'zelda' would do:
- compare 'z' 'e' 'l' '\0'
- REMEMBER the index of this '\0' (3) as "search_idx"
- Skip the next 4 bytes (we don't care)
- Read the next 1 byte, nr_branches = Y
- Iterate through branches (while (nr_branches))
- compare next byte ('b')
- This is not 'd', skip next 2 bytes
- nr_branches -= 1 (now == 1)
- compare next byte ('d')
- This is 'd', so we're good, break the loop
- read next two bytes (19)
- move up to search_idx + 19
- compare 'a' '\0'
- This is the index we want, look up .dynsym index (2)
(yes, I use the pointer to next node as part of the node; if we have a
node that's like 'zelda' 'zeldb' and we branch, we'll hit a node that's
just '\0' and its data will indicate the looked up string)
The reason I picked an unsigned char to count branches is because I
don't think the C standard allows unicode symbol names.............
Anyway this should have the following advantages:
- We have to relocate the section most likely; but the tries themselves
are pretty much straight instrumentation data for a search
algorithm. This means we should be able to get away with ONE
relocation (the whole section), not one for every branch in the trie
like if we use an in-memory trie and relocate it.
- This is a dead simple state machine. I've never had a computer
algorithms class and I could figure this one out.
- The state machine is lightweight and pretty linear (== fast); compare
some characters, take a value if we match; or skip that value,
increment through Y character comparisons and then take a value and
add it to an index, then go back to the beginning of the state
machine. Also there's the case where you run out of nodes, and
return not found.
- The state machine is deterministic (required).
- Every node uses 7 additional bytes of space; but every set of symbols
sharing the same prefix drops off all but one copy of that prefix.
Try a C++ symbol table, it's a definite space win.
+ Should I instead use bigger values in a few places (and pad the
strings) to keep everything aligned to 4 byte boundaries? The
processor tends to take slightly longer I'm told if it has to cross
a boundary for a native word size. We're not worried about atomic
reads here so it doesn't strictly matter.
- The data should all be pretty close together, branches should mainly
fall into adjacent memory. This is an L1/L2 cache win.
- Since we're not relocating, most of the extra memory used will be
shared between all applications.
At least, that's as far as I get on my limited knowledge of the topic
(all information I've used here I've actually collected since Friday...)
Any comments?
--
John Moser <john.r.moser@gmail.com>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 829 bytes
Desc: This is a digitally signed message part
URL: <https://sourceware.org/pipermail/binutils/attachments/20060627/229e2b59/attachment.sig>
More information about the Binutils
mailing list