Patricia trie symbol tables?

Tue Jun 27 08:52:00 GMT 2006

On Sun, 2006-06-25 at 22:45 -0400, John Moser wrote:
> I was looking at some of the work Michael Meeks was doing, and thinking

(Now let's see if replying to a 'sent' message in evolution breaks
threading...)

Thinking on the patricia trie thing more, still haven't figured out the
ELF header or the linker or whatnot, so I'm just throwing design ideas
out.  I came up with a few thoughts:

 - We can get the actual string data from .dynstr, by giving start
address and length; but this is really a bad idea for locality (Plus it
forces us to add relocations..)
 - We should avoid more relocations; so the actual format of patricia
tries should be made to be rather relative to self if this is cheap
 - Locality is a moot point once the symbol's string is resolved; so we
don't necessarily need to store symbol addresses in the trie for
performance, instead just give the index in .dynsym

I'm making the following assumptions:

 - .dynstr stores the string names of symbols (we don't care about this)
 - .dynsym stores the addresses of those symbols (we need this)
 - I haven't a clue how or where the hash table works, but if I say this
ambiguously enough everyone else will fill in the blanks.

So ... how about this?

 - The .gnu.dynradix section contains the patricia tries

 - Hashes lead to the patricia tries, in the same way that they lead to
   hash chains normally

 - The patricia tries are linear blobs of data, with "pointers" to
   "children" when they branch as a number of bytes to skip in the data
   + We do NOT want these to be literal pointers, because then they have
     to be relocated!  (this is work)

 - The text is stored in the patricia trie itself
   + This uses more space than referencing .dynstr, but we avoid a ton
     of L2 cache misses
   + The amount of extra space is less than a copy of .dynstr because
     the common prefixes on symbol strings are combined into single
     strings

 - Symbols resolve to an index in .dynsym, which is used to find the
   actual address
   + We care more about not relocating than we do about one L2 cache
     miss at the end of a symbol look-up; so no literal address here.
     This is an ugly hack meant to allow a linker aware of it to move
     its ass, I'm not worried a few silly hacks as long as the code is
     readable and the process is reliable.

As for the tree itself I'm thinking along the lines of...

next node index:
type	size	desc
BYTE	1	Character that starts the next node
WORD	2	offset to skip forward from terminating \0 of this
    	 	node's string fragment to enter this branch

node format:
STRING	X	A \0 terminated string, the part of the string we're
      	 	searching for that's exemplified in this node
DWORD	4	32-bit value, .dynsym index of the symbol (NULL if there
     	 	is no such symbol)
BYTE	1	An unsigned 8-bit value telling how many branches can be
    	 	taken from this node (== Y)
NEXTND	3Y	A set of Y next node indexes

so let's say "zelda" and "zelba".  Looks like:

    |-----+12--------|
    |-----+19-----------------|
zel\0{0}{2}b{12}d{19}a\0{1}{0}a\0{2}{0}

More literally:

zel\0\0\0\0\0\x02b\0\x0cd\x13a\0\0\0\0\x01\0a\0\0\0\0\x02\0

Parsing this for 'zelda' would do:

 - compare 'z' 'e' 'l' '\0'
 - REMEMBER the index of this '\0' (3) as "search_idx"
 - Skip the next 4 bytes (we don't care)
 - Read the next 1 byte, nr_branches = Y
 - Iterate through branches (while (nr_branches))
   - compare next byte ('b')
   - This is not 'd', skip next 2 bytes
   - nr_branches -= 1 (now == 1)
   - compare next byte ('d')
   - This is 'd', so we're good, break the loop
 - read next two bytes (19)
 - move up to search_idx + 19
 - compare 'a' '\0'
 - This is the index we want, look up .dynsym index (2)

(yes, I use the pointer to next node as part of the node; if we have a
node that's like 'zelda' 'zeldb' and we branch, we'll hit a node that's
just '\0' and its data will indicate the looked up string)

The reason I picked an unsigned char to count branches is because I
don't think the C standard allows unicode symbol names.............

Anyway this should have the following advantages:

 - We have to relocate the section most likely; but the tries themselves
   are pretty much straight instrumentation data for a search
   algorithm.  This means we should be able to get away with ONE
   relocation (the whole section), not one for every branch in the trie
   like if we use an in-memory trie and relocate it.

 - This is a dead simple state machine.  I've never had a computer
   algorithms class and I could figure this one out.

 - The state machine is lightweight and pretty linear (== fast); compare
   some characters, take a value if we match; or skip that value,
   increment through Y character comparisons and then take a value and
   add it to an index, then go back to the beginning of the state
   machine.  Also there's the case where you run out of nodes, and
   return not found.

 - The state machine is deterministic (required).

 - Every node uses 7 additional bytes of space; but every set of symbols
   sharing the same prefix drops off all but one copy of that prefix.
   Try a C++ symbol table, it's a definite space win.
   + Should I instead use bigger values in a few places (and pad the
     strings) to keep everything aligned to 4 byte boundaries?  The
     processor tends to take slightly longer I'm told if it has to cross
     a boundary for a native word size.  We're not worried about atomic
     reads here so it doesn't strictly matter.

 - The data should all be pretty close together, branches should mainly
   fall into adjacent memory.  This is an L1/L2 cache win.

 - Since we're not relocating, most of the extra memory used will be
   shared between all applications.

At least, that's as far as I get on my limited knowledge of the topic
(all information I've used here I've actually collected since Friday...)

Any comments?

-- 
John Moser <john.r.moser@gmail.com>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 829 bytes
Desc: This is a digitally signed message part
URL: <https://sourceware.org/pipermail/binutils/attachments/20060627/229e2b59/attachment.sig>