This is the mail archive of the binutils@sourceware.org mailing list for the binutils project.
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |
Other format: | [Raw text] |
On Sun, 2006-06-25 at 22:45 -0400, John Moser wrote: > I was looking at some of the work Michael Meeks was doing, and thinking (Now let's see if replying to a 'sent' message in evolution breaks threading...) Thinking on the patricia trie thing more, still haven't figured out the ELF header or the linker or whatnot, so I'm just throwing design ideas out. I came up with a few thoughts: - We can get the actual string data from .dynstr, by giving start address and length; but this is really a bad idea for locality (Plus it forces us to add relocations..) - We should avoid more relocations; so the actual format of patricia tries should be made to be rather relative to self if this is cheap - Locality is a moot point once the symbol's string is resolved; so we don't necessarily need to store symbol addresses in the trie for performance, instead just give the index in .dynsym I'm making the following assumptions: - .dynstr stores the string names of symbols (we don't care about this) - .dynsym stores the addresses of those symbols (we need this) - I haven't a clue how or where the hash table works, but if I say this ambiguously enough everyone else will fill in the blanks. So ... how about this? - The .gnu.dynradix section contains the patricia tries - Hashes lead to the patricia tries, in the same way that they lead to hash chains normally - The patricia tries are linear blobs of data, with "pointers" to "children" when they branch as a number of bytes to skip in the data + We do NOT want these to be literal pointers, because then they have to be relocated! (this is work) - The text is stored in the patricia trie itself + This uses more space than referencing .dynstr, but we avoid a ton of L2 cache misses + The amount of extra space is less than a copy of .dynstr because the common prefixes on symbol strings are combined into single strings - Symbols resolve to an index in .dynsym, which is used to find the actual address + We care more about not relocating than we do about one L2 cache miss at the end of a symbol look-up; so no literal address here. This is an ugly hack meant to allow a linker aware of it to move its ass, I'm not worried a few silly hacks as long as the code is readable and the process is reliable. As for the tree itself I'm thinking along the lines of... next node index: type size desc BYTE 1 Character that starts the next node WORD 2 offset to skip forward from terminating \0 of this node's string fragment to enter this branch node format: STRING X A \0 terminated string, the part of the string we're searching for that's exemplified in this node DWORD 4 32-bit value, .dynsym index of the symbol (NULL if there is no such symbol) BYTE 1 An unsigned 8-bit value telling how many branches can be taken from this node (== Y) NEXTND 3Y A set of Y next node indexes so let's say "zelda" and "zelba". Looks like: |-----+12--------| |-----+19-----------------| zel\0{0}{2}b{12}d{19}a\0{1}{0}a\0{2}{0} More literally: zel\0\0\0\0\0\x02b\0\x0cd\x13a\0\0\0\0\x01\0a\0\0\0\0\x02\0 Parsing this for 'zelda' would do: - compare 'z' 'e' 'l' '\0' - REMEMBER the index of this '\0' (3) as "search_idx" - Skip the next 4 bytes (we don't care) - Read the next 1 byte, nr_branches = Y - Iterate through branches (while (nr_branches)) - compare next byte ('b') - This is not 'd', skip next 2 bytes - nr_branches -= 1 (now == 1) - compare next byte ('d') - This is 'd', so we're good, break the loop - read next two bytes (19) - move up to search_idx + 19 - compare 'a' '\0' - This is the index we want, look up .dynsym index (2) (yes, I use the pointer to next node as part of the node; if we have a node that's like 'zelda' 'zeldb' and we branch, we'll hit a node that's just '\0' and its data will indicate the looked up string) The reason I picked an unsigned char to count branches is because I don't think the C standard allows unicode symbol names............. Anyway this should have the following advantages: - We have to relocate the section most likely; but the tries themselves are pretty much straight instrumentation data for a search algorithm. This means we should be able to get away with ONE relocation (the whole section), not one for every branch in the trie like if we use an in-memory trie and relocate it. - This is a dead simple state machine. I've never had a computer algorithms class and I could figure this one out. - The state machine is lightweight and pretty linear (== fast); compare some characters, take a value if we match; or skip that value, increment through Y character comparisons and then take a value and add it to an index, then go back to the beginning of the state machine. Also there's the case where you run out of nodes, and return not found. - The state machine is deterministic (required). - Every node uses 7 additional bytes of space; but every set of symbols sharing the same prefix drops off all but one copy of that prefix. Try a C++ symbol table, it's a definite space win. + Should I instead use bigger values in a few places (and pad the strings) to keep everything aligned to 4 byte boundaries? The processor tends to take slightly longer I'm told if it has to cross a boundary for a native word size. We're not worried about atomic reads here so it doesn't strictly matter. - The data should all be pretty close together, branches should mainly fall into adjacent memory. This is an L1/L2 cache win. - Since we're not relocating, most of the extra memory used will be shared between all applications. At least, that's as far as I get on my limited knowledge of the topic (all information I've used here I've actually collected since Friday...) Any comments? -- John Moser <john.r.moser@gmail.com>
Attachment:
signature.asc
Description: This is a digitally signed message part
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |