This is the mail archive of the elfutils-devel@sourceware.org mailing list for the elfutils project.
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]
Re: dwarf_output overview

From: Roland McGrath <roland at redhat dot com>
To: elfutils-devel at lists dot fedorahosted dot org
Date: Wed, 06 Oct 2010 13:55:10 -0700
Subject: Re: dwarf_output overview
> I found this hard to follow in the source code.

Sorry about that.  Suggestions for clean-up are always welcome.

> We are using a subr::value_set class which will be called with an input

This is little more than an STL unordered_set, which is the basic C++
hash table interface.  It's entirely generic to whatever value_type it's
instantiated for.  The only thing it does beyond the basic unordered_set
is to wrap the object in a hashed_value, which just means a cache of the
hash-value computation so we can compare the whole hash value rather than
recomputing the hash function on each comparison.  The idea is that
actual hash collisions should be extremely rare, though hash%bucketsize
collisions might not be as rare.  So we save the whole hash value in the
set element, and compare whole hash values before comparing actual objects.

>       template<typename input, typename arg_type>
>       const value_type *add (const input &v, arg_type &arg)
>       {
> 	return add (value_type (v, arg));
>       }
> 
> So it will always create a new value_type

Right.  That's because in practice the input is always of a different
type, not the actual value_type.  e.g., the input value argument will be
a const char * or a const std::string, when the value_type is actually
dwarf_output::value::value_string.  The value_string template constructor
handles those as input types.

> (which is always of a dwarf_output::value::subtype?) 

value_set is generic, but indeed our only uses of it are with a
value_type that is dwarf_output::value::value_something.

> and ultimately calls the (sub)struct hasher () operator on that value:

Right.

> So for a simple attr_value we don't care about the hasher of the input
> (dwarf/dwarf_edit) variant, but will always create a new
> dwarf_output::attr_value from the given input variant. Which just
> records the given input attr_value variant:

We don't have any value_set containing attr_value, which is just the
dynamic-typing wrapper object.  We have value sets for the individual
dwarf_output::value::value_something types.  

For example, dwarf_data::value_string is the element type inside the
_m_strings set.  A value_string is derived from std::string and its
template constructor passes the input value argument to the normal
std::string constructor.  That takes a std::string or a const char *, so
we just get the copied string.  value_string::hasher uses more layers of
subr::hash* stuff, so it turns into a normal string hash function
(algorithm stolen from boost).

When it gets into the value_set, it's wrapped in a hashed_value<value_string>
object, which is just the value_string with its precomputed hash value.
The standard unordered_set logic then does the actual hash table insertion,
but its "hash function" is just to fetch the stored _m_hash from the
hashed_value<value_string>.  This is so that we aren't recomputing the
string hash function of the elements inside each hash bucket to compare
to the new element, just doing integer comparison on the saved _m_hash.

The value_set::add method always returns the pointer to the object
sitting in the hash bucket.  If this value was already in the set,
we haven't added anything.  In either case, we just throw away the
just-created value_type (e.g. value_string) object, because unordered_set
stores its own copy.  

All of this stuff is essentially uninteresting.  It's just the layers of
C++ hooey to keep a hash table for each value_something type, and return
the const value_something * pointing into that hash table.

That is the bottom level of deduplication, for the simple "leaf"
attribute values.  There is nothing really fancy in there at all,
just hash tables.

So, for each string "foo", each identifier "bar", each constant 23, etc.,
there is just a single value_something object holding that identical
value.  For value_flag, we don't have a hash table, because there are of
course exactly two object it could ever be (true and false); it's
equivalent, dwarf_output_collector::flag (bool) yields one of the two
possible const dwarf_output::value::value_flag pointers.  Hence, for all
the simple attribute value flavors, an identical input value reduces to
a single const value_flavor pointer.

Now, all those values are only interesting because they are attribute
values, sitting in some attribute map.  You'll recall that an attribute
map is a dictionary of <int, attr_value> pairs, where the int is the
DW_AT_* constant.  The attr_value type is just a wrapper around a single
pointer, which points to one of those const value_flavor objects stored
in some value set.

Hence, under the layers of wrapper types, an attribute map is a
dictionary mapping an int to a const value_something pointer.  Because
the there is exactly one pointer to one instance of value_something for a
given concrete value (string "foo", constant 23, etc.), we can compare an
attribute with two simple integer comparisons (i.e. the int for integer
equality, and the pointer for pointer equality).

If all attribute values were just the simple "leaf" flavors, then we
could treat the whole attribute map as a single "simple" value, i.e.
just the list of <int, pointer> pairs.  If we pretend there are no
references, then that's just what we are doing.  

There is only one more wrinkle (aside from the big one, references),
which is that we actually have to keep track of the DIE tag (DW_TAG_*
constant) that goes with each attribute map.  I am actually failing to
recall exactly why we need that.  Let's leave that aside for now.

We want to store the attribute maps in a hash table too, so we can
consolidate any two identical whole maps to one attributes_type object.
To do that, we need a hash function for the whole map.  Naturally, the
hash function for a dictionary is just to combine the hash values of each
element pair.  dwarf_output::attributes_type::do_hash does that work.
Recall, each element pair is <int, attr_value>.  So the hash function
for the pair combines the hash of the int with the hash of the attr_value.
For integer types, template specializations get us to subr::integer_hash.

> So we will use the dwarf_output::value_attr hasher struct:

dwarf_output::attr_value::hasher, that is.

>       /* We can use the _m_value pointer itself as a perfect hash,
> 	 because all identical values share the same cell in the
> 	 collector.  

Like the comment says, this is simple because the _m_value pointer is one
of those const value_something pointers mentioned above.  We have already
resolved those to the unique object for that simple value.  We will never
have two different value_something objects in the collector whose
contents are identical, just the one pointer into the appropriate
value_set's copy.

Is that clear enough now, except for the whole issue of references?

To work through it and see how this all happens as I describe, you could
try making a simple dwarf_edit object from scratch to compose a tiny tree
with a few DIEs that have some attributes but no use references at all.
Then do a 'dwarf_output foo (edit_object, collector);' construction and
step through the copy-construction of the attribute maps.  For example,
takes tests/dwarf_edit.cc and tweak it a little to avoid the reference
use, and to have two attributes in the tree with the same value (two
identifiers "foo" e.g., or two instances DW_AT_byte_size=4 or whatever).

Note the existing example includes:

  ent.attributes ()[DW_AT_name].identifier () = "foo";
  ent.attributes ()[DW_AT_description] = ent.attributes ()[DW_AT_name];

This copies the "foo" string from the DW_AT_name just stored to the new
DW_AT_description, but it would be no different if it just used:

  ent.attributes ()[DW_AT_name].identifier () = "foo";
  ent.attributes ()[DW_AT_description].identifier () = "foo";

or even:

  ent.attributes ()[DW_AT_name].identifier () = "foo";
  ent.attributes ()[DW_AT_description].identifier ().append ("fxo");
  ent.attributes ()[DW_AT_description].identifier ()[1] = 'o';

There in the dwarf_edit object, you have two separate (mutable) string
objects.  When this whole tree gets copied into a dwarf_output object,
it will do the hash/comparison of each string and resolve them to the
same dwarf_output::value::value_identifier object in the collector.

Likewise, if you had:

cu.add_entry (DW_TAG_base_type)->attributes ()[DW_AT_name].identifier () = "a";
cu.add_entry (DW_TAG_base_type)->attributes ()[DW_AT_name].identifier () = "a";

then you get two different base_type entries whose attribute maps are
identical.  In the dwarf_edit they will be separate but equal at every
level.  When you copy this into a dwarf_output object, these will wind up
consolidated down to the same value_identifier object, then the same
attributes_type object, and finally, the same debug_info_entry object.

This is all the "simple stuff".  In the implementation, you'll see that
_m_attr_sets is not a value_set.  That's because of the references issues.
We'll get to that next, when this much is no longer troubling your mind.


Thanks,
Roland
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]