This is the mail archive of the elfutils-devel@sourceware.org mailing list for the elfutils project.
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]
Re: dwarf_edit_output testcase sanity checking

From: Roland McGrath <roland at redhat dot com>
To: elfutils-devel at lists dot fedorahosted dot org
Date: Tue, 09 Nov 2010 00:41:23 -0800
Subject: Re: dwarf_edit_output testcase sanity checking
> So the problem is the hash of a reference attribute not being "stable"
> in the face of circularity. The dirty solution of just always using a
> hash of zero for a reference makes things indeed a lot worse, too many
> things hash together to be practical.

Right.

> I tried to hack together a "solution" that used the referent's
> non-reference attributes and tag. That showed a bug which looks
> suspiciously like what we tried to debug before:
> https://fedorahosted.org/pipermail/elfutils-devel/2010-September/001584.html

The dag burn fedorahosted.org webserver is not responding at the moment,
so I can't look that up right now.

> So I will try to get to the bottom of that first (somewhere when we try
> to finalize a die the attributes get misplaced). But in practice this
> might also not be enough to create very distinct hashes.

It certainly won't in the case of <pointer_type type=.../> or
<reference_type type=.../>.  Those are almost certainly part of
every circular reference chain in practice.

> Another idea would be to use not use a "perfect hash" for reference
> attributes. Allow some equal reference attributes to not hash the same.

I don't see how that would fly, unless, as you say, we do it only
temporarily and then replace them with perfect ones later.  The matching
hashes is the core of the whole plan for duplicate elimination.

> The nice thing about most of the hashing done in dwarf_output is that it
> "uniqifies" value/die groups and so creates a perfect unique hash. This
> can be done with ordered sets of values or die trees. But the reference
> attributes introduce arbitrary (possibly circular) graphs between
> elements. We get in trouble because we want to cut the circularity, but
> don't know "the start of the circle".

It might be possible to choose a canonical ordering so that we always
start at the same step in the cycle.  I haven't thought this through,
but maybe it is a line worth considering.  For example, say we chose a
canonical ordering of the direct children of a CU.  Then, among all CUs
so canonicalized, there would be a single choice for the first DIE in
the depth-first ordering of the whole CU tree that is the start of a
cycle, the same choice for all instances of an identical cycle.  
Is that true?

> So the problem comes from including the hash of all reference attributes
> in the construction of the "unique identifier object" for a attribute
> reference that points to a die. That means that references inside
> circular structures cannot really be completed since during the
> construction of the "unique identifier object" of reference to that
> circular structure itself it can never be fully resolved. 

Here you mean the references that are themselves steps along the cyclic
reference chain, right?  That is, not each and every reference that
appears inside a DIE tree that is part of a cycle.  For example, a
reference whose referent is some <base_type .../> DIE can always be
completed, because that referent has no other references.  Likewise, a
<const_type type=.../> can be completed if its type=... referent is a
base_type.  Likewise, a <pointer_type type=.../> whose referent is such
a const_type, etc.  i.e., any reference that is not itself part of a
cyclic reference chain.  I think that's what you meant, but I want
always to be sure we are being precise in understanding each other.

> So it seems that we should declare two "resolved" states for DIEs. One
> "tree resolved" that takes into account just the basic attributes plus
> all the children "tree resolved". And another "fully resolved" that
> all attributes, including reference values into account, plus "fully
> resolved" children.

That sounds familiar.  In some previous iterations of the 'struct entry'
thing with pending counts, I had multiple different kinds of pending
counts.  It all got very confusing and I went back and forth confounding
myself several times before settling on the apparently simpler approach
we have now.  That doesn't mean this isn't the right way to go.  But it
will take great care, and much better reasoning and commenting than I
did before, to make sure it all makes sense.  I'm sure what I did before
is not exactly what you are proposing, but it certainly was a way to
make all the code even harder to follow.

> A reference value then only needs its referent DIE to be "tree
> resolved" to create the "unique identifier object" for it.  That
> "breaks the circle" at the start of a reference attribute value so you
> don't need to track circularity anymore.

Ok, I think that makes sense.  But then we come back to the case of
<pointer_type type=.../>.  That has no children and no non-reference
attributes, so there is nothing but DW_TAG_pointer_type to contribute to
its hash value, and so all of them hash the same.

> The disadvantage is that you end up with two "pending queues" which
> seems to effectively mean you will scan the whole DIE tree twice (or at
> least calculate the hash over a DIE twice, without and then with taking
> reference values into account). 

This might not be as repetitive as it sounds.  You don't need to visit
the non-reference attributes a second time.  In "tree resolved" state,
you can take the complete hash of all the non-reference attributes, and
also hash just the names of the reference attributes, and combine those
together as a single hash value.  Then the second pass in "fully
resolved" state will just take the final hashes of the referent DIEs
(pointer-hashes of their .identity () values) and combine those with the
hash saved in the first pass--this yields the hash of an "attrs_pair".
You do then need to visit each child DIE again to combine their final
hashes into the "brood hash", but since each child is fully resolved by
then, that is just a (reduce 'identity children) iteration on the vector
(std::accumulate(c.begin(),c.end(),0,identity_hash) or something).

> And since some "tree resolved" DIEs might have the same hash, but are
> not equal if you take the value references into account, you don't
> have perfect hashes for the attribute reference values. But they only
> collide when referring to the same structural trees with all basic
> attribute in all included DIEs matching completely.

That sounds nice, but the "structural tree" <singleton_tag/> with "all
basic attributes" being the empty set is a common case.  Not that I have
any better plan.  The supposed plan as I've left it is just lots of both
negative and positive caching for these costly comparisons after every
instance of {pointer,reference,const,...}_type has hashed the same (one
distinct hash per tag) by this plan.  Maybe that really is good enough
if we address the cyclic cases in ways that avoid false-positive caching
(somehow that doesn't also abandon the correct caching, so we don't do
lots of repetitive costly comparisons while handling cycles).
But I just don't know.


Thanks,
Roland
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]