Re: dwarf_edit_output testcase sanity checking


On Thu, 2010-11-11 at 12:07 -0800, Roland McGrath wrote:
> Does the first hash ignore reference attributes completely as if they
> weren't there?  Or does the hash function factor those attribute names
> in too, and just omit any effect of the values (referents)?  It doesn't
> matter either way to the correctness of the algorithm.  In practice, it
> probably doesn't change the collision rate on real-world data to an
> important degree.  For example, <pointer_type byte_size=8/> (which is
> "void *") vs <pointer_type byte_size=8 type="#refN"/> ("something *")
> would or wouldn't collide in the first hash.  But all the different ones
> with different "#refN"s are going to collide anyway.
>
> It might make sense to completely split the final storage of attribute
> maps into the basic set and the reference set.  Then it becomes very
> straightforward to completely finalize the basic set in the first walk.

Right, so for consistency, ignoring the reference attributes completely
seems simplest. Although I admit I hadn't realized there were also the
canonicalized attribute map hacks.
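
Roughly what I have in mind for that first pass, as a toy sketch only
(simplified, made-up die/attr types and names, not the actual
dwarf_output classes):

#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct die;

struct attr
{
  int name;             // DW_AT_* code
  bool is_reference;    // true for a reference-form attribute
  std::string value;    // encoded non-reference value
  const die *referent;  // only meaningful when is_reference is true
};

struct die
{
  int tag;              // DW_TAG_* code
  std::vector<attr> attrs;
  std::vector<die> children;
};

static size_t
hash_combine (size_t seed, size_t v)
{
  return seed ^ (v + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

// First-pass hash: mixes in the tag, all non-reference attributes and
// the children's first hashes.  Reference attributes are ignored
// completely; neither their name nor their referent contributes.
size_t
first_pass_hash (const die &d)
{
  size_t h = std::hash<int> () (d.tag);
  for (const attr &a : d.attrs)
    if (!a.is_reference)
      {
        h = hash_combine (h, std::hash<int> () (a.name));
        h = hash_combine (h, std::hash<std::string> () (a.value));
      }
  for (const die &c : d.children)
    h = hash_combine (h, first_pass_hash (c));
  return h;
}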

> It may even make sense to completely finalize any whole DIEs that have
> no references in the first pass.  I'm not sure off hand if that really
> saves any storage or computation, as opposed to just moving the same
> storage allocation and the same computation from the second pass to the
> first pass.  I guess there's always a little: you don't need any
> "pending_entry" object with either pointers to the attribute sets or
> just storage of their hashes, you can just have the final pointer
> directly to the collector die_info_pair.

To my surprise most DIEs actually do have reference attributes. A quick
random sampling of some test binaries showed ~2/3 of all DIEs have one.
Only ~1/3 of the DIEs have purely basic attributes. Still, I think it
makes sense to finalize those immediately. Since location in the tree
doesn't matter for the second pass, we can then just take the list of
~2/3 pending DIEs and walk over them directly.
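
Something like this for the split, reusing the toy types from the
sketch above (again only a sketch; a real version would memoize the
reference check instead of rescanning subtrees):

// Does this DIE, or anything below it, carry a reference attribute?
static bool
subtree_has_reference (const die &d)
{
  for (const attr &a : d.attrs)
    if (a.is_reference)
      return true;
  for (const die &c : d.children)
    if (subtree_has_reference (c))
      return true;
  return false;
}

// One walk over the tree: subtrees without any reference attribute can
// be finalized (interned) right away, everything else goes on the
// pending list for the second pass.  Order on the pending list doesn't
// matter.
void
first_pass_walk (const die &d,
                 std::vector<const die *> &finalized,
                 std::vector<const die *> &pending)
{
  for (const die &c : d.children)
    first_pass_walk (c, finalized, pending);
  if (subtree_has_reference (d))
    pending.push_back (&d);
  else
    finalized.push_back (&d);
}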

> I concur with the suspicion that the inclusion of the children in the
> first hash will tend to make the collisions rare enough.  We can see how
> it goes in the actual data.  The only case that comes to mind for "long
> simple reference chains" is the several type DIEs in "foo *****" and such.

Hmmm, void ** and char ** might occur frequently. So one extra pass to
catch those might be in order, if that matters.

We could even chase the whole reference chain if we could prove it
always terminates in a DIE with only basic attributes. My suspicion is
that cycles only occur because we chase references and children at the
same time. But I don't think the DWARF format guarantees that; it just
looks true for all sane type systems I can imagine (where a reference
type cannot directly (recursively) point to itself, only through the
children of structure types). So I don't think we can rely on it in
general.
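
The later passes would then look roughly like this (same toy types as
above; reference attributes now contribute the referent's previous-pass
hash):

#include <unordered_map>

// Like the first hash, but a reference attribute folds in the hash the
// referent got in the previous pass.  Running it once on top of
// first_pass_hash gives the two-pass scheme; feeding its results back
// in once more is the three-pass variant that sees one link further
// down chains like "char **".
size_t
next_pass_hash (const die &d,
                const std::unordered_map<const die *, size_t> &prev_hash)
{
  size_t h = std::hash<int> () (d.tag);
  for (const attr &a : d.attrs)
    if (a.is_reference)
      h = hash_combine (h, prev_hash.at (a.referent));
    else
      {
        h = hash_combine (h, std::hash<int> () (a.name));
        h = hash_combine (h, std::hash<std::string> () (a.value));
      }
  for (const die &c : d.children)
    h = hash_combine (h, next_pass_hash (c, prev_hash));
  return h;
}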

> > So in the above scheme the final hash for such singleton_tags with just
> > one reference attribute is based on the (first) hash of the type DIE. So
> > if they point at the same basic_type they would hash the same, just like
> > when they point to a structure_type with the same children. But they
> > will hash differently if the basic_types are not equal or the
> > structure_type has different children.
> 
> I see.  This does look promising.  I'm not sure I understand the
> ramifications of the algorithm clearly enough to be sure whether it's
> sound and/or complete.  By "complete", I mean that it does guarantee
> perfect detection of all duplicates.

I think it does guarantee that.
Each duplicate DIE hashes to the same value.

>   By "sound" I mean that, if the
> hash function produced infinite bits (i.e. not considering collisions
> produced by the actual hash functions' bit swizzling techniques), it
> would never hash two DIES the same that are not truly equivalent.  As
> always, it's trying to think through the complex cyclic cases that makes
> my brain fall out.  Do you think you can state a proof of its soundness?

It doesn't guarantee that, although I think in practice the chances of
it happening are low. It guarantees that only as far as the normal tree
structure of the DIEs is concerned, because it only sees through one
(or two, in the three-pass variant) steps of reference equivalence. So
two DIEs that have a reference attribute referring to identical-looking
"tree equivalent" DIEs, which in turn have reference attributes to DIEs
that aren't equivalent, will still hash the same. Using the three-pass
variant detects one extra step in the reference attribute chain.
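
Concretely, in the notation from your example above (made-up
attributes): take <variable name="p" type="#ref1"/> in one CU and
<variable name="p" type="#ref2"/> in another, where #ref1 and #ref2 are
both <pointer_type byte_size=8 type="..."/>, but one type reference
points at a base_type "int" and the other at a base_type "char". The
two pointer_types are tree equivalent once their type references are
ignored, so in the second pass both variables fold in the same first
hash and collide, even though one is "int *" and the other "char *".
The third pass catches that, but "int **" vs "char **" would again need
one more pass (or the full comparison).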

If we could prove (or maybe dwarflint could check, but as said above I
don't think the DWARF spec guarantees it) that all reference attribute
chains (ignoring DIE children) are really DAGs (directed acyclic
graphs), then we could make the second pass hash unique too, which I
think would theoretically give us the guarantee.

> Also, do you think you can calculate the formal algorithmic complexity?

I believe that in the worst case (ignoring hash collisions) you have to
walk over the whole DIE tree once for each pass. So it should just be
O(p * n), where p is the number of passes and n is the number of DIEs
in all CUs. That assumes all partial results we store per DIE/attribute
in temporary tables can be looked up in constant time.

> Since the hash will not have infinite bits, there will always be the
> theoretical possibility of collisions.  Collisions in the first hash are
> straightforward to resolve, because the comparisons without regard to
> references are easy to understand.  When two DIEs do match completely
> without regard to any references, then it becomes possible to have a
> collision in both the first and second hashes.  What's the plan then?

We will have to compare the two DIEs completely, including children and
references. The hope is that the multi-pass hashing is pretty fast
itself, and strong enough to make that happen rarely.

> Aside from the "less likely than roving bands of hoodlums" criterion,
> we do need the algorithm to be formally sound.  Unlike dwarfcmp, for
> dwarf_output we don't really need it to be formally complete, just
> "complete enough" that uncaught duplicates don't appear in practice
> with real-world data, or are not large, or are exceptionally rare.

Since there will be hash collisions, we have to do a complete
comparison in such cases to make sure the result is sound. But that
full comparison can be "dumb" and just compare each of the DIEs'
children and reference attributes one-on-one. With dwarfcmp we would
like to catch identical DIEs which might have equivalent, but
different-length, reference chains. Here we can ignore those tricky
cases: when we just have two concrete DIEs that happen to hash
together, proving whether or not they are structurally completely
identical is enough to be sound (it just means we aren't complete, in
the sense that two DIEs which hash together but have different, yet
semantically identical, circular reference chains won't actually be
recognized as truly equal).
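
The fallback could be as dumb as this sketch (same toy types as above;
note it doesn't track visited pairs, so a real version would need that
to terminate on circular reference chains, conservatively treating them
as unequal, which is exactly the incompleteness described):

// Plain one-on-one structural comparison of attributes, referents and
// children, used only when the full hashes collide.
bool
dies_equal (const die &a, const die &b)
{
  if (a.tag != b.tag
      || a.attrs.size () != b.attrs.size ()
      || a.children.size () != b.children.size ())
    return false;

  for (size_t i = 0; i < a.attrs.size (); ++i)
    {
      const attr &x = a.attrs[i];
      const attr &y = b.attrs[i];
      if (x.name != y.name || x.is_reference != y.is_reference)
        return false;
      if (x.is_reference)
        {
          // Follow the references one-on-one; no attempt to equate
          // differently shaped but semantically identical chains.
          if (!dies_equal (*x.referent, *y.referent))
            return false;
        }
      else if (x.value != y.value)
        return false;
    }

  for (size_t i = 0; i < a.children.size (); ++i)
    if (!dies_equal (a.children[i], b.children[i]))
      return false;

  return true;
}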

Cheers,

Mark

