[WIP] New dwarf2 reader - updated 07-02-2001

Jim Blandy jimb@zwingli.cygnus.com
Tue Jul 31 13:35:00 GMT 2001


Daniel Berlin <dan@cgsoftware.com> writes:
> Jim Blandy <jimb@zwingli.cygnus.com> writes:
> > - MD5 generates a 128-bit checksum.  Any reason you only use the first
> >   32 bits in your hash?  It looks like checksum_die and
> >   checksum_partial_die both cut it off at eight digits.
> I use the first 64 bits, just like GCC does for checksumming DIE's and
> duplicate elimination.  Though i'll give you gcc also includes the
> name attached to the die. I'll happily make the two algorithms
> *exactly* the same if you like.

Eight hex digits * 4 bits per hex digit is 32 bits.  No?

    static inline char *
    checksum_die (struct die_info *die)
    {
      struct md5_ctx ctx;
      char *retval;
      int i;
      unsigned char checksum[16];
      md5_init_ctx (&ctx);
      process_full_die (die, &ctx);
      md5_finish_ctx (&ctx, checksum);
      retval = xcalloc (1, 9);
      for (i = 0; i < 4; ++i)
        sprintf (retval + (i * 2), "%.2x", checksum[i]);
      retval[8] = 0;
      return retval;
    }

The "MD5 never clashes" thesis is okay by me.  But to say that,
therefore, the first 32 bits of MD5 never clash, is not okay.

Using a function yielding n distinct hash values, the probability is
1/n that they clash, thus 1-1/n that they don't clash.  For p
independent pairs, the chances are (1-1/n)^p that none clash.  For a
set of d dies, there are d(d-1) pairs.  So in our case, using 32 bits
of MD5 for 10000 distinct dies, the chance of no clashes are
(1-1/(2^32))^(10000(10000-1)), or 0.977.  Thus a 2.3% chance of a
clash.  This sucks.

Using 64 bits, as you intended, is much better, but why not use the
full 128?  Honestly, why play this kind of game at all?  It's not that
hard to do the job perfectly --- just use the significant die contents
themselves as the tree key.  No collisions, no hash function, simpler.

> > - Why are you using a splay tree keyed by MD5 hash values, instead of
> >   a hash table?  The only advantage of a splay over a hash table is
> >   ordered retrieval, but your keys don't have any meaningful order.
> I hate the hash table interface for libiberty, it's annoying to use,
> IMHO.
> Other than that, no reason. I was meaning to make a nice generic
> dictionary interface, and talked with DJ about whether i could just
> use libdict, but since it's not GPL, and i don't have time to
> duplicate it's nice interface right now, i just went with what was
> easiest to program.

Okay.

> > - Assuming MD5 is perfect, you don't checksum all attribute forms
> >   (DW_FORM_flag, for example).  This means that you can get false
> >   duplicates, if two dies differ only in (say) a flag's value.  How is
> >   read_comp_unit_dies set up to tolerate that?
> I only checksummed the  stuff gcc generates (since i based the
> checksumming code on gcc's), and I forgot that gcc
> doesn't generate some forms.  

GCC does generate DW_FORM_flag:

	.byte	0x2	; uleb128 0x2; (abbrev code)
	.byte	0x2e	; uleb128 0x2e; (TAG: DW_TAG_subprogram)
	.byte	0x1	; DW_children_yes
	.byte	0x1	; uleb128 0x1; (DW_AT_sibling)
	.byte	0x13	; uleb128 0x13; (DW_FORM_ref4)
	.byte	0x3f	; uleb128 0x3f; (DW_AT_external)
	.byte	0xc	; uleb128 0xc; (DW_FORM_flag)
	.byte	0x3	; uleb128 0x3; (DW_AT_name)
	.byte	0x8	; uleb128 0x8; (DW_FORM_string)
	.byte	0x3a	; uleb128 0x3a; (DW_AT_decl_file)
	.byte	0xb	; uleb128 0xb; (DW_FORM_data1)
	.byte	0x3b	; uleb128 0x3b; (DW_AT_decl_line)
	.byte	0xb	; uleb128 0xb; (DW_FORM_data1)
	.byte	0x27	; uleb128 0x27; (DW_AT_prototyped)
	.byte	0xc	; uleb128 0xc; (DW_FORM_flag)
	.byte	0x49	; uleb128 0x49; (DW_AT_type)
	.byte	0x13	; uleb128 0x13; (DW_FORM_ref4)
	.byte	0x11	; uleb128 0x11; (DW_AT_low_pc)
	.byte	0x1	; uleb128 0x1; (DW_FORM_addr)
	.byte	0x12	; uleb128 0x12; (DW_AT_high_pc)
	.byte	0x1	; uleb128 0x1; (DW_FORM_addr)
	.byte	0x40	; uleb128 0x40; (DW_AT_frame_base)
	.byte	0xa	; uleb128 0xa; (DW_FORM_block1)
	.byte	0x0
	.byte	0x0

> > - In read_comp_unit_dies, when you find a duplicate die, you skip to
> >   its sibling.  What if the parent die is identical, but the children
> >   dies differ?
> 
> I don't believe this is possible in any language.

How about this:

namespace X {
  namespace A {
    int m, n;
  }
}

namespace Y {
  namespace A {
    int o, p;
  }
}

The dies for the two 'A' namespaces have the name DW_AT_name
attribute, but they're clearly different.

Just in principle, this seems sloppy.  Dies are just arbitrary tree
nodes.  There's no reason to assume that if two parent nodes are
identical, we can skip the second one and all its children.



More information about the Gdb-patches mailing list