[WIP] New dwarf2 reader - updated 07-02-2001
Jim Blandy
jimb@zwingli.cygnus.com
Tue Jul 31 13:35:00 GMT 2001
Daniel Berlin <dan@cgsoftware.com> writes:
> Jim Blandy <jimb@zwingli.cygnus.com> writes:
> > - MD5 generates a 128-bit checksum. Any reason you only use the first
> > 32 bits in your hash? It looks like checksum_die and
> > checksum_partial_die both cut it off at eight digits.
> I use the first 64 bits, just like GCC does for checksumming DIE's and
> duplicate elimination. Though i'll give you gcc also includes the
> name attached to the die. I'll happily make the two algorithms
> *exactly* the same if you like.
Eight hex digits * 4 bits per hex digit is 32 bits. No?
static inline char *
checksum_die (struct die_info *die)
{
struct md5_ctx ctx;
char *retval;
int i;
unsigned char checksum[16];
md5_init_ctx (&ctx);
process_full_die (die, &ctx);
md5_finish_ctx (&ctx, checksum);
retval = xcalloc (1, 9);
for (i = 0; i < 4; ++i)
sprintf (retval + (i * 2), "%.2x", checksum[i]);
retval[8] = 0;
return retval;
}
The "MD5 never clashes" thesis is okay by me. But to say that,
therefore, the first 32 bits of MD5 never clash, is not okay.
Using a function yielding n distinct hash values, the probability is
1/n that they clash, thus 1-1/n that they don't clash. For p
independent pairs, the chances are (1-1/n)^p that none clash. For a
set of d dies, there are d(d-1) pairs. So in our case, using 32 bits
of MD5 for 10000 distinct dies, the chance of no clashes are
(1-1/(2^32))^(10000(10000-1)), or 0.977. Thus a 2.3% chance of a
clash. This sucks.
Using 64 bits, as you intended, is much better, but why not use the
full 128? Honestly, why play this kind of game at all? It's not that
hard to do the job perfectly --- just use the significant die contents
themselves as the tree key. No collisions, no hash function, simpler.
> > - Why are you using a splay tree keyed by MD5 hash values, instead of
> > a hash table? The only advantage of a splay over a hash table is
> > ordered retrieval, but your keys don't have any meaningful order.
> I hate the hash table interface for libiberty, it's annoying to use,
> IMHO.
> Other than that, no reason. I was meaning to make a nice generic
> dictionary interface, and talked with DJ about whether i could just
> use libdict, but since it's not GPL, and i don't have time to
> duplicate it's nice interface right now, i just went with what was
> easiest to program.
Okay.
> > - Assuming MD5 is perfect, you don't checksum all attribute forms
> > (DW_FORM_flag, for example). This means that you can get false
> > duplicates, if two dies differ only in (say) a flag's value. How is
> > read_comp_unit_dies set up to tolerate that?
> I only checksummed the stuff gcc generates (since i based the
> checksumming code on gcc's), and I forgot that gcc
> doesn't generate some forms.
GCC does generate DW_FORM_flag:
.byte 0x2 ; uleb128 0x2; (abbrev code)
.byte 0x2e ; uleb128 0x2e; (TAG: DW_TAG_subprogram)
.byte 0x1 ; DW_children_yes
.byte 0x1 ; uleb128 0x1; (DW_AT_sibling)
.byte 0x13 ; uleb128 0x13; (DW_FORM_ref4)
.byte 0x3f ; uleb128 0x3f; (DW_AT_external)
.byte 0xc ; uleb128 0xc; (DW_FORM_flag)
.byte 0x3 ; uleb128 0x3; (DW_AT_name)
.byte 0x8 ; uleb128 0x8; (DW_FORM_string)
.byte 0x3a ; uleb128 0x3a; (DW_AT_decl_file)
.byte 0xb ; uleb128 0xb; (DW_FORM_data1)
.byte 0x3b ; uleb128 0x3b; (DW_AT_decl_line)
.byte 0xb ; uleb128 0xb; (DW_FORM_data1)
.byte 0x27 ; uleb128 0x27; (DW_AT_prototyped)
.byte 0xc ; uleb128 0xc; (DW_FORM_flag)
.byte 0x49 ; uleb128 0x49; (DW_AT_type)
.byte 0x13 ; uleb128 0x13; (DW_FORM_ref4)
.byte 0x11 ; uleb128 0x11; (DW_AT_low_pc)
.byte 0x1 ; uleb128 0x1; (DW_FORM_addr)
.byte 0x12 ; uleb128 0x12; (DW_AT_high_pc)
.byte 0x1 ; uleb128 0x1; (DW_FORM_addr)
.byte 0x40 ; uleb128 0x40; (DW_AT_frame_base)
.byte 0xa ; uleb128 0xa; (DW_FORM_block1)
.byte 0x0
.byte 0x0
> > - In read_comp_unit_dies, when you find a duplicate die, you skip to
> > its sibling. What if the parent die is identical, but the children
> > dies differ?
>
> I don't believe this is possible in any language.
How about this:
namespace X {
namespace A {
int m, n;
}
}
namespace Y {
namespace A {
int o, p;
}
}
The dies for the two 'A' namespaces have the name DW_AT_name
attribute, but they're clearly different.
Just in principle, this seems sloppy. Dies are just arbitrary tree
nodes. There's no reason to assume that if two parent nodes are
identical, we can skip the second one and all its children.
More information about the Gdb-patches
mailing list