This is the mail archive of the mailing list for the GDB project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Re: (patch) hpjyg09: bcache optimizations

>>>>> "Jim" == Jim Blandy <> writes:
Jim> But this situation still sounds weird.  The bcache is a very
Jim> simple thing --- bcache.h and bcache.c total 289 lines of code.
Jim> You give it a string of bytes, and it either adds a copy of your
Jim> string to its hash table, or finds an identical copy already
Jim> present.  Either way, it hands you back a pointer to its stashed
Jim> copy.  So, a bcache is helpful if you expect a lot of duplicates.

This brings back memories.  

Many years ago, I discovered CVS had terrible hash behavior.  Date and
Revision strings were used as keys, and each character (of which there
were only digits, '.', and '/') was weighted equally.  I can't recall
exactly, but I think the original hash function resulted in ~10% bucket
utilization --- the remaining ~90% would not be used by any valid keys.
And even among the remaning ~10%, the distribution was non-uniform.

I changed it to one of the functions compared in the Red Dragon book's
discussion of hash functions and their suitability for use for hashing
identifiers (which tend to have a lot of duplication/redundency). This
change resulted in a uniform distribution across all buckets and a
cooresponding performance improvement.

For the longest time, I used to keep a histogram plot of the original
and new hash functions in a folder, and would show them to anyone who
would listen to the tale: switching from a nieve to a complex data
structure will not automatically result in a performance improvement.

So I agree with Jim, fixing bcache to use a better hash table is a
much better approach than microoptimizing the existing hash function
by making it an inline function or a macro.  Many thanks for spending
the time and tracking down the root cause.

For reference, here is hash function I used:

    unsigned int 
    hashpjw(const char* x)
      unsigned int h = 0;
      unsigned int g;

      while (*x != 0)
        h = (h << 4) + *x++;
        if ((g = h & 0xf0000000) != 0)
          h = (h ^ (g >> 24)) ^ g;
      return h;

It will have to be modified slightly for use in bcache, but it may
result in acceptable behavior in this case as well.


J.T. Conklin
RedBack Networks

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]