
Re: hash table subtleties


Jay Glascoe <jglascoe@jay.giss.nasa.gov> writes:

 > On Wed, 28 Oct 1998, Tel wrote:
 > 
 > > > I'd like to also repeat what I mentioned elsewhere - namely if you
 > > > shrink the tables on deletions you can easily get the bad behavior of
 > > > resizing every few operations (on a mix of insertions & deletions)
 > > > causing the hash tables to give O(n) behavior instead of O(1).
 > > > They'll be slower than an alist, let alone a balanced tree.
 > > 
 > > Put some hysteresis into the grow and shrink. For example, when
 > > utilisation is greater than 3, grow, when it is less than 1, shrink.
 > > This means that resize cannot possibly occur every few operations.

Basically, but not exactly.  It depends on how much you grow the
tables.  If you grow & shrink by a factor of 3, then your example
wouldn't yield any hysteresis at all - growing would drop the
utilisation straight to the shrink threshold, and shrinking would push
it straight back to the grow threshold.  For a factor of 2 it'd be ok,
but the table wouldn't be in the same state after shrinking as after
growing, so I'd recommend a shrink threshold of 3/4.
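
To make that concrete, here's a toy calculation (Python, purely
illustrative; the thresholds & factors are the ones from this thread,
nothing to do with Guile's actual code):

    def load_after_grow(max_load, factor):
        # average bucket size immediately after growing at the max threshold
        return max_load / factor

    def load_after_shrink(min_load, factor):
        # average bucket size immediately after shrinking at the min threshold
        return min_load * factor

    # Tel's thresholds (grow above 3, shrink below 1) with a factor of 3:
    print(load_after_grow(3.0, 3), load_after_shrink(1.0, 3))    # -> 1.0 3.0
    # Growing leaves the table right at the shrink threshold & shrinking
    # leaves it right at the grow threshold, i.e. no hysteresis at all.

    # The same thresholds with a factor of 2, then with 3/4 as the minimum:
    print(load_after_grow(3.0, 2), load_after_shrink(1.0, 2))    # -> 1.5 2.0
    print(load_after_grow(3.0, 2), load_after_shrink(0.75, 2))   # -> 1.5 1.5
    # With a minimum of 3/4 either resize leaves the table in the same
    # state, geometrically halfway between the two thresholds.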

 > my tables grow (double the number of buckets) when the mean nonempty
 > bucket size exceeds 3, and shrink (halve the number of buckets) when the
 > mean nonempty bucket size is less than 3 * 119/256 = 1.395 (3 * 1/2 is too
 > big, 3 * 1/4 is too small, ...  3 * 951/2048 is very close)
 > 
 > The magic ratio is chosen so that the expected mean nonempty bucket size
 > after upsize is the same as it is after downsize.  (and then, e.g. if a
 > user inserts just enough entries for the table to grow to 4096 buckets, he
 > must delete half of them before it will shrink back down to 2048). 

Yes, this is the property you want, namely that
min_avg_bucket_size = max_avg_bucket_size/growth_factor^2.  So why do
you say 3/4 is too small?  If the avg bucket size is 3/4 & you halve
the number of buckets, then the mean bucket size will be 3/2, which is
exactly what you have after doubling when you've hit the
max_avg_bucket_size of 3.
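
Spelling that out (same ASCII notation as above, with the averages
taken over all buckets):

    load just after growing at the max threshold:
        max_avg_bucket_size / growth_factor
    load just after shrinking at the min threshold:
        min_avg_bucket_size * growth_factor

    requiring the two to be equal gives

        min_avg_bucket_size = max_avg_bucket_size / growth_factor^2

    e.g. max_avg_bucket_size = 3 & growth_factor = 2 give
    min_avg_bucket_size = 3/4, and either resize leaves the average
    bucket size at 3/2.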

The thing I'm most concerned about is averaging the nonempty bucket
sizes instead of all the bucket sizes.  If you resize on the average
bucket size then in the worst case (where everything hashes to the
same bucket) you'll have 1 bucket with 3*N items & a hash table of
size N, yielding O(N) lookup times.  If you resize on the average
*nonempty* bucket size, then your worst case table size is O(2^N)!!!
This is the difference between bad performance in the worst case &
running out of memory and crashing in the worst case.  If the input
data all lands in one bucket, then once that bucket passes
max_avg_bucket_size, you'll double the hash table on every insertion,
so N insertions demand a vector of O(2^N) buckets.
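
Here's a small sketch of that blowup (Python, purely illustrative;
it's a toy chained table, not Guile's implementation, and the hash
function is deliberately degenerate):

    MAX_NONEMPTY_MEAN = 3

    def insert(buckets, key):
        buckets[0].append(key)          # degenerate hash: every key collides
        nonempty = [b for b in buckets if b]
        mean_nonempty = sum(len(b) for b in nonempty) / len(nonempty)
        if mean_nonempty > MAX_NONEMPTY_MEAN:
            # "rehash" into twice as many buckets; redistributing is
            # pointless here since every key lands in bucket 0 anyway
            buckets = buckets + [[] for _ in buckets]
        return buckets

    buckets = [[], []]
    for i in range(20):
        buckets = insert(buckets, i)
        print(i + 1, "items ->", len(buckets), "buckets")

    # Every insertion past the third doubles the vector: 20 colliding keys
    # already demand 2^18 = 262144 buckets.  Averaging over *all* buckets
    # instead would only have grown the table to roughly count/3 buckets.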

This will happen (slightly more slowly) whenever you end up with a
small bounded number of distinct hash values.

When the hash table is well behaved it won't make a difference.  When
the hash table has some empty buckets (probably common), using
nonempty_bucket_size_average will resize a little sooner, but it
probably doesn't make much of a difference.  When the hash function is
clustering, then using nonempty_bucket_size_average will rehash much
sooner, keeping the alists shorter, but at the expense of having a
larger vector.  In the worst cases, using nonempty_bucket_size_average
causes catastrophic failure whereas using bucket_size_average will be
slow but not catastrophic.  Is the probably marginal speedup worth the
risk of catastrophic failure?

-- 
Harvey J. Stein
BFM Financial Research
hjstein@bfr.co.il